[Pacemaker] Fixed! - Re: Problem with dual-PDU fencing node with redundant PSUs

Dejan Muhamedagic dejanmm at fastmail.fm
Thu Jul 4 10:06:09 EDT 2013


On Tue, Jul 02, 2013 at 10:53:50AM -0400, Digimer wrote:
> On 07/02/2013 04:02 AM, Dejan Muhamedagic wrote:
> > On Mon, Jul 01, 2013 at 11:53:29AM -0400, Digimer wrote:
> >> On 07/01/2013 04:52 AM, Dejan Muhamedagic wrote:
> >>> Right. It is often missed that actually more than one failure is
> >>> required for that setup to fail. In case of dual PDU/PSU/UPS an
> >>> IPMI based fencing is sufficient.
> >>
> >> You are right, of course. Imagine though that the IPMI BMC's network
> >> port or cable could have silently failed some time before the node
> >> failed. Yes, this is two simultaneous failures so not an overall SPoF,
> >> but likely enough that it should be addressed.
> >>
> >> If you've already set up redundant power, then it strikes me as fairly
> >> easy to use your PDUs as a backup fence method.
> >>
> >> Now all this said, you'll note in the mailing lists and IRC that I don't
> >> tell people they should have two methods. If people set up just IPMI
> >> fencing, I am happy. It's a question of how careful do you want/need to
> >> be, after that. For me, one fence method is not enough.
> > 
> > I suppose that you're supporting a few clusters. How often does
> > it happen that nodes get fenced? And why? And did you, in those
> > cases, need to use the backup fence device?
> > 
> > Thanks,
> > 
> > Dejan
> 
> They occasionally get fenced, but it's very rare. Most were from an
> earlier configuration I no longer offer that were based on one switch
> (with redundant NICs in bond mode=1). The switch would hiccup and that
> would trigger fencing. Since I switched to dual switches, I've not had a
> network-triggered failure.
> 
> The most common problem I see, that my cluster saved people from, is
> power problems. These have never required fencing, but rather simply
> having two monitored UPSes has allowed us to detect pending
> catastrophic power failures (a transformer blew up three days after we
> started seeing alerts, a faulty regulator in a customer's neighborhood,
> etc).

Right, I'd also guess that power failures are the most common in
the hardware category.

> We've also saved a customer's entire (small) DC when they lost AC and
> their own alerts failed (we saw a sudden rise in inlet temp and alerted
> the client). One node at the top of the rack (out of four dual-node
> clusters) went into thermal shutdown and got fenced before we could shed
> enough load. They didn't lose any of their non-clustered servers though.
> 
> So to your question; have we ever needed the backup fencing in
> production? Nope, but I see it as just a matter of time. One user error,
> one bad UPS/battery pack, one tripped breaker and it will save us. When
> we demo our clusters to prospective customers, the most dramatic test we
> do is shut down the primary UPS. This takes out one of the switches, one
> of the dashboard appliances and forces the nodes to run on half their
> power. If this happened in production, then dual-PDUs would certainly
> save us.
> 
> Not my personal experience, but a sysadmin friend of mine had a case
> where a server's 12vDC wire was rubbing against a sharp piece of the
> chassis. Eventually it cut through the insulation and shorted out,
> taking the node's power off despite having redundant PSUs. Had this
> happened to our cluster, we'd have been saved by the backup fence device
> because the IPMI would have been lost.

There are also some lights-out devices with battery backup
providing power for enough time for fencing to succeed.

> I've got ten or so customers around North America and I've only been
> doing this for four years or so. That I have not *yet* been saved by
> backup fencing in no way means it is not needed. :)

I'd really be interested in numbers which we don't have, that is
how much extra availability in a fully redundant power supply
setup a backup fencing device provides.
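
Lacking real numbers, one rough way to frame the question is to
compare the chance that fencing fails outright with one method versus
two. The sketch below is only an illustration: the function name and
both probabilities are hypothetical, not measured data, and it assumes
the two methods fail independently, which a shared cause such as a
management network outage would violate.

```python
def fence_failure_probability(method_failure_probs):
    """Chance that every fence method in the list fails.

    Assumes the methods fail independently of each other, which is a
    simplification made only for this illustration.
    """
    p = 1.0
    for q in method_failure_probs:
        p *= q
    return p


# Hypothetical figures, picked only to show the arithmetic.
p_ipmi_fails = 0.01  # IPMI BMC unreachable at the moment a fence is needed
p_pdu_fails = 0.01   # PDU-based fencing fails when it is tried as a backup

ipmi_only = fence_failure_probability([p_ipmi_fails])
ipmi_then_pdu = fence_failure_probability([p_ipmi_fails, p_pdu_fails])

print(f"IPMI only:      {ipmi_only:.6f}")
print(f"IPMI, then PDU: {ipmi_then_pdu:.6f}")
```

This only estimates how often a fence attempt would fail, not overall
availability, which would also depend on how often a fence is needed
in the first place.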

Of course, taking every possible precaution is commendable, but
in this case it seems like it introduces a level of complexity
which is hard to grasp for most people (even those running
clusters).

Thanks,

Dejan

> -- 
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?



