[Pacemaker] Fixed! - Re: Problem with dual-PDU fencing node with redundant PSUs
Digimer
lists at alteeve.ca
Tue Jul 2 14:53:50 UTC 2013
On 07/02/2013 04:02 AM, Dejan Muhamedagic wrote:
> On Mon, Jul 01, 2013 at 11:53:29AM -0400, Digimer wrote:
>> On 07/01/2013 04:52 AM, Dejan Muhamedagic wrote:
>>> Right. It is often missed that actually more than one failure is
>>> required for that setup to fail. In case of dual PDU/PSU/UPS an
>>> IPMI based fencing is sufficient.
>>
>> You are right, of course. Imagine though that the IPMI BMC's network
>> port or cable could have silently failed some time before the node
>> failed. Yes, this is two simultaneous failures so not an overall SPoF,
>> but likely enough that it should be addressed.
>>
>> If you've already set up redundant power, then it strikes me as fairly
>> easy to use your PDUs as a backup fence method.
>>
>> Now all this said, you'll note in the mailing lists and IRC that I don't
>> tell people they should have two methods. If people set up just IPMI
>> fencing, I am happy. It's a question of how careful do you want/need to
>> be, after that. For me, one fence method is not enough.
>
> I suppose that you're supporting a few clusters. How often does
> it happen that nodes get fenced? And why? And did you in those
> cases need to use the backup fence device?
>
> Thanks,
>
> Dejan
They occasionally get fenced, but it's very rare. Most fences came from
an earlier configuration I no longer offer that was based on one switch
(with redundant NICs in bond mode=1). The switch would hiccup and that
would trigger fencing. Since I switched to dual switches, I've not had a
network-triggered failure.

The most common problem I see, and one my clusters have saved people
from, is power problems. These have never required fencing; rather,
simply having two monitored UPSes has allowed us to detect pending
catastrophic power failures (a transformer blew up three days after we
started seeing alerts, a faulty regulator in a customer's neighborhood,
etc.).

We've also saved a customer's entire (small) DC when they lost AC and
their own alerts failed (we saw a sudden rise in inlet temp and alerted
the client). One node at the top of the rack (out of four dual-node
clusters) went into thermal shutdown and got fenced before we could shed
enough load. They didn't lose any of their non-clustered servers though.

So, to your question: have we ever needed the backup fencing in
production? Nope, but I see it as just a matter of time. One user error,
one bad UPS/battery pack, one tripped breaker and it will save us. When
we demo our clusters to prospective customers, the most dramatic test we
do is shut down the primary UPS. This takes out one of the switches and
one of the dashboard appliances, and forces the nodes to run on half
their power. If this happened in production, then dual PDUs would
certainly save us.

Not my personal experience, but a sysadmin friend of mine had a case
where a server's 12 VDC wire was rubbing against a sharp piece of the
chassis. Eventually it cut through the insulation and shorted out,
taking the node's power off despite its redundant PSUs. Had this
happened to our cluster, we'd have been saved by the backup fence device
because the IPMI would have been lost along with the node's power.

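The fallback order discussed in this thread (IPMI first, then the PDUs
as a backup) can be sketched as below. This is only an illustration and
not Pacemaker's or any fence agent's actual code: `fence_node` and the
stand-in actions `ipmi_off`, `pdu1_off` and `pdu2_off` are hypothetical
names, and a real cluster would call its configured fence agents
instead. The sketch assumes that a node with redundant PSUs only counts
as fenced by the PDU level when the outlets on both PDUs are off.

```python
from typing import Callable, Sequence

# Hypothetical type: a fence action returns True only if it confirmed
# that its part of the job (BMC power-off, outlet off) succeeded.
FenceAction = Callable[[], bool]


def fence_node(levels: Sequence[Sequence[FenceAction]]) -> int:
    """Try each fence level in order and return the 1-based index of the
    first level whose actions ALL succeed.

    Raises RuntimeError if every level fails, because the node's state
    is then unknown and recovery must not proceed.
    """
    for index, actions in enumerate(levels, start=1):
        # all() stops at the first failing action, so the level fails.
        if all(action() for action in actions):
            return index
    raise RuntimeError("all fence levels failed; node state unknown")


# Stand-ins for the shorted-wire case: the BMC lost power with the node,
# so IPMI cannot answer, but both PDUs are still reachable.
def ipmi_off() -> bool:
    return False


def pdu1_off() -> bool:
    return True


def pdu2_off() -> bool:
    return True


level_used = fence_node([[ipmi_off], [pdu1_off, pdu2_off]])
print(level_used)  # prints 2: the PDU level did the fencing
```

With only the IPMI level configured, the same call would raise instead
of returning, which is the situation the backup method is meant to
avoid.
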
I've got ten or so customers around North America and I've only been
doing this for four years or so. That I have not *yet* been saved by
backup fencing in no way means it is not needed. :)
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?