[Pacemaker] Fixed! - Re: Problem with dual-PDU fencing node with redundant PSUs

Thu Jul 4 15:09:35 UTC 2013

On 04/07/13 10:06, Dejan Muhamedagic wrote:
> On Tue, Jul 02, 2013 at 10:53:50AM -0400, Digimer wrote:
>> On 07/02/2013 04:02 AM, Dejan Muhamedagic wrote:
>>> On Mon, Jul 01, 2013 at 11:53:29AM -0400, Digimer wrote:
>>>> On 07/01/2013 04:52 AM, Dejan Muhamedagic wrote:
>>>>> Right. It is often missed that actually more than one failure is
>>>>> required for that setup to fail. In case of dual PDU/PSU/UPS an
>>>>> IPMI based fencing is sufficient.
>>>>
>>>> You are right, of course. Imagine though that the IPMI BMC's network
>>>> port or cable could have silently failed some time before the node
>>>> failed. Yes, this is two simultaneous failures so not an overall SPoF,
>>>> but likely enough that it should be addressed.
>>>>
>>>> If you've already set up redundant power, then it strikes me as fairly
>>>> easy to use your PDUs as a backup fence method.
>>>>
>>>> Now all this said, you'll note in the mailing lists and IRC that I don't
>>>> tell people they should have two methods. If people set up just IPMI
>>>> fencing, I am happy. It's a question of how careful do you want/need to
>>>> be, after that. For me, one fence method is not enough.
>>>
>>> I suppose that you're supporting a few clusters. How often does
>>> it happen that nodes get fenced? And why? And did you, in those
>>> cases, need to use the backup fence device?
>>>
>>> Thanks,
>>>
>>> Dejan
>>
>> They occasionally get fenced, but it's very rare. Most were from an
>> earlier configuration I no longer offer that were based on one switch
>> (with redundant NICs in bond mode=1). The switch would hiccup and that
>> would trigger fencing. Since I switched to dual switches, I've not had a
>> network-triggered failure.
>>
>> The most common problem I see, that my cluster saved people from, is
>> power problems. These have never required fencing, but rather simply
>> having two monitored UPSes has allowed us to detect pending
>> catastrophic power failures (a transformer blew up three days after we
>> started seeing alerts, a faulty regulator in a customer's neighborhood,
>> etc).
>
> Right, I'd also guess that power failures are the most common in
> the hardware category.
>
>> We've also saved a customer's entire (small) DC when they lost AC and
>> their own alerts failed (we saw a sudden rise in inlet temp and alerted
>> the client). One node at the top of the rack (out of four dual-node
>> clusters) went into thermal shutdown and got fenced before we could shed
>> enough load. They didn't lose any of their non-clustered servers though.
>>
>> So, to your question: have we ever needed the backup fencing in
>> production? Nope, but I see it as just a matter of time. One user error,
>> one bad UPS/battery pack, one tripped breaker and it will save us. When
>> we demo our clusters to prospective customers, the most dramatic test we
>> do is shut down the primary UPS. This takes out one of the switches, one
>> of the dashboard appliances and forces the nodes to run on half their
>> power. If this happened in production, then dual-PDUs would certainly
>> save us.
>>
>> Not my personal experience, but a sysadmin friend of mine had a case
>> where a server's 12vDC wire was rubbing against a sharp piece of the
>> chassis. Eventually it cut through the insulation and shorted out,
>> taking the node's power off despite having redundant PSUs. Had this
>> happened to our cluster, we'd have been saved by the backup fence device
>> because the IPMI would have been lost.
>
> There are also some lights-out devices with battery backup
> providing power for enough time for fencing to succeed.
>
>> I've got ten or so customers around North America and I've only been
>> doing this for four years or so. That I have not *yet* been saved by
>> backup fencing in no way means it is not needed. :)
>
> I'd really be interested in numbers which we don't have, that is
> how much extra availability in a fully redundant power supply
> setup a backup fencing device provides.
>
> Of course, taking every possible precaution is commendable, but
> in this case it seems like it introduces a level of complexity
> which is hard to grasp for most people (even those running
> clusters).
>
> Thanks,
>
> Dejan

Much like security, performance and other concerns, it's up to each user
to find their balance point. For me and my customers, redundant 
everything is required. For many others, perhaps it isn't.

As for the numbers, I would *love* to have those as well. Shy of some
self-reporting system where HA admins fill out forms after incidents 
though, I don't see how we could ever gather that data. Even then, it 
will never be mandatory, obviously, so the results would be skewed by 
the personality type of people willing and able to take the time to 
submit those anonymous reports.
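
Lacking real data, the most anyone can do is a back-of-envelope sketch. 
The one below is illustrative only: the function name and both 
probabilities are made-up placeholders, not measurements from any 
cluster, and it assumes the two fence methods fail independently, which 
may not hold (a shared switch, for example, could take out both).

```python
# Illustrative only: every probability here is a hypothetical placeholder,
# not a measurement from any real cluster.

def fence_failure_probability(method_failure_probs):
    """Chance that every fence method in the list fails.

    Assumes the methods fail independently of each other, which is a
    simplification (shared network paths or power could correlate them).
    """
    p = 1.0
    for prob in method_failure_probs:
        p *= prob
    return p

# Hypothetical chance each method is unusable at the moment a fence is needed.
p_ipmi_unavailable = 0.01
p_pdu_unavailable = 0.01

ipmi_only = fence_failure_probability([p_ipmi_unavailable])
ipmi_then_pdu = fence_failure_probability([p_ipmi_unavailable, p_pdu_unavailable])

print(f"IPMI only:     fence fails with probability {ipmi_only:.6f}")
print(f"IPMI then PDU: fence fails with probability {ipmi_then_pdu:.6f}")
```

Whatever the true inputs are, the shape of the result is the same: the 
backup method only matters in the cases where the primary has already 
failed, which is exactly why the real-world numbers are so hard to come by.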

cheers!

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?