[Pacemaker] DRBD and fencing
Matthew Palmer
mpalmer at hezmatt.org
Fri Mar 12 09:55:02 UTC 2010
On Thu, Mar 11, 2010 at 05:26:19PM +0800, Martin Aspeli wrote:
> Matthew Palmer wrote:
>> On Thu, Mar 11, 2010 at 03:34:50PM +0800, Martin Aspeli wrote:
>>> I was wondering, though, if fencing at the DRBD level would get around
>>> the possible problem of a full power outage taking the fencing device
>>> down.
>>>
>>> In my poor understanding of things, it'd work like this:
>>>
>>> - Pacemaker runs on master and slave
>>> - Master loses all power
>>> - Pacemaker on slave notices something is wrong, and prepares to start
>>> up postgres on slave, which will now also be the one writing to the DRBD
>>> disk
>>> - Before it can do that, it wants to fence off DRBD
>>> - It does that by saying to the local DRBD, "even if the other node
>>> tries to send you stuff, ignore it". This would avoid the risk of data
>>> corruption on slave. Before master could come back up, it'd need to wipe
>>> its local partition and re-sync from slave (which is now the new
>>> primary).
>>
>> The old master shouldn't need to "wipe" anything, as it should have no data
>> that the new master didn't have at the time of the power failure.
>
> I was just thinking that if the failure was, e.g., the connection
> between master and the rest of the cluster, postgres on the old master
> could stay up and merrily keep writing to the filesystem on the DRBD.
That can't happen, because the cluster manager should fence the "failed"
node before it mounts the filesystem on the other node.
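For completeness: the usual way to wire DRBD into that is resource-level
fencing via the crm-fence-peer handler.  A sketch, assuming a DRBD
8.3-style config -- the resource name "r0" is just an example, and the
handler paths are wherever your DRBD packages put them:

  resource r0 {
    disk {
      # call the fence-peer handler before promoting while disconnected
      fencing resource-only;
    }
    handlers {
      # adds a CIB constraint so the stale peer can't become Primary
      fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
      # removes that constraint again once resync has completed
      after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
  }

With that in place, a disconnected node runs the handler (and so bans its
peer from the Primary role) before it will let itself be promoted.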
> In the case of power failure, that wouldn't happen, of course. But in
> case of total power failure, the fencing device (an IPMI device, Dell
> DRAC) would be inaccessible too, so the cluster would not fail postgres
> over.
Hence the need for a real STONITH device if you want true reliability.
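In Pacemaker terms that's just a stonith primitive per node.  Roughly, in
crm shell syntax, using the external/ipmi plugin -- the addresses and
credentials below are made up, and "stonith -t external/ipmi -n" will list
the exact parameters your version expects:

  primitive st-node1 stonith:external/ipmi \
          params hostname="node1" ipaddr="192.168.1.101" \
                 userid="stonith" passwd="secret" interface="lan" \
          op monitor interval="60s"
  # never run node1's fencing device on node1 itself
  location l-st-node1 st-node1 -inf: node1

...and the mirror image of that for node2.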
>> In the case you suggest, where the whole of node "A" disappears, you may
>> well have a fencing problem: because node "B" can't positively confirm that
>> "A" is, in fact, dead (because the DRAC went away too), it may refuse to
>> confirm the fencing operation (this is why using DRAC/IPMI as a STONITH
>> device isn't such a win).
>
> From what I'm reading, the only fencing device that's truly good is a
> UPS that can cut power to an individual device. Unfortunately, we don't
> have such a device and can't get one. We do have a UPS with a backup
> generator, and dual PSUs, so total power outage is unlikely. But someone
> could also just pull the (two) cables out of the UPS and Pacemaker would
> be none the wiser.
Managed power rails are also pretty good STONITH devices.
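If you can get a switched PDU in there, the config is much the same shape.
A sketch, assuming the apcmastersnmp plugin -- the address and community
string are made up, and the parameter names may differ between plugins and
versions ("stonith -t apcmastersnmp -n" will tell you):

  primitive st-pdu stonith:apcmastersnmp \
          params ipaddr="192.168.1.10" port="161" community="private" \
          op monitor interval="60s"

The important property is that the PDU keeps running (and answering) off
the UPS even when the node it is about to cut power to is completely dead.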
> What I don't get is, if this happens, why can't slave just say, "I'm
> going to assume master is gone and take over postgres, and I'm not going
> to let anyone else write anything to my disk". In my mind, this is
> similar to having a shared SAN and having the fencing operation be "node
> master is no longer allowed to mount or write to the SAN disk, even if
> it tries".
You can't do that because it is the very definition of "split brain" --
without positive confirmation that the other node is, actually, dead, both
nodes can think the other one is dead and that it is the only living node.
The shared SAN is a completely different situation, because you have a
*single* device that is capable of deciding who can use it, whereas there is
no single device with DRBD (which has the benefit of having no single point
of failure).
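The closest DRBD-level equivalent to that "ban the other node" operation
is the crm-fence-peer constraint mentioned above: on losing its peer, the
surviving side drops something like this into the CIB (the resource and
node names here are placeholders):

  location drbd-fence-by-handler-ms_drbd0 ms_drbd0 \
          rule $role="Master" -inf: #uname ne node-b

i.e. "nothing but node-b may be Master" until the peer has resynced.  But
that only constrains nodes that are still talking to the cluster; it
doesn't replace a STONITH device.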
- Matt