[Pacemaker] Getting split brain after every reboot of a cluster node

Digimer lists at alteeve.ca
Thu Mar 6 14:44:32 EST 2014


On 06/03/14 12:11 PM, Anne Nicolas wrote:
> On 06/03/2014 10:12, Gianluca Cecchi wrote:
>> On Wed, Mar 5, 2014 at 9:28 AM, Anne Nicolas  wrote:
>>> Hi
>>>
>>> I'm having trouble setting up a very simple two-node cluster. After
>>> every reboot I get a split brain that I then have to resolve by hand.
>>> Looking for a solution to that...
>>>
>>> Both nodes have 4 network interfaces. We use 3 of them: one for an IP
>>> cluster, one for a bridge for a vm and the last one for the private
>>> network of the cluster
>>>
>>> I'm using
>>> drbd : 8.3.9
>>> drbd-utils: 8.3.9
>>>
>>> DRBD configuration:
>>> ============
>>> $ cat global_common.conf
>>> global {
>>>          usage-count no;
>>>          disable-ip-verification;
>>>   }
>>> common { syncer { rate 500M; } }
>>>
>>> cat server.res
>>> resource server {
>>>          protocol C;
>>>          net {
>>>                   cram-hmac-alg sha1;
>>>                   shared-secret "eafcupps";
>>>              }
>>>   on dzacupsvr {
>>>      device     /dev/drbd0;
>>>      disk       /dev/vg0/server;
>>>      address    172.16.1.1:7788;
>>>      flexible-meta-disk  internal;
>>>    }
>>>    on dzacupsvr2 {
>>>      device     /dev/drbd0;
>>>      disk       /dev/vg0/server;
>>>      address    172.16.1.2:7788;
>>>      flexible-meta-disk  internal;
>>>    }
>>> }
>>>
>>
>> [snip]
>>
>>>
>>> After looking for more information, I've added fence handlers to the
>>> DRBD configuration
>>>
>>> handlers {
>>>      fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>>>      after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>>>    }
>>> but still without any success...
>>>
>>> Any help appreciated
>>>
>>> Cheers
>>>
>>> --
>>> Anne
>>
>> Hello Anne,
>> for sure follow the stonith advice from digimer and emmanuel.
>> As a starting point I think you can add this part in your resource
>> definition that seems missing at the moment:
>>
>> resource <resource> {
>>    disk {
>>      fencing resource-only;
>>      ...
>>    }
>> }
>>
>> This should cleanly handle a shutdown of the cluster's nodes and some
>> failure scenarios. But it doesn't completely protect you from data
>> corruption in some cases (such as the intercommunication network
>> suddenly going down and coming back up with both nodes active, where
>> both could become primary at some point).
>>
>> At least this worked for me during initial tests before stonith
>> configuration with
>> SLES 11 sp2 (corosync/pacemaker)
>> CentOS 6.5 (cman/pacemaker)
>>
>
> Thanks a lot for all this information. I'm setting it all up to make
> it work properly. Some parts were indeed missing and not obvious to me.
>
> One more question about something I can hardly understand. What should
> happen, and how should it be managed, when the private link between the
> nodes goes down for some reason? When that happens, crm status shows
> both nodes as masters. Is there a way to handle this kind of failure?
>
> Thanks in advance

When stonith is configured properly, both nodes will block and a 
stonith/fence action will be called. The faster node will survive and 
the slower node will shut down. Once the survivor confirms the peer's 
death, it will recover and return to normal operation.

Two notes:

1. Please set 'fencing resource-and-stonith;' in DRBD, not just 
'resource-only'. The former tells DRBD to block I/O and call its fence 
handler, and to stay blocked until the fence handler returns success, 
ensuring no split-brain.
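
For reference, a minimal sketch of how the combined settings might look 
in the resource file (using Anne's resource name; the handler scripts 
are the stock ones shipped with drbd-utils):

resource server {
         disk {
                  # Block I/O and call the fence handler; stay blocked
                  # until the handler reports success.
                  fencing resource-and-stonith;
         }
         handlers {
                  fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                  after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
         }
         ...
}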

2a. If you have separate fence devices, as you do with IPMI-based 
fencing, it's possible for both nodes to successfully fence each other, 
leaving both nodes down. Add the 'delay="15"' attribute to the fence 
device of the node you want to *win* a failed link (the delay tells the 
other node to wait that many seconds before fencing the given node, 
giving it a head-start in its own fence call).
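
As a sketch in crm shell syntax (the hostnames match Anne's config, but 
the IPMI addresses and credentials here are made up; pcs users would 
pass delay=15 the same way):

primitive fence-dzacupsvr stonith:fence_ipmilan \
         params pcmk_host_list="dzacupsvr" ipaddr="192.168.100.1" \
                login="admin" passwd="secret" delay="15" \
         op monitor interval="60s"
primitive fence-dzacupsvr2 stonith:fence_ipmilan \
         params pcmk_host_list="dzacupsvr2" ipaddr="192.168.100.2" \
                login="admin" passwd="secret" \
         op monitor interval="60s"

Only one of the two devices gets the delay; that node wins a fence race.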

2b. Second, if you are using IPMI, the fence action sends a power-off 
ACPI event to the host. If the host has 'acpid' running, this will 
initiate a graceful shutdown. The IPMI BMC will hold the power button, 
so after 4 seconds the node will power off regardless, but that's four 
seconds during which stonithd could get its own fence call out before 
dying. With acpid disabled, most servers react to a power button event 
by powering off instantly, significantly reducing the chance that the 
slower node gets a fence call out before dying.
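
On RHEL/CentOS 6, for example, disabling acpid persistently might look 
like this (assuming the stock sysvinit service name):

# Stop acpid now and keep it from starting on boot, so a power-button
# event powers the node off immediately instead of shutting down cleanly.
service acpid stop
chkconfig acpid off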

2b is particularly important when the link between the nodes and the 
fence devices fails. In that case, the delay="15" expires and both 
nodes sit there trying to fence the other; when the switch/network 
finally recovers, both fire their fence calls right away and the delay 
does nothing to prevent a dual fence. This is why turning off the acpid 
daemon is still worthwhile with IPMI-based fencing, despite the delay.

Cheers!

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?
