[ClusterLabs] Restart of 2-node cluster causes split brain ?
Stefan Schloesser
sschloesser at enomic.com
Mon May 11 07:17:17 UTC 2015
Hi Ken,
Thanks for the pointer, that solved the problem. The quorum{} section seems to be new in Ubuntu 14.04.
Stefan
-----Original Message-----
From: Ken Gaillot [mailto:kgaillot at redhat.com]
Sent: Friday, 8 May, 2015 18:41
To: users at clusterlabs.org
Subject: Re: [ClusterLabs] Restart of 2-node cluster causes split brain ?
On 05/07/2015 04:25 AM, Stefan Schloesser wrote:
> Hi,
>
> I have a 2-node DRBD cluster. If one node is killed for some reason, I am unable to restart the cluster without a split brain. I am running Ubuntu 14.04. This is what happens:
> After rebooting the downed node (named sec), I start corosync and pacemaker on it. Sec then immediately kills prim without waiting for DRBD to sync, and starts all services on itself, causing the split brain.
>
> In the log I see:
> pengine: info: determine_online_status_fencing: Node sec is active
> May 07 09:25:08 [7061] sec pengine: info: determine_online_status: Node sec is online
> May 07 09:25:08 [7061] sec pengine: info: native_print: stonith_sec (stonith:external/hetzner): Stopped
> May 07 09:25:08 [7061] sec pengine: info: native_print: stonith_prim (stonith:external/hetzner): Stopped
> May 07 09:25:08 [7061] sec pengine: info: native_print: ip (ocf::kumina:hetzner-failover-ip): Stopped
> May 07 09:25:08 [7061] sec pengine: info: clone_print: Master/Slave Set: ms_drbd [drbd]
> May 07 09:25:08 [7061] sec pengine: info: short_print: Stopped: [ prim sec ]
> May 07 09:25:08 [7061] sec pengine: info: native_print: fs (ocf::heartbeat:Filesystem): Stopped
> May 07 09:25:08 [7061] sec pengine: info: native_print: mysql (ocf::heartbeat:mysql): Stopped
> May 07 09:25:08 [7061] sec pengine: info: native_print: apache (ocf::heartbeat:apache): Stopped
> May 07 09:25:08 [7061] sec pengine: info: native_color: Resource stonith_sec cannot run anywhere
> May 07 09:25:08 [7061] sec pengine: info: native_color: Resource drbd:1 cannot run anywhere
> May 07 09:25:08 [7061] sec pengine: info: master_color: ms_drbd: Promoted 0 instances of a possible 1 to master
> May 07 09:25:08 [7061] sec pengine: info: rsc_merge_weights: fs: Rolling back scores from mysql
> May 07 09:25:08 [7061] sec pengine: info: native_color: Resource fs cannot run anywhere
> May 07 09:25:08 [7061] sec pengine: info: rsc_merge_weights: mysql: Rolling back scores from apache
> May 07 09:25:08 [7061] sec pengine: info: native_color: Resource mysql cannot run anywhere
> May 07 09:25:08 [7061] sec pengine: info: rsc_merge_weights: apache: Rolling back scores from ip
> May 07 09:25:08 [7061] sec pengine: info: native_color: Resource apache cannot run anywhere
> May 07 09:25:08 [7061] sec pengine: info: native_color: Resource ip cannot run anywhere
> May 07 09:25:08 [7061] sec pengine: info: RecurringOp: Start recurring monitor (3600s) for stonith_prim on sec
> May 07 09:25:08 [7061] sec pengine: info: RecurringOp: Start recurring monitor (31s) for drbd:0 on sec
> May 07 09:25:08 [7061] sec pengine: info: RecurringOp: Start recurring monitor (31s) for drbd:0 on sec
> May 07 09:25:08 [7061] sec pengine: warning: stage6: Scheduling Node prim for STONITH
>
> Version Info:
> 14.04: corosync Version: 2.3.3-1ubuntu1
> Pacemaker Version: 1.1.10+git20130802-1ubuntu2.3
> 12.04: corosync Version: 1.4.2-2ubuntu0.2
> Pacemaker Version: 1.1.6-2ubuntu3.3
>
> I run other clusters with an identical setup on Ubuntu 12.04 without such problems, so I believe something major has changed between the versions that I missed.
> Maybe in the original reboot both nodes wanted to kill each other: prim won the race, but sec remembered that it wanted to kill prim and does so at the first possible opportunity, i.e. on restart. Would that be possible? If so, how can I stop this behavior?
I do not know whether this is your issue, but for two-node clusters using corosync 2, corosync.conf should have "two_node: 1" in the quorum{} section. That implies "wait_for_all" which prevents some fencing loops. See the votequorum(5) man page for details.
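
For illustration, a minimal sketch of what that quorum{} section might look like in /etc/corosync/corosync.conf. The "provider" line is an assumption about the rest of the setup (votequorum is the quorum provider that two_node belongs to); the remainder of corosync.conf (totem, nodelist, logging) is not shown and would stay as it is:

```text
# Sketch of the quorum section for a two-node corosync 2 cluster.
quorum {
    # Assumed: votequorum is the quorum provider in use.
    provider: corosync_votequorum

    # Two-node mode. This implicitly enables wait_for_all, so a freshly
    # started node does not become quorate (and so cannot fence its peer)
    # until both nodes have been seen at the same time at least once.
    two_node: 1
}
```

The change has to be made on both nodes, and corosync has to be restarted for it to take effect.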