[Pacemaker] Not unmoving colocated resources can provoke DRBD split-brain

Wed Jun 11 15:14:59 CEST 2014

Hi Andrew,

On 02.06.2014 02:57, Andrew Beekhof wrote:

>> This seems to be some kind of a race condition: I added
>> 	sleep 3
>> to a central point in /usr/lib/ocf/resource.d/linbit/drbd.
> 
> Define central?

=======================================================================
$ diff -u drbd.orig drbd

--- drbd.orig    2014-06-11 14:02:57.000000000 +0200
+++ drbd 2014-06-10 16:37:59.000000000 +0200
@@ -1047,6 +1047,11 @@
 # Everything except usage and meta-data must pass the validate test
 drbd_validate_all || exit

+if $USE_DEBUG_LOG ; then
+       echo OCF_ACTION=$__OCF_ACTION `date` >&9
+       sleep 3
+fi
+
 case $__OCF_ACTION in
 start)
        drbd_start
=======================================================================

>> 1.) Note the parallel "start" at 15:46:53. This could very well end up
>> in a race condition without "sleep 3".
>>
>> 2.) Why is pacemaker doing "stop/start" at all on korfwf02?
> 
> This seems to correspond to 
> 
> May 23 13:29:31 korfwm01 pengine[5140]:   notice: LogActions: Move    stonith-korfwf02	(Started korfwm01 -> korfwf01)
> May 23 13:29:31 korfwm01 pengine[5140]:   notice: LogActions: Move    ALL-ffm	(Started korfwf02 -> korfwf01)
> May 23 13:29:31 korfwm01 pengine[5140]:   notice: LogActions: Demote  DRBD-ffm:0	(Master -> Slave korfwf02)
> May 23 13:29:31 korfwm01 pengine[5140]:   notice: LogActions: Restart DRBD-ffm:0	(Slave korfwf02)
> May 23 13:29:31 korfwm01 pengine[5140]:   notice: LogActions: Start   DRBD-ffm:1	(korfwf01)
> May 23 13:29:31 korfwm01 pengine[5140]:   notice: LogActions: Promote DRBD-ffm:1	(Stopped -> Master korfwf01)
> May 23 13:29:31 korfwm01 pengine[5140]:   notice: process_pe_message: Calculated Transition 843: /var/lib/pacemaker/pengine/pe-input-728.bz2
> 
> from your original tarball.
> 
> In that case, the cause is:
> 
>       <rsc_order id="ord-ALL-ffm-before-DRBD-ffm" score="INFINITY" first="ALL-ffm" then="ms-DRBD-ffm"/>
> 
> Which requires that ms-DRBD-ffm be completely stopped if ALL-ffm is stopped (which it is because its being moved to 01).
> Perhaps you meant this?
> 
>       <rsc_order id="ord-ALL-ffm-before-DRBD-ffm" score="INFINITY" first="ALL-ffm" then="ms-DRBD-ffm" then-action="promote"/>

I tried that. It triggered another race condition.

=======================================================================
primitive DRBD-ffm ocf:linbit:drbd params drbd_resource=ffm \
 op start interval=0 timeout=240 \
 op promote interval=0 timeout=90 \
 op demote interval=0 timeout=90 \
 op notify interval=0 timeout=90 \
 op stop interval=0 timeout=100 \
 op monitor role=Slave timeout=20 interval=20 \
 op monitor role=Master timeout=20 interval=10
ms ms-DRBD-ffm DRBD-ffm meta master-max=1 master-node-max=1 \
 clone-max=2 clone-node-max=1 notify=true
colocation coloc-ms-DRBD-ffm-follows-ALL-ffm inf: \
 ms-DRBD-ffm:Master ALL-ffm
order ord-ALL-ffm-before-DRBD-ffm inf: ALL-ffm ms-DRBD-ffm:promote
location loc-ms-DRBD-ffm-korfwm01 ms-DRBD-ffm -inf: korfwm01
location loc-ms-DRBD-ffm-korfwm02 ms-DRBD-ffm -inf: korfwm02
=======================================================================

# crm node standby korfwf01 ; sleep 10
# crm node online korfwf01 ; sleep 10
# crm resource move ALL-ffm korfwf01 ; sleep 10
# crm node standby korfwf01 ; sleep 10
# crm node online korfwf01 ; sleep 10
*bang* split-brain.

This is because, with the last command "online korfwf01", pacemaker starts
and then immediately promotes ms-DRBD-ffm without giving drbd any time
to sync with the peer. Look at this log excerpt:

14:16:16 korfwf01 drbd ffm: Starting worker thread (from drbdsetup [30742])
14:16:16 korfwf01 block drbd7: disk( Diskless -> Attaching )
14:16:16 korfwf01 block drbd7: disk( Attaching -> UpToDate )
14:16:16 korfwf01 drbd ffm: conn( StandAlone -> Unconnected )
14:16:16 korfwf01 drbd ffm: conn( Unconnected -> WFConnection )
14:16:16 korfwf01 block drbd7: role( Secondary -> Primary )
14:16:16 korfwf01 drbd ffm: conn( WFConnection -> WFReportParams )
14:16:17 korfwf01 notify-split-brain.sh[30933]: invoked for ffm/0 (drbd7)

After "start", korfwf01 progresses as far as WFConnection; it does not
know anything about the state of korfwf02 yet. Then comes "promote", and
korfwf01 changes to Primary. Only after that do both nodes connect, and
korfwf01 learns that korfwf02 has been Primary in the meantime -> split brain.
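
To make this ordering easier to spot in longer logs, here is a rough
sketch of a filter that flags a promotion logged while the connection
state is anything other than Connected. It assumes kernel log lines in
exactly the format of the excerpt above; the function name
check_promote_before_connect is made up for illustration and is not part
of drbd or pacemaker.

```shell
# Sketch only: flag a "Secondary -> Primary" role change that is logged
# before the DRBD connection state has reached "Connected".
# Reads log lines (format as in the excerpt above) on stdin.
check_promote_before_connect() {
    awk '
        {
            # Track the most recent connection state: conn( Old -> New )
            i = index($0, "conn( ")
            if (i) {
                rest = substr($0, i)
                if (match(rest, /-> [A-Za-z]+ \)/)) {
                    cstate = substr(rest, RSTART + 3, RLENGTH - 5)
                }
            }
        }
        /role\( Secondary -> Primary \)/ {
            if (cstate != "Connected") {
                printf "%s promoted while conn=%s\n", $1, (cstate == "" ? "unknown" : cstate)
                bad = 1
            }
        }
        # Exit status 1 if at least one early promotion was seen
        END { exit (bad ? 1 : 0) }
    '
}
```

Fed the excerpt above, the filter reports the 14:16:16 promotion with
conn=WFConnection. It only tracks a single resource, so logs covering
several drbd resources would have to be split first.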

This does not happen in the first "standby/online/move" cycle because of
the "sleep 10" between "online" and "move", which leaves some time
between "start" and "promote" for both nodes to re-connect.
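
For test scripts, the fixed "sleep 10" could be replaced by polling the
connection state. The following is a minimal sketch under assumptions:
the helper name wait_for_drbd_connected and the 30 second default
timeout are hypothetical, and only the exact state "Connected" is
accepted (states such as SyncSource or SyncTarget are treated as not yet
connected). It relies on "drbdadm cstate <resource>" printing the current
connection state. It is not a fix for the race itself, since pacemaker
does not wait like this between "start" and "promote".

```shell
# Hypothetical helper: wait until the given DRBD resource reports the
# connection state "Connected", or give up after a timeout (seconds).
# Usage: wait_for_drbd_connected <resource> [timeout]
wait_for_drbd_connected() {
    res=$1
    timeout=${2:-30}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # drbdadm cstate prints e.g. WFConnection or Connected
        if [ "$(drbdadm cstate "$res" 2>/dev/null)" = "Connected" ]; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    # Timed out without seeing "Connected"
    return 1
}
```

In the sequence above this would be used on korfwf01 as, for example,
"crm node online korfwf01 ; wait_for_drbd_connected ffm" before the next
step.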

I have attached the crm_report to
	http://bugs.clusterlabs.org/show_bug.cgi?id=5217

Kind regards,
Robert