[Pacemaker] Not unmoving colocated resources can provoke DRBD split-brain

Thu Jun 12 02:10:55 CEST 2014

Referring to the king of drbd... 
Lars, question for you inline.

On 11 Jun 2014, at 11:14 pm, Robert Dahlem <Robert.Dahlem at gmx.net> wrote:

> Hi Andrew,
> 
> On 02.06.2014 02:57, Andrew Beekhof wrote:
> 
>>> This seems to be some kind of a race condition: I added
>>> 	sleep 3
>>> to a central point in /usr/lib/ocf/resource.d/linbit/drbd.
>> 
>> Define central?
> 
> =======================================================================
> $ diff -u drbd.orig drbd
> --- drbd.orig    2014-06-11 14:02:57.000000000 +0200
> +++ drbd 2014-06-10 16:37:59.000000000 +0200
> @@ -1047,6 +1047,11 @@
> # Everything except usage and meta-data must pass the validate test
> drbd_validate_all || exit
> 
> +if $USE_DEBUG_LOG ; then
> +       echo OCF_ACTION=$__OCF_ACTION `date` >&9
> +       sleep 3
> +fi
> +
> case $__OCF_ACTION in
> start)
>        drbd_start
> =======================================================================
> 
>>> 1.) Note the parallel "start" at 15:46:53. This could very well end up
>>> in a race condition without "sleep 3".
>>> 
>>> 2.) Why is pacemaker doing "stop/start" at all on korfwf02?
>> 
>> This seems to correspond to 
>> 
>> May 23 13:29:31 korfwm01 pengine[5140]:   notice: LogActions: Move    stonith-korfwf02	(Started korfwm01 -> korfwf01)
>> May 23 13:29:31 korfwm01 pengine[5140]:   notice: LogActions: Move    ALL-ffm	(Started korfwf02 -> korfwf01)
>> May 23 13:29:31 korfwm01 pengine[5140]:   notice: LogActions: Demote  DRBD-ffm:0	(Master -> Slave korfwf02)
>> May 23 13:29:31 korfwm01 pengine[5140]:   notice: LogActions: Restart DRBD-ffm:0	(Slave korfwf02)
>> May 23 13:29:31 korfwm01 pengine[5140]:   notice: LogActions: Start   DRBD-ffm:1	(korfwf01)
>> May 23 13:29:31 korfwm01 pengine[5140]:   notice: LogActions: Promote DRBD-ffm:1	(Stopped -> Master korfwf01)
>> May 23 13:29:31 korfwm01 pengine[5140]:   notice: process_pe_message: Calculated Transition 843: /var/lib/pacemaker/pengine/pe-input-728.bz2
>> 
>> from your original tarball.
>> 
>> In that case, the cause is:
>> 
>>      <rsc_order id="ord-ALL-ffm-before-DRBD-ffm" score="INFINITY" first="ALL-ffm" then="ms-DRBD-ffm"/>
>> 
>> Which requires that ms-DRBD-ffm be completely stopped if ALL-ffm is stopped (which it is, because it's being moved to 01).
>> Perhaps you meant this?
>> 
>>      <rsc_order id="ord-ALL-ffm-before-DRBD-ffm" score="INFINITY" first="ALL-ffm" then="ms-DRBD-ffm" then-action="promote"/>
> 
> I tried that. It triggered another race condition.
> 
> =======================================================================
> primitive DRBD-ffm ocf:linbit:drbd params drbd_resource=ffm \
> op start interval=0 timeout=240 \
> op promote interval=0 timeout=90 \
> op demote interval=0 timeout=90 \
> op notify interval=0 timeout=90 \
> op stop interval=0 timeout=100 \
> op monitor role=Slave timeout=20 interval=20 \
> op monitor role=Master timeout=20 interval=10
> ms ms-DRBD-ffm DRBD-ffm meta master-max=1 master-node-max=1 \
> clone-max=2 clone-node-max=1 notify=true
> colocation coloc-ms-DRBD-ffm-follows-ALL-ffm inf: \
> ms-DRBD-ffm:Master ALL-ffm
> order ord-ALL-ffm-before-DRBD-ffm inf: ALL-ffm ms-DRBD-ffm:promote
> location loc-ms-DRBD-ffm-korfwm01 ms-DRBD-ffm -inf: korfwm01
> location loc-ms-DRBD-ffm-korfwm02 ms-DRBD-ffm -inf: korfwm02
> =======================================================================
> 
> # crm node standby korfwf01 ; sleep 10
> # crm node online korfwf01 ; sleep 10
> # crm resource move ALL-ffm korfwf01 ; sleep 10
> # crm node standby korfwf01 ; sleep 10
> # crm node online korfwf01 ; sleep 10
> *bang* split-brain.
> 
> This is because with the last command "online korfwf01" pacemaker starts
> and then immediately promotes ms-DRBD-ffm without giving any time for
> drbd to sync with the peer.

Have you seen anything like this before?
I don't know that we have any capacity to delay the promotion in the PE...
Perhaps the agent needs to delay setting a master score if it's out of date?
Or maybe loop in the promote action and set a really long timeout.
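
To make the "loop in the promote action" idea concrete, here is a rough, untested sketch. It is not what the linbit agent does. wait_for_peer and CSTATE_CMD are hypothetical names made up for illustration, and the list of connection states treated as "the peer has been heard from" is an assumption. The only real command used is "drbdadm cstate <resource>".

```shell
# Sketch only: wait_for_peer is a hypothetical helper, not part of the
# linbit drbd agent. CSTATE_CMD is an assumed hook so that the state query
# can be swapped out; by default it runs "drbdadm cstate <resource>".
: "${CSTATE_CMD:=drbdadm cstate}"

wait_for_peer() {
    res=$1          # DRBD resource name, e.g. ffm
    tries=$2        # maximum number of polls before giving up
    delay=${3:-1}   # seconds to sleep between polls

    while [ "$tries" -gt 0 ]; do
        state=$($CSTATE_CMD "$res" 2>/dev/null)
        case $state in
            # Assumption: in these states the handshake with the peer is done,
            # so the node knows whether the peer is (or was) Primary.
            Connected|SyncSource|SyncTarget)
                return 0
                ;;
        esac
        tries=$((tries - 1))
        sleep "$delay"
    done

    # Still not connected (e.g. stuck in WFConnection): let the caller decide
    # whether to promote anyway or to fail the promote action.
    return 1
}
```

The promote action could call something like "wait_for_peer ffm 30 1" before switching to Primary. What to do when it returns 1 is a policy question: if the peer really is down, refusing to promote forever would be wrong, so the number of tries has to stay well inside the promote timeout.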

> Look at this log excerpt:
> 
> 14:16:16 korfwf01 drbd ffm: Starting worker thread (from drbdsetup [30742])
> 14:16:16 korfwf01 block drbd7: disk( Diskless -> Attaching )
> 14:16:16 korfwf01 block drbd7: disk( Attaching -> UpToDate )
> 14:16:16 korfwf01 drbd ffm: conn( StandAlone -> Unconnected )
> 14:16:16 korfwf01 drbd ffm: conn( Unconnected -> WFConnection )
> 14:16:16 korfwf01 block drbd7: role( Secondary -> Primary )
> 14:16:16 korfwf01 drbd ffm: conn( WFConnection -> WFReportParams )
> 14:16:17 korfwf01 notify-split-brain.sh[30933]: invoked for ffm/0 (drbd7)
> 
> After "start" korfwf01 progresses until WFConnection, it does not know
> anything about the state of korfwf02 yet. Then comes "promote", korfwf01
> changes to Primary. Only after that both nodes connect and korfwf01
> learns that korfwf02 has been Primary in the meantime -> split brain.
> 
> This does not happen in the first "standby/online/move" cycle because of
> "sleep 10" between "online" and "move", thus allowing for some time
> between "start" and "promote" and for re-connection between both nodes.
> 
> I have attached the crm_report to
> 	http://bugs.clusterlabs.org/show_bug.cgi?id=5217
> 
> Kind regards,
> Robert
> 
