[Pacemaker] Not unmoving colocated resources can provoke DRBD split-brain
Andrew Beekhof
andrew at beekhof.net
Thu Jun 12 02:10:55 CEST 2014
Referring to the king of drbd...
Lars, question for you inline.
On 11 Jun 2014, at 11:14 pm, Robert Dahlem <Robert.Dahlem at gmx.net> wrote:
> Hi Andrew,
>
> On 02.06.2014 02:57, Andrew Beekhof wrote:
>
>>> This seems to be some kind of a race condition: I added
>>> sleep 3
>>> to a central point in /usr/lib/ocf/resource.d/linbit/drbd.
>>
>> Define central?
>
> =======================================================================
> $ diff -u drbd.orig drbd
> --- drbd.orig 2014-06-11 14:02:57.000000000 +0200
> +++ drbd 2014-06-10 16:37:59.000000000 +0200
> @@ -1047,6 +1047,11 @@
> # Everything except usage and meta-data must pass the validate test
> drbd_validate_all || exit
>
> +if $USE_DEBUG_LOG ; then
> + echo OCF_ACTION=$__OCF_ACTION `date` >&9
> + sleep 3
> +fi
> +
> case $__OCF_ACTION in
> start)
> drbd_start
> =======================================================================
>
>>> 1.) Note the parallel "start" at 15:46:53. This could very well end up
>>> in a race condition without "sleep 3".
>>>
>>> 2.) Why is pacemaker doing "stop/start" at all on korfwf02?
>>
>> This seems to correspond to
>>
>> May 23 13:29:31 korfwm01 pengine[5140]: notice: LogActions: Move stonith-korfwf02 (Started korfwm01 -> korfwf01)
>> May 23 13:29:31 korfwm01 pengine[5140]: notice: LogActions: Move ALL-ffm (Started korfwf02 -> korfwf01)
>> May 23 13:29:31 korfwm01 pengine[5140]: notice: LogActions: Demote DRBD-ffm:0 (Master -> Slave korfwf02)
>> May 23 13:29:31 korfwm01 pengine[5140]: notice: LogActions: Restart DRBD-ffm:0 (Slave korfwf02)
>> May 23 13:29:31 korfwm01 pengine[5140]: notice: LogActions: Start DRBD-ffm:1 (korfwf01)
>> May 23 13:29:31 korfwm01 pengine[5140]: notice: LogActions: Promote DRBD-ffm:1 (Stopped -> Master korfwf01)
>> May 23 13:29:31 korfwm01 pengine[5140]: notice: process_pe_message: Calculated Transition 843: /var/lib/pacemaker/pengine/pe-input-728.bz2
>>
>> from your original tarball.
>>
>> In that case, the cause is:
>>
>> <rsc_order id="ord-ALL-ffm-before-DRBD-ffm" score="INFINITY" first="ALL-ffm" then="ms-DRBD-ffm"/>
>>
>> Which requires that ms-DRBD-ffm be completely stopped if ALL-ffm is stopped (which it is, because it's being moved to korfwf01).
>> Perhaps you meant this?
>>
>> <rsc_order id="ord-ALL-ffm-before-DRBD-ffm" score="INFINITY" first="ALL-ffm" then="ms-DRBD-ffm" then-action="promote"/>
>
> I tried that. It triggered another race condition.
>
> =======================================================================
> primitive DRBD-ffm ocf:linbit:drbd params drbd_resource=ffm \
> op start interval=0 timeout=240 \
> op promote interval=0 timeout=90 \
> op demote interval=0 timeout=90 \
> op notify interval=0 timeout=90 \
> op stop interval=0 timeout=100 \
> op monitor role=Slave timeout=20 interval=20 \
> op monitor role=Master timeout=20 interval=10
> ms ms-DRBD-ffm DRBD-ffm meta master-max=1 master-node-max=1 \
> clone-max=2 clone-node-max=1 notify=true
> colocation coloc-ms-DRBD-ffm-follows-ALL-ffm inf: \
> ms-DRBD-ffm:Master ALL-ffm
> order ord-ALL-ffm-before-DRBD-ffm inf: ALL-ffm ms-DRBD-ffm:promote
> location loc-ms-DRBD-ffm-korfwm01 ms-DRBD-ffm -inf: korfwm01
> location loc-ms-DRBD-ffm-korfwm02 ms-DRBD-ffm -inf: korfwm02
> =======================================================================
>
> # crm node standby korfwf01 ; sleep 10
> # crm node online korfwf01 ; sleep 10
> # crm resource move ALL-ffm korfwf01 ; sleep 10
> # crm node standby korfwf01 ; sleep 10
> # crm node online korfwf01 ; sleep 10
> *bang* split-brain.
>
> This is because with the last command "online korfwf01" pacemaker starts
> and then immediately promotes ms-DRBD-ffm without giving drbd any time
> to sync with the peer.

Lars, have you seen anything like this before?
I don't think we have any way to delay the promotion in the PE (policy engine).
Perhaps the agent needs to delay setting a master score if it's out of date?
Or maybe it could loop in the promote action, with a really long timeout set on that op.

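To make the "loop in the promote action" idea concrete, here is a rough sketch,
not anything that exists in the linbit agent today. The names wait_for_peer,
get_cstate and POLL_INTERVAL are made up for illustration. It assumes the agent
can read the connection state with "drbdadm cstate <resource>", and it treats
Connected, SyncSource and SyncTarget as "the handshake with the peer is done".

```shell
# Hypothetical helper, not part of ocf:linbit:drbd.
# get_cstate is a stand-in for however the agent reads the connection
# state; here it is assumed to be "drbdadm cstate <resource>".
get_cstate() {
    drbdadm cstate "$1" 2>/dev/null
}

# Poll the connection state up to max_tries times.
# Returns 0 once the resource reports a state that implies the peer
# handshake has completed, 1 if that never happens within the budget.
wait_for_peer() {
    res=$1
    max_tries=$2
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        case "$(get_cstate "$res")" in
            Connected|SyncSource|SyncTarget) return 0 ;;
        esac
        tries=$((tries + 1))
        sleep "${POLL_INTERVAL:-1}"
    done
    return 1
}
```

A promote handler could call something like "wait_for_peer ffm 60" before
switching to Primary, with the promote timeout raised to match. What to do when
the wait fails is the hard part: if the peer really is gone, refusing to promote
forever is not acceptable either, so that decision is deliberately left out here.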
> Look at this log excerpt:
>
> 14:16:16 korfwf01 drbd ffm: Starting worker thread (from drbdsetup [30742])
> 14:16:16 korfwf01 block drbd7: disk( Diskless -> Attaching )
> 14:16:16 korfwf01 block drbd7: disk( Attaching -> UpToDate )
> 14:16:16 korfwf01 drbd ffm: conn( StandAlone -> Unconnected )
> 14:16:16 korfwf01 drbd ffm: conn( Unconnected -> WFConnection )
> 14:16:16 korfwf01 block drbd7: role( Secondary -> Primary )
> 14:16:16 korfwf01 drbd ffm: conn( WFConnection -> WFReportParams )
> 14:16:17 korfwf01 notify-split-brain.sh[30933]: invoked for ffm/0 (drbd7)
>
> After "start" korfwf01 progresses until WFConnection, it does not know
> anything about the state of korfwf02 yet. Then comes "promote", korfwf01
> changes to Primary. Only after that both nodes connect and korfwf01
> learns that korfwf02 has been Primary in the meantime -> split brain.
>
> This does not happen in the first "standby/online/move" cycle because of
> "sleep 10" between "online" and "move", thus allowing for some time
> between "start" and "promote" and for re-connection between both nodes.
>
> I have attached the crm_report to
> http://bugs.clusterlabs.org/show_bug.cgi?id=5217
>
> Kind regards,
> Robert