[Pacemaker] Not unmoving colocated resources can provoke DRBD split-brain

Thu May 22 14:59:16 UTC 2014

On 22/05/14 10:47 AM, Robert Dahlem wrote:
> Hi,
>
> I have a four-node cluster (korfwf01, korfwf02, korfwm01, korfwm02).
>
> There is a DRBD resource which should only run on korfwf01 and korfwf02:
>
> primitive DRBD-ffm ocf:linbit:drbd params drbd_resource=ffm \
>     op start interval=0 timeout=240 \
>     op promote interval=0 timeout=90 \
>     op demote interval=0 timeout=90 \
>     op notify interval=0 timeout=90 \
>     op stop interval=0 timeout=100 \
>     op monitor role=Slave timeout=20 interval=20 \
>     op monitor role=Master timeout=20 interval=10
> ms ms-DRBD-ffm DRBD-ffm \
>     meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 \
>     notify=true
> location loc-ms-DRBD-ffm-korfwm01 ms-DRBD-ffm -inf: korfwm01
> location loc-ms-DRBD-ffm-korfwm02 ms-DRBD-ffm -inf: korfwm02
>
> I would like to have a Dummy resource "ALL-ffm" that works much like a
> group, but less strictly. If I move that Dummy resource from node to
> node, the other resources depending on it should follow.
>
> primitive ALL-ffm ocf:heartbeat:Dummy
> location loc-ALL-ffm-korfwf01 ALL-ffm 2: korfwf01
> location loc-ALL-ffm-korfwf02 ALL-ffm 1: korfwf02
> location loc-ALL-ffm-korfwm01 ALL-ffm -inf: korfwm01
> location loc-ALL-ffm-korfwm02 ALL-ffm -inf: korfwm02
> colocation coloc-ms-DRBD-ffm-with-ALL-ffm inf: ms-DRBD-ffm:Master ALL-ffm
> order ord-ALL-ffm-before-DRBD-ffm inf: ALL-ffm ms-DRBD-ffm
>
> In the beginning everything is ok:
> # crm status
> ALL-ffm        (ocf::heartbeat:Dummy): Started korfwf01
>   Master/Slave Set: ms-DRBD-ffm [DRBD-ffm]
>       Masters: [ korfwf01 ]
>       Slaves: [ korfwf02 ]
> # ssh korfwf01 drbd-overview
>    7:ffm/0      Connected    Primary/Secondary UpToDate/UpToDate
> # ssh korfwf02 drbd-overview
>    7:ffm/0  Connected Secondary/Primary UpToDate/UpToDate
>
> Standby korfwf01, resources are expected to move to korfwf02:
> # crm status
> ALL-ffm        (ocf::heartbeat:Dummy): Started korfwf02
>   Master/Slave Set: ms-DRBD-ffm [DRBD-ffm]
>       Masters: [ korfwf02 ]
>       Stopped: [ korfwf01 korfwm01 korfwm02 ]
> # ssh korfwf01 drbd-overview
>    7:ffm/0      Unconfigured . . . .
> # ssh korfwf02 drbd-overview
>    7:ffm/0  WFConnection Primary/Unknown UpToDate/DUnknown
>
> Standby korfwf02, resources are expected to stop
> # crm node standby korfwf02
> # crm status
> ./.
> # ssh korfwf01 drbd-overview
>    7:ffm/0      Unconfigured . . . .
> # ssh korfwf02 drbd-overview
>    7:ffm/0      Unconfigured . . . .
>
> Online korfwf02, resources are expected to start on korfwf02
> # crm node online korfwf02
> # crm status
> ALL-ffm        (ocf::heartbeat:Dummy): Started korfwf02
>   Master/Slave Set: ms-DRBD-ffm [DRBD-ffm]
>       Masters: [ korfwf02 ]
>       Stopped: [ korfwf01 korfwm01 korfwm02 ]
> # ssh korfwf01 drbd-overview
>    7:ffm/0      Unconfigured . . . .
> # ssh korfwf02 drbd-overview
>    7:ffm/0  WFConnection Primary/Unknown UpToDate/DUnknown
>
> Online korfwf01, resources are expected to STAY on korfwf02
> # crm node online korfwf01
> # crm status
> ALL-ffm        (ocf::heartbeat:Dummy): Started korfwf02
>   Master/Slave Set: ms-DRBD-ffm [DRBD-ffm]
>       Masters: [ korfwf02 ]
>       Slaves: [ korfwf01 ]
> # ssh korfwf01 drbd-overview
>    7:ffm/0      Connected    Secondary/Primary UpToDate/UpToDate
> # ssh korfwf02 drbd-overview
>    7:ffm/0  Connected Primary/Secondary UpToDate/UpToDate
>
> Move ALL-ffm to korfwf01, resources are expected to move to korfwf01
> # crm resource move ALL-ffm korfwf01
> # crm status
> ALL-ffm        (ocf::heartbeat:Dummy): Started korfwf01
>   Master/Slave Set: ms-DRBD-ffm [DRBD-ffm]
>       Masters: [ korfwf01 ]
>       Slaves: [ korfwf02 ]
> # ssh korfwf01 drbd-overview
>    7:ffm/0      Connected    Primary/Secondary UpToDate/UpToDate
> # ssh korfwf02 drbd-overview
>    7:ffm/0  Connected Secondary/Primary UpToDate/UpToDate
>
> Now I "forget" to unmove ALL-ffm and repeat the sequence
> # crm node standby korfwf01 ; sleep 10
> # crm node standby korfwf02 ; sleep 10
> # crm node online korfwf02 ; sleep 10
> # crm node online korfwf01 ; sleep 10
> # crm status
> ALL-ffm        (ocf::heartbeat:Dummy): Started korfwf01
>   Master/Slave Set: ms-DRBD-ffm [DRBD-ffm]
>       Masters: [ korfwf01 ]
>       Slaves: [ korfwf02 ]
> # ssh korfwf01 drbd-overview
>    7:ffm/0      StandAlone   Primary/Unknown UpToDate/DUnknown
> # ssh korfwf02 drbd-overview
>    7:ffm/0  WFConnection Secondary/Unknown UpToDate/DUnknown
>
> *BANG* reproducible DRBD split-brain after the last step.
>
> This does NOT happen without the dependencies on the Dummy resource. I
> think there might be some unfortunate timing of drbd start and stop
> commands.
>
> SLES 11 SP3
> drbd-8.4.4-0.22.9
> drbd-pacemaker-8.4.4-0.22.9
> pacemaker-1.1.10-0.15.25
>
> What can I provide to help analyze this?
>
> Kind regards,
> Robert

I can't speak to the Pacemaker issue, but I can say that a proper
stonith config in Pacemaker, combined with a fencing config in DRBD,
would prevent the split-brain. A node would be rebooted in this scenario,
so you still need to resolve the underlying problem, but a reboot is a
heck of a lot better than a split-brain.
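
As a rough sketch of the DRBD side, assuming DRBD 8.4 (where the
fencing policy lives in the disk section) and assuming the handler
scripts from the drbd-pacemaker package are installed under
/usr/lib/drbd (check the paths on your system), the resource
definition could be extended like this:

```
resource ffm {
    disk {
        # Freeze I/O and call the fence-peer handler when the peer is lost
        fencing resource-and-stonith;
    }
    handlers {
        # Adds a Pacemaker constraint that keeps the outdated peer from
        # being promoted (assumed install path)
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        # Removes that constraint again once the resync has finished
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
    # existing device, disk and host sections stay unchanged
}
```

This only helps if stonith is actually working in Pacemaker, that is,
stonith-enabled is true and a stonith resource matching your hardware
is configured and tested.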

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?