[Pacemaker] Not unmoving colocated resources can provoke DRBD split-brain
Robert Dahlem
Robert.Dahlem at gmx.net
Thu May 22 14:47:56 UTC 2014
Hi,
I have a 4-node cluster (korfwf01, korfwf02, korfwm01, korfwm02).
There is a DRBD resource which should only run on korfwf01 and korfwf02:
primitive DRBD-ffm ocf:linbit:drbd params drbd_resource=ffm \
op start interval=0 timeout=240 \
op promote interval=0 timeout=90 \
op demote interval=0 timeout=90 \
op notify interval=0 timeout=90 \
op stop interval=0 timeout=100 \
op monitor role=Slave timeout=20 interval=20 \
op monitor role=Master timeout=20 interval=10
ms ms-DRBD-ffm DRBD-ffm \
meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 \
notify=true
location loc-ms-DRBD-ffm-korfwm01 ms-DRBD-ffm -inf: korfwm01
location loc-ms-DRBD-ffm-korfwm02 ms-DRBD-ffm -inf: korfwm02
I would like to have a Dummy resource "ALL-ffm" that works much like a
group, but less strictly: if I move that Dummy resource from node to
node, the other resources depending on it should follow.
primitive ALL-ffm ocf:heartbeat:Dummy
location loc-ALL-ffm-korfwf01 ALL-ffm 2: korfwf01
location loc-ALL-ffm-korfwf02 ALL-ffm 1: korfwf02
location loc-ALL-ffm-korfwm01 ALL-ffm -inf: korfwm01
location loc-ALL-ffm-korfwm02 ALL-ffm -inf: korfwm02
colocation coloc-ms-DRBD-ffm-with-ALL-ffm inf: ms-DRBD-ffm:Master ALL-ffm
order ord-ALL-ffm-before-DRBD-ffm inf: ALL-ffm ms-DRBD-ffm
In the beginning everything is ok:
# crm status
ALL-ffm (ocf::heartbeat:Dummy): Started korfwf01
Master/Slave Set: ms-DRBD-ffm [DRBD-ffm]
Masters: [ korfwf01 ]
Slaves: [ korfwf02 ]
# ssh korfwf01 drbd-overview
7:ffm/0 Connected Primary/Secondary UpToDate/UpToDate
# ssh korfwf02 drbd-overview
7:ffm/0 Connected Secondary/Primary UpToDate/UpToDate
Standby korfwf01, resources are expected to move to korfwf02:
# crm node standby korfwf01
# crm status
ALL-ffm (ocf::heartbeat:Dummy): Started korfwf02
Master/Slave Set: ms-DRBD-ffm [DRBD-ffm]
Masters: [ korfwf02 ]
Stopped: [ korfwf01 korfwm01 korfwm02 ]
# ssh korfwf01 drbd-overview
7:ffm/0 Unconfigured . . . .
# ssh korfwf02 drbd-overview
7:ffm/0 WFConnection Primary/Unknown UpToDate/DUnknown
Standby korfwf02, resources are expected to stop
# crm node standby korfwf02
# crm status
(no resources listed)
# ssh korfwf01 drbd-overview
7:ffm/0 Unconfigured . . . .
# ssh korfwf02 drbd-overview
7:ffm/0 Unconfigured . . . .
Online korfwf02, resources are expected to start on korfwf02
# crm node online korfwf02
# crm status
ALL-ffm (ocf::heartbeat:Dummy): Started korfwf02
Master/Slave Set: ms-DRBD-ffm [DRBD-ffm]
Masters: [ korfwf02 ]
Stopped: [ korfwf01 korfwm01 korfwm02 ]
# ssh korfwf01 drbd-overview
7:ffm/0 Unconfigured . . . .
# ssh korfwf02 drbd-overview
7:ffm/0 WFConnection Primary/Unknown UpToDate/DUnknown
Online korfwf01, resources are expected to STAY on korfwf02
# crm node online korfwf01
# crm status
ALL-ffm (ocf::heartbeat:Dummy): Started korfwf02
Master/Slave Set: ms-DRBD-ffm [DRBD-ffm]
Masters: [ korfwf02 ]
Slaves: [ korfwf01 ]
# ssh korfwf01 drbd-overview
7:ffm/0 Connected Secondary/Primary UpToDate/UpToDate
# ssh korfwf02 drbd-overview
7:ffm/0 Connected Primary/Secondary UpToDate/UpToDate
Move ALL-ffm to korfwf01, resources are expected to move to korfwf01
# crm resource move ALL-ffm korfwf01
# crm status
ALL-ffm (ocf::heartbeat:Dummy): Started korfwf01
Master/Slave Set: ms-DRBD-ffm [DRBD-ffm]
Masters: [ korfwf01 ]
Slaves: [ korfwf02 ]
# ssh korfwf01 drbd-overview
7:ffm/0 Connected Primary/Secondary UpToDate/UpToDate
# ssh korfwf02 drbd-overview
7:ffm/0 Connected Secondary/Primary UpToDate/UpToDate
Now I "forget" to unmove ALL-ffm and repeat the sequence:
# crm node standby korfwf01 ; sleep 10
# crm node standby korfwf02 ; sleep 10
# crm node online korfwf02 ; sleep 10
# crm node online korfwf01 ; sleep 10
# crm status
ALL-ffm (ocf::heartbeat:Dummy): Started korfwf01
Master/Slave Set: ms-DRBD-ffm [DRBD-ffm]
Masters: [ korfwf01 ]
Slaves: [ korfwf02 ]
# ssh korfwf01 drbd-overview
7:ffm/0 StandAlone Primary/Unknown UpToDate/DUnknown
# ssh korfwf02 drbd-overview
7:ffm/0 WFConnection Secondary/Unknown UpToDate/DUnknown
*BANG* reproducible DRBD split-brain after the last step.
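The symptom in the last two outputs (one node StandAlone, the other stuck in WFConnection) can be spotted mechanically. The following is a minimal sketch, assuming the connection state is always the second whitespace-separated column of drbd-overview, as in the output shown here; the function name drbd_conn_state is hypothetical and not part of DRBD or Pacemaker.

```shell
# Classify each line of drbd-overview output read from stdin by its
# connection state (second column). Prints one word per input line:
# "standalone", "waiting", "connected" or "other".
drbd_conn_state() {
    awk '{
        if ($2 == "StandAlone")        print "standalone"
        else if ($2 == "WFConnection") print "waiting"
        else if ($2 == "Connected")    print "connected"
        else                           print "other"
    }'
}
```

It could be used as `ssh korfwf01 drbd-overview | drbd_conn_state`. A "standalone" result on one node together with "waiting" on the peer matches the situation above; the kernel log on both nodes would be the place to confirm that DRBD really detected a split-brain.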
This does NOT happen without the dependencies on the Dummy resource. I
think there might be some unfortunate timing of drbd start and stop
commands.
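Because the failing sequence depends on the constraint left behind by "crm resource move", it may help to check for it before repeating the test. This is a minimal sketch, assuming the move constraint appears in "crm configure show" as a location constraint whose id starts with cli-prefer- (the usual naming in this Pacemaker generation, but not verified on this cluster); the function name leftover_move_constraints is hypothetical.

```shell
# Print the ids of leftover "crm resource move" constraints found in
# crm configuration text read from stdin.
# Assumption: such constraints are location constraints whose id
# begins with "cli-prefer-".
leftover_move_constraints() {
    awk '$1 == "location" && $2 ~ /^cli-prefer-/ { print $2 }'
}
```

Usage would be `crm configure show | leftover_move_constraints`; any id printed can then be cleared with `crm resource unmove ALL-ffm` before the standby/online sequence is run again.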
SLES 11 SP3
drbd-8.4.4-0.22.9
drbd-pacemaker-8.4.4-0.22.9
pacemaker-1.1.10-0.15.25
What can I provide to help analyze this?
Kind regards,
Robert