[Pacemaker] Not unmoving colocated resources can provoke DRBD split-brain

Robert Dahlem Robert.Dahlem at gmx.net
Thu May 22 16:47:56 CEST 2014


Hi,

I have a four-node cluster (korfwf01, korfwf02, korfwm01, korfwm02).

There is a DRBD resource which should only run on korfwf01 and korfwf02:

primitive DRBD-ffm ocf:linbit:drbd params drbd_resource=ffm \
   op start interval=0 timeout=240 \
   op promote interval=0 timeout=90 \
   op demote interval=0 timeout=90 \
   op notify interval=0 timeout=90 \
   op stop interval=0 timeout=100 \
   op monitor role=Slave timeout=20 interval=20 \
   op monitor role=Master timeout=20 interval=10
ms ms-DRBD-ffm DRBD-ffm \
   meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 \
   notify=true
location loc-ms-DRBD-ffm-korfwm01 ms-DRBD-ffm -inf: korfwm01
location loc-ms-DRBD-ffm-korfwm02 ms-DRBD-ffm -inf: korfwm02

I would like to have a Dummy resource "ALL-ffm" that works much like a
group, but less strictly. If I move that Dummy resource from node to
node, other resources depending on it should follow.

primitive ALL-ffm ocf:heartbeat:Dummy
location loc-ALL-ffm-korfwf01 ALL-ffm 2: korfwf01
location loc-ALL-ffm-korfwf02 ALL-ffm 1: korfwf02
location loc-ALL-ffm-korfwm01 ALL-ffm -inf: korfwm01
location loc-ALL-ffm-korfwm02 ALL-ffm -inf: korfwm02
colocation coloc-ms-DRBD-ffm-with-ALL-ffm inf: ms-DRBD-ffm:Master ALL-ffm
order ord-ALL-ffm-before-DRBD-ffm inf: ALL-ffm ms-DRBD-ffm

In the beginning everything is ok:
# crm status
ALL-ffm        (ocf::heartbeat:Dummy): Started korfwf01
 Master/Slave Set: ms-DRBD-ffm [DRBD-ffm]
     Masters: [ korfwf01 ]
     Slaves: [ korfwf02 ]
# ssh korfwf01 drbd-overview
  7:ffm/0      Connected    Primary/Secondary UpToDate/UpToDate
# ssh korfwf02 drbd-overview
  7:ffm/0  Connected Secondary/Primary UpToDate/UpToDate

Standby korfwf01, resources are expected to move to korfwf02:
# crm node standby korfwf01
# crm status
ALL-ffm        (ocf::heartbeat:Dummy): Started korfwf02
 Master/Slave Set: ms-DRBD-ffm [DRBD-ffm]
     Masters: [ korfwf02 ]
     Stopped: [ korfwf01 korfwm01 korfwm02 ]
# ssh korfwf01 drbd-overview
  7:ffm/0      Unconfigured . . . .
# ssh korfwf02 drbd-overview
  7:ffm/0  WFConnection Primary/Unknown UpToDate/DUnknown

Standby korfwf02, resources are expected to stop:
# crm node standby korfwf02
# crm status
(no resources started)
# ssh korfwf01 drbd-overview
  7:ffm/0      Unconfigured . . . .
# ssh korfwf02 drbd-overview
  7:ffm/0      Unconfigured . . . .

Online korfwf02, resources are expected to start on korfwf02:
# crm node online korfwf02
# crm status
ALL-ffm        (ocf::heartbeat:Dummy): Started korfwf02
 Master/Slave Set: ms-DRBD-ffm [DRBD-ffm]
     Masters: [ korfwf02 ]
     Stopped: [ korfwf01 korfwm01 korfwm02 ]
# ssh korfwf01 drbd-overview
  7:ffm/0      Unconfigured . . . .
# ssh korfwf02 drbd-overview
  7:ffm/0  WFConnection Primary/Unknown UpToDate/DUnknown

Online korfwf01, resources are expected to STAY on korfwf02:
# crm node online korfwf01
# crm status
ALL-ffm        (ocf::heartbeat:Dummy): Started korfwf02
 Master/Slave Set: ms-DRBD-ffm [DRBD-ffm]
     Masters: [ korfwf02 ]
     Slaves: [ korfwf01 ]
# ssh korfwf01 drbd-overview
  7:ffm/0      Connected    Secondary/Primary UpToDate/UpToDate
# ssh korfwf02 drbd-overview
  7:ffm/0  Connected Primary/Secondary UpToDate/UpToDate

Move ALL-ffm to korfwf01, resources are expected to move to korfwf01:
# crm resource move ALL-ffm korfwf01
# crm status
ALL-ffm        (ocf::heartbeat:Dummy): Started korfwf01
 Master/Slave Set: ms-DRBD-ffm [DRBD-ffm]
     Masters: [ korfwf01 ]
     Slaves: [ korfwf02 ]
# ssh korfwf01 drbd-overview
  7:ffm/0      Connected    Primary/Secondary UpToDate/UpToDate
# ssh korfwf02 drbd-overview
  7:ffm/0  Connected Secondary/Primary UpToDate/UpToDate
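
Note that "crm resource move" does not just move the resource once. It
leaves a location constraint in the CIB that keeps ALL-ffm pinned to
korfwf01 until "crm resource unmove ALL-ffm" removes it. A minimal sketch
for spotting such leftovers, assuming the constraint ids start with
cli-prefer-, cli-standby- or cli-ban- (the prefixes crm_resource uses as
far as I know; the helper name is made up for illustration):

```shell
#!/bin/sh
# Hypothetical helper: reads crm configuration text on stdin and prints
# only the location constraints that look like leftovers of
# "crm resource move" (assumed id prefixes: cli-prefer-, cli-standby-,
# cli-ban-).
leftover_move_constraints() {
    grep -E '^location cli-(prefer|standby|ban)-'
}
```

Usage would be something like "crm configure show | leftover_move_constraints".
An empty result should mean that no move is still in effect.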

Now I "forget" to unmove ALL-ffm and repeat the sequence:
# crm node standby korfwf01 ; sleep 10
# crm node standby korfwf02 ; sleep 10
# crm node online korfwf02 ; sleep 10
# crm node online korfwf01 ; sleep 10
# crm status
ALL-ffm        (ocf::heartbeat:Dummy): Started korfwf01
 Master/Slave Set: ms-DRBD-ffm [DRBD-ffm]
     Masters: [ korfwf01 ]
     Slaves: [ korfwf02 ]
# ssh korfwf01 drbd-overview
  7:ffm/0      StandAlone   Primary/Unknown UpToDate/DUnknown
# ssh korfwf02 drbd-overview
  7:ffm/0  WFConnection Secondary/Unknown UpToDate/DUnknown

*BANG* reproducible DRBD split-brain after the last step.
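
The sequence above can be scripted so that it is easy to repeat. A
minimal sketch, assuming a plain POSIX shell; CRM and DELAY are
hypothetical variables introduced here for a dry run, not crmsh settings:

```shell
#!/bin/sh
# Repeats the standby/online sequence from this post.
# CRM and DELAY are illustrative knobs (assumptions, not part of crmsh):
#   CRM=echo   prints the commands instead of running them
#   DELAY=0    skips the pause between steps
: "${CRM:=crm}"
: "${DELAY:=10}"

repro_cycle() {
    # Both nodes to standby, then back online in reverse order.
    for step in "standby korfwf01" "standby korfwf02" \
                "online korfwf02" "online korfwf01"; do
        # $step is left unquoted on purpose so it splits into two arguments.
        $CRM node $step || return 1
        sleep "$DELAY"
    done
}
```

After sourcing the file, calling repro_cycle runs the four steps. With
CRM=echo it only prints them, which is handy for checking the order first.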

This does NOT happen without the dependencies on the Dummy resource. I
think there might be some unfortunate timing of drbd start and stop
commands.

SLES 11 SP3
drbd-8.4.4-0.22.9
drbd-pacemaker-8.4.4-0.22.9
pacemaker-1.1.10-0.15.25

What can I provide to help analyze this?

Kind regards,
Robert


