[Pacemaker] DRBD and Pacemaker/Corosync

Francois Gaudreault fgaudreault at inverse.ca
Tue May 1 16:38:58 UTC 2012


Hi,

I am not sure if this is the right mailing list to post for our 
problems, let me know if this is more a Corosync issue.

We are facing a weird behavior with Corosync/Pacemaker/DRBD, and I 
expect to have some inputs and help to troubleshoot what is going on. 
Let me try to explain the best I can what we do.

We have a cluster with two node using DRBD running SQL stuff.  We do 
have a location constraint that tells Pacemaker that DRBD should be 
primary on NODE A when able.  When we migrate the services to NODE B, 
and we reboot NODE A, we are facing split brains every single time.

 From what we can see in the logs, it appears that the DRBD resource, 
for some reason, is not waiting for getting an established connection 
(to get initial sync) before changing its role to Primary.  (I apologize 
for the length of the log):

May  1 10:30:34 npf1 kernel: d-con mysql: Starting worker thread (from 
drbdsetup [1960])
May  1 10:30:34 npf1 kernel: block drbd0: disk( Diskless -> Attaching )
May  1 10:30:34 npf1 kernel: block drbd0: Method to ensure write 
ordering: barrier
May  1 10:30:34 npf1 kernel: block drbd0: max BIO size = 1048576
May  1 10:30:34 npf1 kernel: block drbd0: drbd_bm_resize called with 
capacity == 143355552
May  1 10:30:34 npf1 kernel: block drbd0: resync bitmap: bits=17919444 
words=279992 pages=547
May  1 10:30:34 npf1 kernel: block drbd0: size = 68 GB (71677776 KB)
May  1 10:30:34 npf1 kernel: block drbd0: bitmap READ of 547 pages took 
26 jiffies
May  1 10:30:34 npf1 kernel: block drbd0: recounting of set bits took 
additional 1 jiffies
May  1 10:30:34 npf1 kernel: block drbd0: 0 KB (0 bits) marked 
out-of-sync by on disk bit-map.
May  1 10:30:34 npf1 kernel: block drbd0: disk( Attaching -> UpToDate )
May  1 10:30:34 npf1 kernel: block drbd0: attached to UUIDs 
343EF8A0E434C9D8:0000000000000000:282208A916EB5687:282108A916EB5687
May  1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:start:stdout)
May  1 10:30:34 npf1 kernel: d-con mysql: conn( StandAlone -> Unconnected )
May  1 10:30:34 npf1 kernel: d-con mysql: Starting receiver thread (from 
drbd_w_mysql [1961])
May  1 10:30:34 npf1 kernel: d-con mysql: receiver (re)started
May  1 10:30:34 npf1 kernel: d-con mysql: conn( Unconnected -> 
WFConnection )
May  1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:start:stdout)
May  1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:start:stdout)
May  1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM 
operation DRBD:0_start_0 (call=7, rc=0, cib-update=12, confirmed=true) ok
May  1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing 
key=53:84:0:5bc3f587-ac97-491b-b102-2325c4352589 op=DRBD:0_notify_0 )
May  1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD:0:8: notify
May  1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:notify:stdout)
May  1 10:30:34 npf1 crmd: [1806]: info: send_direct_ack: ACK'ing 
resource op DRBD:0_notify_0 from 
53:84:0:5bc3f587-ac97-491b-b102-2325c4352589: lrm_invoke-lrmd-1335882634-4
May  1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM 
operation DRBD:0_notify_0 (call=8, rc=0, cib-update=0, confirmed=true) ok
May  1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing 
key=58:84:0:5bc3f587-ac97-491b-b102-2325c4352589 op=DRBD:0_notify_0 )
May  1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD:0:9: notify
May  1 10:30:34 npf1 crmd: [1806]: info: send_direct_ack: ACK'ing 
resource op DRBD:0_notify_0 from 
58:84:0:5bc3f587-ac97-491b-b102-2325c4352589: lrm_invoke-lrmd-1335882634-5
May  1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM 
operation DRBD:0_notify_0 (call=9, rc=0, cib-update=0, confirmed=true) ok
May  1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing 
key=20:84:0:5bc3f587-ac97-491b-b102-2325c4352589 op=DRBD:0_promote_0 )
May  1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD:0:10: promote
May  1 10:30:34 npf1 kernel: block drbd0: role( Secondary -> Primary )
May  1 10:30:34 npf1 kernel: block drbd0: new current UUID 
66767846186A82BB:343EF8A0E434C9D8:282208A916EB5687:282108A916EB5687
May  1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:promote:stdout)
May  1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM 
operation DRBD:0_promote_0 (call=10, rc=0, cib-update=13, confirmed=true) ok
May  1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing 
key=59:84:0:5bc3f587-ac97-491b-b102-2325c4352589 op=DRBD:0_notify_0 )
May  1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD:0:11: notify
May  1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:notify:stdout)
May  1 10:30:34 npf1 crmd: [1806]: info: send_direct_ack: ACK'ing 
resource op DRBD:0_notify_0 from 
59:84:0:5bc3f587-ac97-491b-b102-2325c4352589: lrm_invoke-lrmd-1335882634-6
May  1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM 
operation DRBD:0_notify_0 (call=11, rc=0, cib-update=0, confirmed=true) ok
May  1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing 
key=7:84:0:5bc3f587-ac97-491b-b102-2325c4352589 op=DRBD_fs_start_0 )
May  1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD_fs:12: start
May  1 10:30:34 npf1 Filesystem[2089]: INFO: Running start for 
/dev/drbd0 on /var/lib/mysql
May  1 10:30:34 npf1 kernel: EXT4-fs (drbd0): mounted filesystem with 
ordered data mode. Opts:
May  1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM 
operation DRBD_fs_start_0 (call=12, rc=0, cib-update=14, confirmed=true) ok
May  1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing 
key=8:84:0:5bc3f587-ac97-491b-b102-2325c4352589 op=DRBD_fs_monitor_120000 )
May  1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD_fs:13: monitor
May  1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing 
key=9:84:0:5bc3f587-ac97-491b-b102-2325c4352589 op=MySQL_start_0 )
May  1 10:30:34 npf1 lrmd: [1803]: info: rsc:MySQL:14: start
May  1 10:30:34 npf1 lrmd: [2176]: WARN: For LSB init script, no 
additional parameters are needed.
May  1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM 
operation DRBD_fs_monitor_120000 (call=13, rc=0, cib-update=15, 
confirmed=false) o
May  1 10:30:35 npf1 kernel: d-con mysql: Handshake successful: Agreed 
network protocol version 100
May  1 10:30:35 npf1 kernel: d-con mysql: conn( WFConnection -> 
WFReportParams )
May  1 10:30:35 npf1 kernel: d-con mysql: Starting asender thread (from 
drbd_r_mysql [1973])
May  1 10:30:35 npf1 kernel: block drbd0: drbd_sync_handshake:
May  1 10:30:35 npf1 kernel: block drbd0: self 
66767846186A82BB:343EF8A0E434C9D8:282208A916EB5687:282108A916EB5687 
bits:1 flags:0
May  1 10:30:35 npf1 kernel: block drbd0: peer 
D06952D9712CA916:343EF8A0E434C9D8:282208A916EB5686:282108A916EB5687 
bits:41 flags:0
May  1 10:30:35 npf1 kernel: block drbd0: uuid_compare()=100 by rule 90
May  1 10:30:35 npf1 kernel: block drbd0: helper command: /sbin/drbdadm 
initial-split-brain minor-0
May  1 10:30:35 npf1 kernel: block drbd0: helper command: /sbin/drbdadm 
initial-split-brain minor-0 exit code 0 (0x0)
May  1 10:30:35 npf1 kernel: block drbd0: Split-Brain detected, 1 
primaries, automatically solved. Sync from this node
May  1 10:30:35 npf1 kernel: block drbd0: peer( Unknown -> Secondary ) 
conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )

The only solutions we found is to put automatic SB resolving and 
resource-stickiness, but we do NOT want auto split brain recovery and we 
want the services to go back on NODE A when the cluster is back to a 
sane state.

I can provide the configs if needed.

Thanks for your help!

-- 
Francois Gaudreault, ing. jr
fgaudreault at inverse.ca  ::  +1.514.447.4918 (x130) ::  www.inverse.ca
Inverse inc. :: Leaders behind SOGo (www.sogo.nu) and PacketFence 
(www.packetfence.org)




More information about the Pacemaker mailing list