[Pacemaker] Split Brain on DRBD Dual Primary

Wed Nov 12 07:16:24 CET 2014

Hi All,

On October archives, I saw the issue reported by Felix Zachlod on http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022653.html and the same is actually happens to me now on dual primary DRBD node.

My current OS was RHEL 6.6 and software version that I used was
pacemaker-1.1.12-4.el6.x86_64
corosync-1.4.7-1.el6.x86_64
cman-3.0.12.1-68.el6.x86_64
drbd84-utils-8.9.1-1.el6.elrepo.x86_64
kmod-drbd84-8.4.5-2.el6.elrepo.x86_64
gfs2-utils-3.0.12.1-68.el6.x86_64

First, I will explain my existing resource. I have 3 resource which are drbd, dlm for gfs2, and HomeFS.

Master: HomeDataClone
  Meta Attrs: master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true interval=0s
  Resource: HomeData (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=homedata
   Operations: start interval=0s timeout=240 (HomeData-start-timeout-240)
               promote interval=0s (HomeData-promote-interval-0s)
               demote interval=0s timeout=90 (HomeData-demote-timeout-90)
               stop interval=0s timeout=100 (HomeData-stop-timeout-100)
               monitor interval=60s (HomeData-monitor-interval-60s)
Clone: HomeFS-clone
  Meta Attrs: start-delay=30s target-role=Stopped
  Resource: HomeFS (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/drbd/by-res/homedata directory=/home fstype=gfs2
   Operations: start interval=0s timeout=60 (HomeFS-start-timeout-60)
               stop interval=0s timeout=60 (HomeFS-stop-timeout-60)
               monitor interval=20 timeout=40 (HomeFS-monitor-interval-20)
Clone: dlm-clone
  Meta Attrs: clone-max=2 clone-node-max=1 start-delay=0s
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: start interval=0s timeout=90 (dlm-start-timeout-90)
               stop interval=0s timeout=100 (dlm-stop-timeout-100)
               monitor interval=60s (dlm-monitor-interval-60s)

But when I try to start the cluster on normal condition, It will cause split brain on DRBD on each node. From the log I can see it was the same case with Felix which was caused by pacemaker promoting drbd to primary  while it was still waiting for handshake connection on each node.

Nov 12 11:37:32 node002 kernel: block drbd1: disk( Attaching -> UpToDate )
Nov 12 11:37:32 node002 kernel: block drbd1: attached to UUIDs C9630089EC3B58CC:0000000000000000:B4653C665EBC0DBB:B4643C665EBC0DBA
Nov 12 11:37:32 node002 kernel: drbd homedata: conn( StandAlone -> Unconnected )
Nov 12 11:37:32 node002 kernel: drbd homedata: Starting receiver thread (from drbd_w_homedata [22531])
Nov 12 11:37:32 node002 kernel: drbd homedata: receiver (re)started
Nov 12 11:37:32 node002 kernel: drbd homedata: conn( Unconnected -> WFConnection )
Nov 12 11:37:32 node002 attrd[22340]:   notice: attrd_trigger_update: Sending flush op to all hosts for: master-HomeData (1000)
Nov 12 11:37:32 node002 attrd[22340]:   notice: attrd_perform_update: Sent update 17: master-HomeData=1000
Nov 12 11:37:32 node002 crmd[22342]:   notice: process_lrm_event: Operation HomeData_start_0: ok (node=node002, call=18, rc=0, cib-update=13, confirmed=true)
Nov 12 11:37:33 node002 crmd[22342]:   notice: process_lrm_event: Operation HomeData_notify_0: ok (node=node002, call=19, rc=0, cib-update=0, confirmed=true)
Nov 12 11:37:33 node002 crmd[22342]:   notice: process_lrm_event: Operation HomeData_notify_0: ok (node=node002, call=20, rc=0, cib-update=0, confirmed=true)
Nov 12 11:37:33 node002 kernel: block drbd1: role( Secondary -> Primary )
Nov 12 11:37:33 node002 kernel: block drbd1: new current UUID 58F02AE0E03C1C91:C9630089EC3B58CC:B4653C665EBC0DBB:B4643C665EBC0DBA
Nov 12 11:37:33 node002 crmd[22342]:   notice: process_lrm_event: Operation HomeData_promote_0: ok (node=node002, call=21, rc=0, cib-update=14, confirmed=true)
Nov 12 11:37:33 node002 attrd[22340]:   notice: attrd_trigger_update: Sending flush op to all hosts for: master-HomeData (10000)
Nov 12 11:37:33 node002 attrd[22340]:   notice: attrd_perform_update: Sent update 23: master-HomeData=10000
Nov 12 11:37:33 node002 crmd[22342]:   notice: process_lrm_event: Operation HomeData_notify_0: ok (node=node002, call=22, rc=0, cib-update=0, confirmed=true)
Nov 12 11:37:33 node002 kernel: drbd homedata: Handshake successful: Agreed network protocol version 101
Nov 12 11:37:33 node002 kernel: drbd homedata: Agreed to support TRIM on protocol level
Nov 12 11:37:33 node002 kernel: drbd homedata: Peer authenticated using 20 bytes HMAC
Nov 12 11:37:33 node002 kernel: drbd homedata: conn( WFConnection -> WFReportParams )
Nov 12 11:37:33 node002 kernel: drbd homedata: Starting asender thread (from drbd_r_homedata [22543])
Nov 12 11:37:33 node002 kernel: block drbd1: drbd_sync_handshake:
Nov 12 11:37:33 node002 kernel: block drbd1: self 58F02AE0E03C1C91:C9630089EC3B58CC:B4653C665EBC0DBB:B4643C665EBC0DBA bits:0 flags:0
Nov 12 11:37:33 node002 kernel: block drbd1: peer 0FAA8E4B66817421:C9630089EC3B58CD:B4653C665EBC0DBA:B4643C665EBC0DBA bits:0 flags:0
Nov 12 11:37:33 node002 kernel: block drbd1: uuid_compare()=100 by rule 90
Nov 12 11:37:33 node002 kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1
Nov 12 11:37:33 node002 kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)
Nov 12 11:37:33 node002 kernel: block drbd1: Split-Brain detected but unresolved, dropping connection!
Nov 12 11:37:33 node002 kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1
Nov 12 11:37:33 node002 kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
Nov 12 11:37:33 node002 kernel: drbd homedata: conn( WFReportParams -> Disconnecting )
Nov 12 11:37:33 node002 kernel: drbd homedata: error receiving ReportState, e: -5 l: 0!
Nov 12 11:37:33 node002 kernel: drbd homedata: asender terminated
Nov 12 11:37:33 node002 kernel: drbd homedata: Terminating drbd_a_homedata
Nov 12 11:37:33 node002 kernel: drbd homedata: Connection closed
Nov 12 11:37:33 node002 kernel: drbd homedata: conn( Disconnecting -> StandAlone )
Nov 12 11:37:33 node002 kernel: drbd homedata: receiver terminated
Nov 12 11:37:33 node002 kernel: drbd homedata: Terminating drbd_r_homedata

But if I disable the other two resource and only have HomeDataClone resource enabled on cluster startup, then drbd device starts connected and both nodes promoted to primary. Here is the log

Nov 12 12:38:11 node002 kernel: drbd homedata: Starting worker thread (from drbdsetup-84 [26752])
Nov 12 12:38:11 node002 kernel: block drbd1: disk( Diskless -> Attaching )
Nov 12 12:38:11 node002 kernel: drbd homedata: Method to ensure write ordering: flush
Nov 12 12:38:11 node002 kernel: block drbd1: max BIO size = 1048576
Nov 12 12:38:11 node002 kernel: block drbd1: drbd_bm_resize called with capacity == 314563128
Nov 12 12:38:11 node002 kernel: block drbd1: resync bitmap: bits=39320391 words=614382 pages=1200
Nov 12 12:38:11 node002 kernel: block drbd1: size = 150 GB (157281564 KB)
Nov 12 12:38:11 node002 kernel: block drbd1: recounting of set bits took additional 7 jiffies
Nov 12 12:38:11 node002 kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Nov 12 12:38:11 node002 kernel: block drbd1: disk( Attaching -> UpToDate )
Nov 12 12:38:11 node002 kernel: block drbd1: attached to UUIDs 01FA7FA3D219A8B4:0000000000000000:0FAB8E4B66817420:0FAA8E4B66817421
Nov 12 12:38:11 node002 kernel: drbd homedata: conn( StandAlone -> Unconnected )
Nov 12 12:38:11 node002 kernel: drbd homedata: Starting receiver thread (from drbd_w_homedata [26753])
Nov 12 12:38:11 node002 kernel: drbd homedata: receiver (re)started
Nov 12 12:38:11 node002 kernel: drbd homedata: conn( Unconnected -> WFConnection )
Nov 12 12:38:11 node002 attrd[26577]:   notice: attrd_trigger_update: Sending flush op to all hosts for: master-HomeData (1000)
Nov 12 12:38:11 node002 attrd[26577]:   notice: attrd_perform_update: Sent update 17: master-HomeData=1000
Nov 12 12:38:11 node002 crmd[26579]:   notice: process_lrm_event: Operation HomeData_start_0: ok (node=node002, call=18, rc=0, cib-update=12, confirmed=true)
Nov 12 12:38:11 node002 crmd[26579]:   notice: process_lrm_event: Operation HomeData_notify_0: ok (node=node002, call=19, rc=0, cib-update=0, confirmed=true)
Nov 12 12:38:11 node002 kernel: drbd homedata: Handshake successful: Agreed network protocol version 101
Nov 12 12:38:11 node002 kernel: drbd homedata: Agreed to support TRIM on protocol level
Nov 12 12:38:11 node002 kernel: drbd homedata: Peer authenticated using 20 bytes HMAC
Nov 12 12:38:11 node002 kernel: drbd homedata: conn( WFConnection -> WFReportParams )
Nov 12 12:38:11 node002 kernel: drbd homedata: Starting asender thread (from drbd_r_homedata [26764])
Nov 12 12:38:11 node002 kernel: block drbd1: drbd_sync_handshake:
Nov 12 12:38:11 node002 kernel: block drbd1: self 01FA7FA3D219A8B4:0000000000000000:0FAB8E4B66817420:0FAA8E4B66817421 bits:0 flags:0
Nov 12 12:38:11 node002 kernel: block drbd1: peer 4499ABF2AAE91DF2:01FA7FA3D219A8B5:0FAB8E4B66817421:0FAA8E4B66817421 bits:0 flags:0
Nov 12 12:38:11 node002 kernel: block drbd1: uuid_compare()=-1 by rule 50
Nov 12 12:38:11 node002 kernel: block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( DUnknown -> UpToDate )
Nov 12 12:38:11 node002 kernel: block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
Nov 12 12:38:11 node002 kernel: block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
Nov 12 12:38:11 node002 kernel: block drbd1: conn( WFBitMapT -> WFSyncUUID )
Nov 12 12:38:11 node002 kernel: block drbd1: updated sync uuid 01FB7FA3D219A8B4:0000000000000000:0FAB8E4B66817420:0FAA8E4B66817421
Nov 12 12:38:11 node002 kernel: block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1
Nov 12 12:38:11 node002 kernel: block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1 exit code 0 (0x0)
Nov 12 12:38:11 node002 kernel: block drbd1: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
Nov 12 12:38:11 node002 kernel: block drbd1: Began resync as SyncTarget (will sync 0 KB [0 bits set]).
Nov 12 12:38:11 node002 kernel: block drbd1: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
Nov 12 12:38:11 node002 kernel: block drbd1: updated UUIDs 4499ABF2AAE91DF2:0000000000000000:01FB7FA3D219A8B4:01FA7FA3D219A8B5
Nov 12 12:38:11 node002 kernel: block drbd1: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Nov 12 12:38:11 node002 kernel: block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1
Nov 12 12:38:11 node002 crm-unfence-peer.sh[26804]: invoked for homedata
Nov 12 12:38:11 node002 kernel: block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1 exit code 0 (0x0)
Nov 12 12:38:12 node002 crmd[26579]:   notice: process_lrm_event: Operation dlm_stop_0: ok (node=node002, call=17, rc=0, cib-update=13, confirmed=true)
Nov 12 12:38:13 node002 crmd[26579]:   notice: process_lrm_event: Operation HomeData_notify_0: ok (node=node002, call=20, rc=0, cib-update=0, confirmed=true)
Nov 12 12:38:13 node002 kernel: block drbd1: peer( Secondary -> Primary )
Nov 12 12:38:13 node002 kernel: block drbd1: role( Secondary -> Primary )
Nov 12 12:38:13 node002 crmd[26579]:   notice: process_lrm_event: Operation HomeData_promote_0: ok (node=node002, call=21, rc=0, cib-update=14, confirmed=true)
Nov 12 12:38:13 node002 attrd[26577]:   notice: attrd_trigger_update: Sending flush op to all hosts for: master-HomeData (10000)
Nov 12 12:38:13 node002 attrd[26577]:   notice: attrd_perform_update: Sent update 23: master-HomeData=10000
Nov 12 12:38:13 node002 crmd[26579]:   notice: process_lrm_event: Operation HomeData_notify_0: ok (node=node002, call=22, rc=0, cib-update=0, confirmed=true)

So based on the result above, then I tried to add constraint order to start HomeDataClone then start dlm-clone while still disabling HomeFS-clone. The result is drbd startup both connected  and become primary but it seems dlm-clone cannot be started after HomeDataClone and that also caused dlm-clone service to be stopped and not running.

Master/Slave Set: HomeDataClone [HomeData]
     Masters: [ node001 node002 ]
Clone Set: HomeFS-clone [HomeFS]
     Stopped: [ node001 node002 ]
Clone Set: dlm-clone [dlm]
     Stopped: [ node001 node002 ]

Failed actions:
    dlm_start_0 on node001 'not configured' (6): call=21, status=complete, last-rc-change='Wed Nov 12 12:46:42 2014', queued=0ms, exec=59ms
    dlm_start_0 on node002 'not configured' (6): call=21, status=complete, last-rc-change='Wed Nov 12 12:46:40 2014', queued=0ms, exec=69ms

Please help to answer some of my questions

1.       Why pacemaker controld failed if I set the constraint order to start drbd resource first then after that start controld?

2.       Is there a configuration to enable drbd promote delay? For example after start drbd service then wait before promoting the resource to primary

So right now, I am stuck on getting cluster running and managed all the resources without any error at all. My temporary solution is to start the drbd service manually first before using pcs to start up the cluster. Of course this is not the best practices so I would like to ask any advice or feedback to fix this issue.

Thanks before for any hint/advice.

___________________________________________________________________
This email is intended for the designated recipient(s) only, and may be confidential, non-public, proprietary, protected by the attorney/client or other privilege. Unauthorized reading, distribution, copying or other use of this communication is prohibited and may be unlawful. Receipt by anyone other than the intended recipient(s) should not be deemed a waiver of any privilege or protection. If you are not the intended recipient or if you believe that you have received this email in error, please notify the sender immediately and delete all copies from your computer system without reading, saving, or using it in any manner. Although it has been checked for viruses and other malicious software ("malware"), we do not warrant, represent or guarantee in any way that this communication is free of malware or potentially damaging defects. All liability for any actual or alleged loss, damage, or injury arising out of or resulting in any way from the receipt, opening or use of this email is expressly disclaimed.
______________________________________________________________________
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://oss.clusterlabs.org/pipermail/pacemaker/attachments/20141112/77469fd8/attachment-0001.html>