[Pacemaker] Trying to figure out a constraint
Digimer
lists at alteeve.ca
Thu Jun 19 04:16:54 UTC 2014
On 19/06/14 12:06 AM, Digimer wrote:
<snip>
>
> After sending this, I found that adding:
>
> handlers {
> fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
> after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
> }
>
> allowed the constraint to be removed, so node 2 (an-a04n02) eventually
> promoted, but not before going into the failed state shown
> above.
>
> A subsequent stop -> start of pacemaker on both nodes started cleanly, with
> no fence action reported in /var/log/messages. I noticed this time that the
> drbd module was already loaded; not sure if that made a difference.
>
> Will keep testing... Any insight is much appreciated.
Ok, that didn't help... It's still resource-fencing on start *most* (not
all) of the time.
When I start pacemaker, and pacemaker starts DRBD (nearly simultaneously
on both nodes), I see this:
====
Jun 19 00:14:22 an-a04n01 crmd[16895]: notice: do_state_transition:
State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
cause=C_FSA_INTERNAL origin=do_election_check ]
Jun 19 00:14:22 an-a04n01 attrd[16893]: notice: attrd_local_callback:
Sending full refresh (origin=crmd)
Jun 19 00:14:22 an-a04n01 pengine[16894]: notice: unpack_config: On
loss of CCM Quorum: Ignore
Jun 19 00:14:22 an-a04n01 pengine[16894]: notice: LogActions: Start
fence_n01_ipmi#011(an-a04n01.alteeve.ca)
Jun 19 00:14:22 an-a04n01 pengine[16894]: notice: LogActions: Start
fence_n02_ipmi#011(an-a04n02.alteeve.ca)
Jun 19 00:14:22 an-a04n01 pengine[16894]: notice: LogActions: Start
drbd_r0:0#011(an-a04n01.alteeve.ca)
Jun 19 00:14:22 an-a04n01 pengine[16894]: notice: LogActions: Start
drbd_r0:1#011(an-a04n02.alteeve.ca)
Jun 19 00:14:22 an-a04n01 pengine[16894]: notice: process_pe_message:
Calculated Transition 0: /var/lib/pacemaker/pengine/pe-input-230.bz2
Jun 19 00:14:22 an-a04n01 crmd[16895]: notice: te_rsc_command:
Initiating action 8: monitor fence_n01_ipmi_monitor_0 on
an-a04n02.alteeve.ca
Jun 19 00:14:22 an-a04n01 crmd[16895]: notice: te_rsc_command:
Initiating action 4: monitor fence_n01_ipmi_monitor_0 on
an-a04n01.alteeve.ca (local)
Jun 19 00:14:22 an-a04n01 crmd[16895]: notice: te_rsc_command:
Initiating action 9: monitor fence_n02_ipmi_monitor_0 on
an-a04n02.alteeve.ca
Jun 19 00:14:22 an-a04n01 crmd[16895]: notice: te_rsc_command:
Initiating action 5: monitor fence_n02_ipmi_monitor_0 on
an-a04n01.alteeve.ca (local)
Jun 19 00:14:22 an-a04n01 crmd[16895]: notice: te_rsc_command:
Initiating action 6: monitor drbd_r0:0_monitor_0 on an-a04n01.alteeve.ca
(local)
Jun 19 00:14:22 an-a04n01 crmd[16895]: notice: te_rsc_command:
Initiating action 10: monitor drbd_r0:1_monitor_0 on an-a04n02.alteeve.ca
Jun 19 00:14:23 an-a04n01 crmd[16895]: notice: process_lrm_event: LRM
operation drbd_r0_monitor_0 (call=14, rc=7, cib-update=28,
confirmed=true) not running
Jun 19 00:14:23 an-a04n01 crmd[16895]: notice: process_lrm_event:
an-a04n01.alteeve.ca-drbd_r0_monitor_0:14 [ \n ]
Jun 19 00:14:23 an-a04n01 crmd[16895]: notice: te_rsc_command:
Initiating action 3: probe_complete probe_complete on
an-a04n01.alteeve.ca (local) - no waiting
Jun 19 00:14:23 an-a04n01 attrd[16893]: notice: attrd_trigger_update:
Sending flush op to all hosts for: probe_complete (true)
Jun 19 00:14:23 an-a04n01 attrd[16893]: notice: attrd_perform_update:
Sent update 4: probe_complete=true
Jun 19 00:14:23 an-a04n01 crmd[16895]: notice: te_rsc_command:
Initiating action 7: probe_complete probe_complete on
an-a04n02.alteeve.ca - no waiting
Jun 19 00:14:23 an-a04n01 crmd[16895]: notice: te_rsc_command:
Initiating action 11: start fence_n01_ipmi_start_0 on
an-a04n01.alteeve.ca (local)
Jun 19 00:14:23 an-a04n01 crmd[16895]: notice: te_rsc_command:
Initiating action 13: start fence_n02_ipmi_start_0 on an-a04n02.alteeve.ca
Jun 19 00:14:23 an-a04n01 crmd[16895]: notice: te_rsc_command:
Initiating action 15: start drbd_r0:0_start_0 on an-a04n01.alteeve.ca
(local)
Jun 19 00:14:24 an-a04n01 stonith-ng[16891]: notice:
stonith_device_register: Device 'fence_n01_ipmi' already existed in
device list (2 active devices)
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: te_rsc_command:
Initiating action 17: start drbd_r0:1_start_0 on an-a04n02.alteeve.ca
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: process_lrm_event: LRM
operation fence_n01_ipmi_start_0 (call=19, rc=0, cib-update=29,
confirmed=true) ok
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: te_rsc_command:
Initiating action 12: monitor fence_n01_ipmi_monitor_60000 on
an-a04n01.alteeve.ca (local)
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: te_rsc_command:
Initiating action 14: monitor fence_n02_ipmi_monitor_60000 on
an-a04n02.alteeve.ca
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: process_lrm_event: LRM
operation fence_n01_ipmi_monitor_60000 (call=24, rc=0, cib-update=30,
confirmed=false) ok
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: Starting worker thread
(from cqueue [3265])
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: disk( Diskless ->
Attaching )
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: Found 4 transactions (126
active extents) in activity log.
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: Method to ensure write
ordering: flush
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: drbd_bm_resize called
with capacity == 909525832
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: resync bitmap:
bits=113690729 words=1776418 pages=3470
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: size = 434 GB (454762916 KB)
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: bitmap READ of 3470 pages
took 8 jiffies
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: recounting of set bits
took additional 16 jiffies
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: 0 KB (0 bits) marked
out-of-sync by on disk bit-map.
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: disk( Attaching ->
Consistent )
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: attached to UUIDs
561F3328043888C0:0000000000000000:052A1A6B59936EC5:05291A6B59936EC5
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: conn( StandAlone ->
Unconnected )
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: Starting receiver thread
(from drbd0_worker [17045])
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: receiver (re)started
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: conn( Unconnected ->
WFConnection )
Jun 19 00:14:24 an-a04n01 attrd[16893]: notice: attrd_trigger_update:
Sending flush op to all hosts for: master-drbd_r0 (5)
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: process_lrm_event: LRM
operation drbd_r0_start_0 (call=21, rc=0, cib-update=31, confirmed=true) ok
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: te_rsc_command:
Initiating action 48: notify drbd_r0:0_post_notify_start_0 on
an-a04n01.alteeve.ca (local)
Jun 19 00:14:24 an-a04n01 attrd[16893]: notice: attrd_perform_update:
Sent update 9: master-drbd_r0=5
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: te_rsc_command:
Initiating action 49: notify drbd_r0:1_post_notify_start_0 on
an-a04n02.alteeve.ca
Jun 19 00:14:24 an-a04n01 attrd[16893]: notice: attrd_perform_update:
Sent update 11: master-drbd_r0=5
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: process_lrm_event: LRM
operation drbd_r0_notify_0 (call=28, rc=0, cib-update=0, confirmed=true) ok
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: run_graph: Transition 0
(Complete=23, Pending=0, Fired=0, Skipped=2, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-230.bz2): Stopped
Jun 19 00:14:24 an-a04n01 pengine[16894]: notice: unpack_config: On
loss of CCM Quorum: Ignore
Jun 19 00:14:24 an-a04n01 pengine[16894]: notice: LogActions: Promote
drbd_r0:0#011(Slave -> Master an-a04n01.alteeve.ca)
Jun 19 00:14:24 an-a04n01 pengine[16894]: notice: LogActions: Promote
drbd_r0:1#011(Slave -> Master an-a04n02.alteeve.ca)
Jun 19 00:14:24 an-a04n01 pengine[16894]: notice: process_pe_message:
Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-231.bz2
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: te_rsc_command:
Initiating action 52: notify drbd_r0_pre_notify_promote_0 on
an-a04n01.alteeve.ca (local)
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: te_rsc_command:
Initiating action 54: notify drbd_r0_pre_notify_promote_0 on
an-a04n02.alteeve.ca
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: process_lrm_event: LRM
operation drbd_r0_notify_0 (call=31, rc=0, cib-update=0, confirmed=true) ok
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: te_rsc_command:
Initiating action 13: promote drbd_r0_promote_0 on an-a04n01.alteeve.ca
(local)
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: te_rsc_command:
Initiating action 16: promote drbd_r0_promote_0 on an-a04n02.alteeve.ca
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: helper command:
/sbin/drbdadm fence-peer minor-0
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: Handshake successful:
Agreed network protocol version 97
Jun 19 00:14:25 an-a04n01 crm-fence-peer.sh[17156]: invoked for r0
Jun 19 00:14:25 an-a04n01 cibadmin[17188]: notice: crm_log_args:
Invoked: cibadmin -C -o constraints -X <rsc_location rsc="drbd_r0_Clone"
id="drbd-fence-by-handler-r0-drbd_r0_Clone">#012 <rule role="Master"
score="-INFINITY" id="drbd-fence-by-handler-r0-rule-drbd_r0_Clone">#012
<expression attribute="#uname" operation="ne"
value="an-a04n01.alteeve.ca"
id="drbd-fence-by-handler-r0-expr-drbd_r0_Clone"/>#012
</rule>#012</rsc_location>
Jun 19 00:14:25 an-a04n01 crmd[16895]: notice: handle_request: Current
ping state: S_TRANSITION_ENGINE
Jun 19 00:14:25 an-a04n01 cib[16890]: notice: cib:diff: Diff: --- 0.94.19
Jun 19 00:14:25 an-a04n01 cib[16890]: notice: cib:diff: Diff: +++
0.95.1 4f095b8add6dcbb173de1254bf02fcf6
Jun 19 00:14:25 an-a04n01 cib[16890]: notice: cib:diff: -- <cib
admin_epoch="0" epoch="94" num_updates="19"/>
Jun 19 00:14:25 an-a04n01 cib[16890]: notice: cib:diff: ++
<rsc_location rsc="drbd_r0_Clone"
id="drbd-fence-by-handler-r0-drbd_r0_Clone">
Jun 19 00:14:25 an-a04n01 cib[16890]: notice: cib:diff: ++
<rule role="Master" score="-INFINITY"
id="drbd-fence-by-handler-r0-rule-drbd_r0_Clone">
Jun 19 00:14:25 an-a04n01 cib[16890]: notice: cib:diff: ++
<expression attribute="#uname" operation="ne"
value="an-a04n01.alteeve.ca"
id="drbd-fence-by-handler-r0-expr-drbd_r0_Clone"/>
Jun 19 00:14:25 an-a04n01 cib[16890]: notice: cib:diff: ++ </rule>
Jun 19 00:14:25 an-a04n01 cib[16890]: notice: cib:diff: ++
</rsc_location>
Jun 19 00:14:25 an-a04n01 stonith-ng[16891]: notice: unpack_config: On
loss of CCM Quorum: Ignore
Jun 19 00:14:25 an-a04n01 crm-fence-peer.sh[17156]: INFO peer is
reachable, my disk is Consistent: placed constraint
'drbd-fence-by-handler-r0-drbd_r0_Clone'
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: helper command:
/sbin/drbdadm fence-peer minor-0 exit code 4 (0x400)
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: fence-peer helper
returned 4 (peer was fenced)
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: role( Secondary ->
Primary ) disk( Consistent -> UpToDate ) pdsk( DUnknown -> Outdated )
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: new current UUID
25DF173CF8D89023:561F3328043888C0:052A1A6B59936EC5:05291A6B59936EC5
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: conn( WFConnection ->
WFReportParams )
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: Starting asender thread
(from drbd0_receiver [17062])
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: data-integrity-alg:
<not-used>
Jun 19 00:14:25 an-a04n01 stonith-ng[16891]: notice:
stonith_device_register: Device 'fence_n01_ipmi' already existed in
device list (2 active devices)
Jun 19 00:14:25 an-a04n01 cib[16890]: warning: update_results: Action
cib_create failed: Name not unique on network (cde=-76)
Jun 19 00:14:25 an-a04n01 cib[16890]: error: cib_process_create: CIB
Update failures <failed>
Jun 19 00:14:25 an-a04n01 cib[16890]: error: cib_process_create: CIB
Update failures <failed_update
id="drbd-fence-by-handler-r0-drbd_r0_Clone" object_type="rsc_location"
operation="cib_create" reason="Name not unique on network">
Jun 19 00:14:25 an-a04n01 cib[16890]: error: cib_process_create: CIB
Update failures <rsc_location rsc="drbd_r0_Clone"
id="drbd-fence-by-handler-r0-drbd_r0_Clone">
Jun 19 00:14:25 an-a04n01 cib[16890]: error: cib_process_create: CIB
Update failures <rule role="Master" score="-INFINITY"
id="drbd-fence-by-handler-r0-rule-drbd_r0_Clone">
Jun 19 00:14:25 an-a04n01 cib[16890]: error: cib_process_create: CIB
Update failures <expression attribute="#uname" operation="ne"
value="an-a04n02.alteeve.ca"
id="drbd-fence-by-handler-r0-expr-drbd_r0_Clone"/>
Jun 19 00:14:25 an-a04n01 cib[16890]: error: cib_process_create: CIB
Update failures </rule>
Jun 19 00:14:25 an-a04n01 cib[16890]: error: cib_process_create: CIB
Update failures </rsc_location>
Jun 19 00:14:25 an-a04n01 cib[16890]: error: cib_process_create: CIB
Update failures </failed_update>
Jun 19 00:14:25 an-a04n01 cib[16890]: error: cib_process_create: CIB
Update failures </failed>
Jun 19 00:14:25 an-a04n01 cib[16890]: warning: cib_process_request:
Completed cib_create operation for section constraints: Name not unique
on network (rc=-76, origin=an-a04n02.alteeve.ca/cibadmin/2, version=0.95.1)
Jun 19 00:14:25 an-a04n01 stonith-ng[16891]: notice:
stonith_device_register: Added 'fence_n02_ipmi' to the device list (2
active devices)
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: drbd_sync_handshake:
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: self
25DF173CF8D89023:561F3328043888C0:052A1A6B59936EC5:05291A6B59936EC5
bits:0 flags:0
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: peer
561F3328043888C0:0000000000000000:052A1A6B59936EC4:05291A6B59936EC5
bits:0 flags:0
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: uuid_compare()=1 by rule 70
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: peer( Unknown ->
Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated ->
Consistent )
Jun 19 00:14:25 an-a04n01 crmd[16895]: notice: process_lrm_event: LRM
operation drbd_r0_promote_0 (call=34, rc=0, cib-update=33,
confirmed=true) ok
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: helper command:
/sbin/drbdadm before-resync-source minor-0
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: helper command:
/sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: conn( WFBitMapS ->
SyncSource ) pdsk( Consistent -> Inconsistent )
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: Began resync as
SyncSource (will sync 0 KB [0 bits set]).
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: updated sync UUID
25DF173CF8D89023:56203328043888C0:561F3328043888C0:052A1A6B59936EC5
Jun 19 00:14:26 an-a04n01 cib[16890]: notice: cib:diff: Diff: --- 0.95.2
Jun 19 00:14:26 an-a04n01 cib[16890]: notice: cib:diff: Diff: +++
0.96.1 86f147e11a7e9934f7b2a686715dcca6
Jun 19 00:14:26 an-a04n01 cib[16890]: notice: cib:diff: --
<rsc_location rsc="drbd_r0_Clone"
id="drbd-fence-by-handler-r0-drbd_r0_Clone">
Jun 19 00:14:26 an-a04n01 cib[16890]: notice: cib:diff: --
<rule role="Master" score="-INFINITY"
id="drbd-fence-by-handler-r0-rule-drbd_r0_Clone">
Jun 19 00:14:26 an-a04n01 cib[16890]: notice: cib:diff: --
<expression attribute="#uname" operation="ne"
value="an-a04n01.alteeve.ca"
id="drbd-fence-by-handler-r0-expr-drbd_r0_Clone"/>
Jun 19 00:14:26 an-a04n01 cib[16890]: notice: cib:diff: -- </rule>
Jun 19 00:14:26 an-a04n01 cib[16890]: notice: cib:diff: --
</rsc_location>
Jun 19 00:14:26 an-a04n01 cib[16890]: notice: cib:diff: ++ <cib
admin_epoch="0" cib-last-written="Thu Jun 19 00:14:26 2014"
crm_feature_set="3.0.7" epoch="96" have-quorum="1" num_updates="1"
update-client="cibadmin" update-origin="an-a04n02.alteeve.ca"
validate-with="pacemaker-1.2" dc-uuid="an-a04n01.alteeve.ca"/>
Jun 19 00:14:26 an-a04n01 stonith-ng[16891]: notice: unpack_config: On
loss of CCM Quorum: Ignore
Jun 19 00:14:26 an-a04n01 stonith-ng[16891]: notice:
stonith_device_register: Device 'fence_n01_ipmi' already existed in
device list (2 active devices)
Jun 19 00:14:26 an-a04n01 stonith-ng[16891]: notice:
stonith_device_register: Added 'fence_n02_ipmi' to the device list (2
active devices)
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: Resync done (total 1 sec;
paused 0 sec; 0 K/sec)
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: updated UUIDs
25DF173CF8D89023:0000000000000000:56203328043888C0:561F3328043888C0
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: conn( SyncSource ->
Connected ) pdsk( Inconsistent -> UpToDate )
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: bitmap WRITE of 3470
pages took 9 jiffies
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: 0 KB (0 bits) marked
out-of-sync by on disk bit-map.
====
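For anyone reading the constraint in the log above: the `#011` and `#012` sequences appear to be syslog's octal escapes for tab and newline. Below is a minimal Python sketch, assuming that escaping convention, that recovers the XML passed to `cibadmin` and works out which nodes the rule bars from the Master role. The helper names (`unescape_syslog`, `banned_from_master`) are made up for illustration, and the evaluation only handles the single `#uname ne <node>` expression seen here, not Pacemaker rules in general.

```python
import re
import xml.etree.ElementTree as ET


def unescape_syslog(text):
    """Replace #NNN octal escapes (e.g. #012 newline, #011 tab) with the real character."""
    return re.sub(r"#([0-7]{3})", lambda m: chr(int(m.group(1), 8)), text)


# The -X argument from the cibadmin log entry, re-joined onto one line.
LOGGED = (
    '<rsc_location rsc="drbd_r0_Clone" id="drbd-fence-by-handler-r0-drbd_r0_Clone">#012 '
    '<rule role="Master" score="-INFINITY" id="drbd-fence-by-handler-r0-rule-drbd_r0_Clone">#012 '
    '<expression attribute="#uname" operation="ne" value="an-a04n01.alteeve.ca" '
    'id="drbd-fence-by-handler-r0-expr-drbd_r0_Clone"/>#012 '
    '</rule>#012</rsc_location>'
)


def banned_from_master(constraint_xml, nodes):
    """Return the nodes that a '#uname ne X' rule with score -INFINITY keeps out of Master."""
    root = ET.fromstring(constraint_xml)
    banned = set()
    for rule in root.iter("rule"):
        if rule.get("role") != "Master" or rule.get("score") != "-INFINITY":
            continue
        for expr in rule.iter("expression"):
            # Simplification: only the one expression form used by this constraint.
            if expr.get("attribute") == "#uname" and expr.get("operation") == "ne":
                banned.update(n for n in nodes if n != expr.get("value"))
    return sorted(banned)


NODES = ["an-a04n01.alteeve.ca", "an-a04n02.alteeve.ca"]
xml_text = unescape_syslog(LOGGED)
print(xml_text)
print(banned_from_master(xml_text, NODES))
```

Read this way, the constraint placed by an-a04n01 allows only an-a04n01 to be Master. The failed `cib_create` that arrived from an-a04n02 carried the same id but named an-a04n02 instead.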
It seems to immediately fence as soon as DRBD starts, and I can't see
why it feels the need to do this...
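To make the ordering easier to follow, here is a rough Python sketch that pulls the DRBD state transitions and the fence-peer/handshake messages out of the kernel lines. The sample lines are copied from the log above (wrapped lines re-joined); the function name `drbd_events` and the choice of which messages to keep are my own assumptions, not anything DRBD provides.

```python
import re

# Matches DRBD state changes such as "conn( Unconnected -> WFConnection )".
TRANSITION = re.compile(r"(\w+)\( (\w+) -> (\w+) \)")

SAMPLE = [
    "Jun 19 00:14:24 an-a04n01 kernel: block drbd0: conn( Unconnected -> WFConnection )",
    "Jun 19 00:14:24 an-a04n01 kernel: block drbd0: helper command: /sbin/drbdadm fence-peer minor-0",
    "Jun 19 00:14:25 an-a04n01 kernel: block drbd0: Handshake successful: Agreed network protocol version 97",
    "Jun 19 00:14:25 an-a04n01 kernel: block drbd0: role( Secondary -> Primary ) "
    "disk( Consistent -> UpToDate ) pdsk( DUnknown -> Outdated )",
    "Jun 19 00:14:25 an-a04n01 kernel: block drbd0: conn( WFConnection -> WFReportParams )",
]


def drbd_events(lines):
    """Return (timestamp, description) tuples for DRBD kernel messages of interest."""
    events = []
    for line in lines:
        if "kernel: block drbd" not in line:
            continue
        stamp = line[:15]                      # e.g. "Jun 19 00:14:24"
        message = line.split(": ", 2)[-1]      # text after "kernel: block drbdN: "
        found = TRANSITION.findall(message)
        if found:
            for field, old, new in found:
                events.append((stamp, f"{field}: {old} -> {new}"))
        elif "fence-peer" in message or "Handshake" in message:
            events.append((stamp, message))
    return events


for stamp, description in drbd_events(SAMPLE):
    print(stamp, description)
```

In this excerpt the fence-peer helper is invoked while the connection is still in WFConnection, about a second before the handshake completes and while the peer disk state is still DUnknown. Whether that ordering is what triggers the fence is not something the log settles on its own.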
RHEL 6.5, DRBD 8.3.16.
I am really stumped... any help would be much appreciated!
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?