[Pacemaker] Failure after intermittent network outage
Pavel Levshin
pavel at levshin.spb.ru
Sun Mar 6 21:05:04 UTC 2011
Hi everyone.
We have a three-node cluster using Pacemaker 1.0.10 on RHEL5.5. Two nodes
(wapgw1-1 and wapgw1-2) are configured for DRBD and run virtual
machines over it. The third node (wapgw1-log) is mostly a quorum server,
i.e. it has neither libvirtd nor DRBD installed. There are location
constraints which allow resources to run on the "real" nodes only.
All three nodes are connected to the network over bonded links in
active-backup mode.
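To illustrate, the constraints are roughly of this form (a minimal crm shell
sketch; the actual constraint names and scores in our CIB differ, and only
one VM is shown):

  # illustrative only: keep a clustered resource off the quorum node
  location l-vm-mdirect1-1-real-only vm-mdirect1-1 -inf: wapgw1-log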
STONITH had been configured but was unavailable at the moment. That's bad, I know.
The problem came when one of the two interfaces on the quorum node
(wapgw1-log) went down. It was not the first time this had happened, and
previously it had not caused any harm.
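For the record, the bond itself is an ordinary RHEL5 active-backup setup,
roughly like the sketch below (option values such as miimon are illustrative
and may not match the real hosts):

  # /etc/sysconfig/network-scripts/ifcfg-bond0 (sketch)
  DEVICE=bond0
  ONBOOT=yes
  BOOTPROTO=static
  BONDING_OPTS="mode=active-backup miimon=100"
  # eth0 and eth1 have MASTER=bond0 and SLAVE=yes in their own ifcfg files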
Corosync lost connectivity, and the cluster split into partitions.
Mar 1 11:15:58 wapgw1-log corosync[24536]: [TOTEM ] A processor
failed, forming new configuration.
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] notice:
pcmk_peer_update: Transitional membership event on ring 3500: memb=1,
new=0, lost=2
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
pcmk_peer_update: memb: wapgw1-log 813454090
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
pcmk_peer_update: lost: wapgw1-1 1098666762
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
pcmk_peer_update: lost: wapgw1-2 1115443978
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] notice:
pcmk_peer_update: Stable membership event on ring 3500: memb=1, new=0,
lost=0
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
pcmk_peer_update: MEMB: wapgw1-log 813454090
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
ais_mark_unseen_peer_dead: Node wapgw1-2 was not seen in the previous
transition
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
update_member: Node 1115443978/wapgw1-2 is now: lost
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
ais_mark_unseen_peer_dead: Node wapgw1-1 was not seen in the previous
transition
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
update_member: Node 1098666762/wapgw1-1 is now: lost
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
send_member_notification: Sending membership update 3500 to 2 children
Mar 1 11:15:59 wapgw1-log crmd: [24547]: notice: ais_dispatch:
Membership 3500: quorum lost
Mar 1 11:15:59 wapgw1-log corosync[24536]: [TOTEM ] A processor
joined or left the membership and a new membership was formed.
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: crm_update_peer: Node
wapgw1-2: id=1115443978 state=lost (new) addr=r(0) ip(10.83.124.66)
votes=1 born=3400 seen=3496 proc=00000000000000000000000000013312
Mar 1 11:15:59 wapgw1-log cib: [24543]: notice: ais_dispatch:
Membership 3500: quorum lost
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: crm_update_peer: Node
wapgw1-1: id=1098666762 state=lost (new) addr=r(0) ip(10.83.124.65)
votes=1 born=3404 seen=3496 proc=00000000000000000000000000013312
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: crm_update_peer: Node
wapgw1-2: id=1115443978 state=lost (new) addr=r(0) ip(10.83.124.66)
votes=1 born=3400 seen=3496 proc=00000000000000000000000000013312
Mar 1 11:15:59 wapgw1-log crmd: [24547]: WARN: check_dead_member: Our
DC node (wapgw1-2) left the cluster
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: do_state_transition:
State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION
cause=C_FSA_INTERNAL origin=check_dead_member ]
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: crm_update_peer: Node
wapgw1-1: id=1098666762 state=lost (new) addr=r(0) ip(10.83.124.65)
votes=1 born=3404 seen=3496 proc=00000000000000000000000000013312
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: update_dc: Unset DC wapgw1-2
Mar 1 11:15:59 wapgw1-log corosync[24536]: [MAIN ] Completed service
synchronization, ready to provide service.
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: do_state_transition:
State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
cause=C_FSA_INTERNAL origin=do_election_check ]
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: do_te_control:
Registering TE UUID: 1be865f6-557d-45c4-b549-c10dbab5acc4
Mar 1 11:15:59 wapgw1-log crmd: [24547]: WARN:
cib_client_add_notify_callback: Callback already present
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: set_graph_functions:
Setting custom graph functions
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: unpack_graph: Unpacked
transition -1: 0 actions in 0 synapses
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: do_dc_takeover: Taking
over DC status for this partition
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_readwrite: We
are now in R/W mode
The DC node noticed the member loss:
Mar 1 11:15:59 wapgw1-2 pengine: [5748]: WARN: pe_fence_node: Node
wapgw1-log will be fenced because it is un-expectedly down
Mar 1 11:15:59 wapgw1-2 pengine: [5748]: info:
determine_online_status_fencing: ha_state=active, ccm_state=false,
crm_state=online, join_state=member, expected=member
Mar 1 11:15:59 wapgw1-2 pengine: [5748]: WARN: determine_online_status:
Node wapgw1-log is unclean
But fencing was unavailable, so the sad story continues.
Bonding switched to the backup link; it took a second. Back on the quorum node:
Mar 1 11:15:59 wapgw1-log kernel: bonding: bond0: link status
definitely down for interface eth0, disabling it
Mar 1 11:15:59 wapgw1-log kernel: bonding: bond0: making interface eth1
the new active one.
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_request:
Operation complete: op cib_modify for section crm_config
(origin=local/crmd/169, version=0.797.24): ok (rc=0)
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: crm_ais_dispatch:
Setting expected votes to 3
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: do_state_transition:
State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
cause=C_FSA_INTERNAL origin=check_join_state ]
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: do_state_transition: All
1 cluster nodes responded to the join offer.
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: do_dc_join_finalize:
join-5: Syncing the CIB from wapgw1-log to the rest of the cluster
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: te_connect_stonith:
Attempting connection to fencing daemon...
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_request:
Operation complete: op cib_modify for section crm_config
(origin=local/crmd/172, version=0.797.24): ok (rc=0)
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_request:
Operation complete: op cib_sync for section 'all'
(origin=local/crmd/173, version=0.797.24): ok (rc=0)
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] notice:
pcmk_peer_update: Transitional membership event on ring 3504: memb=1,
new=0, lost=0
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
pcmk_peer_update: memb: wapgw1-log 813454090
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] notice:
pcmk_peer_update: Stable membership event on ring 3504: memb=3, new=2,
lost=0
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
update_member: Node 1098666762/wapgw1-1 is now: member
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
pcmk_peer_update: NEW: wapgw1-1 1098666762
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
update_member: Node 1115443978/wapgw1-2 is now: member
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
pcmk_peer_update: NEW: wapgw1-2 1115443978
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
pcmk_peer_update: MEMB: wapgw1-log 813454090
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
pcmk_peer_update: MEMB: wapgw1-1 1098666762
Mar 1 11:15:59 wapgw1-log cib: [24543]: notice: ais_dispatch:
Membership 3504: quorum acquired
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
pcmk_peer_update: MEMB: wapgw1-2 1115443978
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
send_member_notification: Sending membership update 3504 to 2 children
Mar 1 11:15:59 wapgw1-log corosync[24536]: [TOTEM ] A processor
joined or left the membership and a new membership was formed.
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: crm_update_peer: Node
wapgw1-2: id=1115443978 state=member (new) addr=r(0) ip(10.83.124.66)
votes=1 born=3400 seen=3504 proc=00000000000000000000000000013312
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: crm_update_peer: Node
wapgw1-1: id=1098666762 state=member (new) addr=r(0) ip(10.83.124.65)
votes=1 born=3404 seen=3504 proc=00000000000000000000000000013312
Mar 1 11:15:59 wapgw1-log corosync[24536]: [MAIN ] Completed service
synchronization, ready to provide service.
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: crm_update_peer: Node
wapgw1-2: id=1115443978 state=member (new) addr=r(0) ip(10.83.124.66)
votes=1 born=3400 seen=3504 proc=00000000000000000000000000013312
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: crm_update_peer: Node
wapgw1-1: id=1098666762 state=member (new) addr=r(0) ip(10.83.124.65)
votes=1 born=3404 seen=3504 proc=00000000000000000000000000013312
Mar 1 11:15:59 wapgw1-log corosync[24536]: [MAIN ] Completed service
synchronization, ready to provide service.
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_diff: Diff
0.797.25 -> 0.797.26 not applied to 0.797.24: current "num_updates" is
less than required
Mar 1 11:15:59 wapgw1-log cib: [24543]: WARN: cib_server_process_diff:
Not requesting full refresh in slave mode.
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_diff: Diff
0.797.26 -> 0.797.27 not applied to 0.797.24: current "num_updates" is
less than required
Mar 1 11:15:59 wapgw1-log cib: [24543]: WARN: cib_server_process_diff:
Not requesting full refresh in slave mode.
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_diff: Diff
0.797.27 -> 0.797.28 not applied to 0.797.24: current "num_updates" is
less than required
Mar 1 11:15:59 wapgw1-log cib: [24543]: WARN: cib_server_process_diff:
Not requesting full refresh in slave mode.
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: te_connect_stonith:
Connected
Mar 1 11:16:00 wapgw1-log crmd: [24547]: notice: ais_dispatch:
Membership 3504: quorum acquired
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: ais_status_callback:
status: wapgw1-2 is now member (was lost)
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: crm_update_peer: Node
wapgw1-2: id=1115443978 state=member (new) addr=r(0) ip(10.83.124.66)
votes=1 born=3400 seen=3504 proc=00000000000000000000000000013312
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: ais_status_callback:
status: wapgw1-1 is now member (was lost)
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: crm_update_peer: Node
wapgw1-1: id=1098666762 state=member (new) addr=r(0) ip(10.83.124.65)
votes=1 born=3404 seen=3504 proc=00000000000000000000000000013312
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: crm_update_quorum:
Updating quorum status to true (call=181)
Mar 1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request:
Operation complete: op cib_modify for section nodes
(origin=local/crmd/174, version=0.797.24): ok (rc=0)
Mar 1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request:
Operation complete: op cib_delete for section
//node_state[@uname='wapgw1-2']/lrm (origin=local/crmd/175,
version=0.797.25): ok (rc=0)
Mar 1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request:
Operation complete: op cib_delete for section
//node_state[@uname='wapgw1-2']/transient_attributes
(origin=local/crmd/176, version=0.797.26): ok (rc=0)
Mar 1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request:
Operation complete: op cib_delete for section
//node_state[@uname='wapgw1-1']/lrm (origin=local/crmd/177,
version=0.797.27): ok (rc=0)
Mar 1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request:
Operation complete: op cib_delete for section
//node_state[@uname='wapgw1-1']/transient_attributes
(origin=local/crmd/178, version=0.797.28): ok (rc=0)
Mar 1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request:
Operation complete: op cib_modify for section nodes
(origin=local/crmd/179, version=0.797.28): ok (rc=0)
Mar 1 11:16:00 wapgw1-log cib: [24543]: info: log_data_element:
cib:diff: - <cib dc-uuid="wapgw1-2" admin_epoch="0" epoch="797"
num_updates="28" />
Mar 1 11:16:00 wapgw1-log cib: [24543]: info: log_data_element:
cib:diff: + <cib dc-uuid="wapgw1-log" admin_epoch="0" epoch="798"
num_updates="1" />
Mar 1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request:
Operation complete: op cib_modify for section cib
(origin=local/crmd/181, version=0.798.1): ok (rc=0)
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: crm_ais_dispatch:
Setting expected votes to 3
Mar 1 11:16:00 wapgw1-log cib: [24721]: info: write_cib_contents:
Archived previous version as /var/lib/heartbeat/crm/cib-57.raw
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: erase_xpath_callback:
Deletion of "//node_state[@uname='wapgw1-2']/lrm": ok (rc=0)
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: erase_xpath_callback:
Deletion of "//node_state[@uname='wapgw1-2']/transient_attributes": ok
(rc=0)
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: erase_xpath_callback:
Deletion of "//node_state[@uname='wapgw1-1']/lrm": ok (rc=0)
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: erase_xpath_callback:
Deletion of "//node_state[@uname='wapgw1-1']/transient_attributes": ok
(rc=0)
Mar 1 11:16:00 wapgw1-log cib: [24721]: info: write_cib_contents: Wrote
version 0.798.0 of the CIB to disk (digest:
2cb8c5e90f8f3d227335340ddeb1914a)
DC node:
Mar 1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] notice:
pcmk_peer_update: Transitional membership event on ring 3504: memb=2,
new=0, lost=0
Mar 1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info:
pcmk_peer_update: memb: wapgw1-1 1098666762
Mar 1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info:
pcmk_peer_update: memb: wapgw1-2 1115443978
Mar 1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] notice:
pcmk_peer_update: Stable membership event on ring 3504: memb=3, new=1,
lost=0
Mar 1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info: update_member:
Node 813454090/wapgw1-log is now: member
Mar 1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info:
pcmk_peer_update: NEW: wapgw1-log 813454090
Mar 1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info:
pcmk_peer_update: MEMB: wapgw1-log 813454090
Mar 1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info:
pcmk_peer_update: MEMB: wapgw1-1 1098666762
Mar 1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info:
pcmk_peer_update: MEMB: wapgw1-2 1115443978
Mar 1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info:
send_member_notification: Sending membership update 3504 to 2 children
Mar 1 11:15:59 wapgw1-2 corosync[5738]: [TOTEM ] A processor joined
or left the membership and a new membership was formed.
Mar 1 11:15:59 wapgw1-2 cib: [5745]: info: ais_dispatch: Membership
3504: quorum retained
Mar 1 11:15:59 wapgw1-2 crmd: [5749]: info: ais_dispatch: Membership
3504: quorum retained
Mar 1 11:15:59 wapgw1-2 cib: [5745]: info: crm_update_peer: Node
wapgw1-log: id=813454090 state=member (new) addr=r(0) ip(10.83.124.48)
votes=1 born=3464 seen=3504 proc=00000000000000000000000000013312
Mar 1 11:15:59 wapgw1-2 crmd: [5749]: info: ais_status_callback:
status: wapgw1-log is now member (was lost)
Mar 1 11:15:59 wapgw1-2 crmd: [5749]: info: crm_update_peer: Node
wapgw1-log: id=813454090 state=member (new) addr=r(0) ip(10.83.124.48)
votes=1 born=3464 seen=3504 proc=00000000000000000000000000013312
Mar 1 11:15:59 wapgw1-2 corosync[5738]: [MAIN ] Completed service
synchronization, ready to provide service.
This time, the DC decided to check all resources on the rejoined node.
Mar 1 11:16:00 wapgw1-2 crmd: [5749]: info: do_state_transition: State
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=handle_response ]
Mar 1 11:16:00 wapgw1-2 crmd: [5749]: info: unpack_graph: Unpacked
transition 1353: 37 actions in 37 synapses
Mar 1 11:16:00 wapgw1-2 crmd: [5749]: info: do_te_invoke: Processing
graph 1353 (ref=pe_calc-dc-1298967360-1592) derived from
/var/lib/pengine/pe-input-1525.bz2
Mar 1 11:16:00 wapgw1-2 crmd: [5749]: info: te_rsc_command: Initiating
action 30: monitor ilo-wapgw1-1:0_monitor_0 on wapgw1-log
Mar 1 11:16:00 wapgw1-2 crmd: [5749]: info: te_rsc_command: Initiating
action 31: monitor ilo-wapgw1-2:0_monitor_0 on wapgw1-log
Mar 1 11:16:00 wapgw1-2 crmd: [5749]: info: te_rsc_command: Initiating
action 32: monitor ilo-wapgw1-log:0_monitor_0 on wapgw1-log
Mar 1 11:16:00 wapgw1-2 crmd: [5749]: info: te_rsc_command: Initiating
action 33: monitor p-drbd-mdirect1-1:0_monitor_0 on wapgw1-log
Mar 1 11:16:00 wapgw1-2 crmd: [5749]: info: te_rsc_command: Initiating
action 34: monitor p-drbd-mdirect1-2:0_monitor_0 on wapgw1-log
And then... 'Suddenly' another DC appeared.
Mar 1 11:16:00 wapgw1-2 crmd: [5749]: ERROR: crmd_ha_msg_filter:
Another DC detected: wapgw1-log (op=noop)
Mar 1 11:16:00 wapgw1-2 crmd: [5749]: info: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_ELECTION [ input=I_ELECTION
cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
Mar 1 11:16:00 wapgw1-2 crmd: [5749]: info: update_dc: Unset DC wapgw1-2
Mar 1 11:16:00 wapgw1-log crmd: [24547]: ERROR: crmd_ha_msg_filter:
Another DC detected: wapgw1-2 (op=noop)
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: do_state_transition:
State transition S_FINALIZE_JOIN -> S_ELECTION [ input=I_ELECTION
cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: update_dc: Unset DC
wapgw1-log
All the monitor actions failed like this:
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: do_lrm_rsc_op:
Performing key=33:1353:7:22dc5497-478f-49ff-b07f-9fcd6da325cd
op=p-drbd-mdirect1-1:0_monitor_0 )
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: do_lrm_rsc_op:
Discarding attempt to perform action monitor on p-drbd-mdirect1-1:0 in
state S_ELECTION
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: send_direct_ack: ACK'ing
resource op p-drbd-mdirect1-1:0_monitor_0 from
33:1353:7:22dc5497-478f-49ff-b07f-9fcd6da325cd:
lrm_invoke-lrmd-1298967360-58
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: process_te_message:
Processing (N)ACK lrm_invoke-lrmd-1298967360-58 from wapgw1-log
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: process_graph_event:
Action p-drbd-mdirect1-1:0_monitor_0/33
(4:99;33:1353:7:22dc5497-478f-49ff-b07f-9fcd6da325cd) initiated by a
different transitioner
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: abort_transition_graph:
process_graph_event:456 - Triggered transition abort (complete=1,
tag=lrm_rsc_op, id=p-drbd-mdirect1-1:0_monitor_0,
magic=4:99;33:1353:7:22dc5497-478f-49ff-b07f-9fcd6da325cd) : Foreign event
The old DC won the election.
Mar 1 11:16:00 wapgw1-2 crmd: [5749]: info: do_dc_takeover: Taking over
DC status for this partition
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: do_state_transition:
State transition S_ELECTION -> S_RELEASE_DC [ input=I_RELEASE_DC
cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: do_dc_release: DC role
released
Mar 1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_readwrite: We
are now in R/O mode
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: pe_connection_destroy:
Connection to the Policy Engine released
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: do_te_control:
Transitioner is now inactive
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: do_te_control:
Disconnecting STONITH...
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info:
tengine_stonith_connection_destroy: Fencing daemon disconnected
Mar 1 11:16:00 wapgw1-log crmd: [24547]: notice: Not currently connected.
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: do_state_transition:
State transition S_RELEASE_DC -> S_PENDING [ input=I_RELEASE_SUCCESS
cause=C_FSA_INTERNAL origin=do_dc_release ]
Mar 1 11:16:01 wapgw1-log attrd: [24545]: WARN: attrd_cib_callback:
Update 2849 for probe_complete=true failed: The object/attribute does
not exist
Mar 1 11:16:01 wapgw1-log attrd: [24545]: WARN: attrd_cib_callback:
Update 2919 for last-failure-ilo-wapgw1-2:0=1298967173 failed: The
object/attribute does not exist
Mar 1 11:16:01 wapgw1-log attrd: [24545]: WARN: attrd_cib_callback:
Update 2924 for last-failure-ilo-wapgw1-1:0=1298967173 failed: The
object/attribute does not exist
Mar 1 11:16:01 wapgw1-log attrd: [24545]: WARN: attrd_cib_callback:
Update 2944 for fail-count-ilo-wapgw1-2:0=INFINITY failed: The
object/attribute does not exist
Mar 1 11:16:01 wapgw1-log attrd: [24545]: WARN: attrd_cib_callback:
Update 2947 for fail-count-ilo-wapgw1-1:0=INFINITY failed: The
object/attribute does not exist
Mar 1 11:16:01 wapgw1-log crmd: [24547]: info: update_dc: Set DC to
wapgw1-2 (3.0.1)
After a while, the DC remembered all those actions:
Mar 1 11:17:20 wapgw1-2 crmd: [5749]: WARN: action_timer_callback:
Timer popped (timeout=20000, abort_level=1000000, complete=false)
Mar 1 11:17:20 wapgw1-2 crmd: [5749]: ERROR: print_elem: Aborting
transition, action lost: [Action 33]: In-flight (id:
p-drbd-mdirect1-1:0_monitor_0, loc: wapgw1-log, priority: 0)
Mar 1 11:17:20 wapgw1-2 crmd: [5749]: info: abort_transition_graph:
action_timer_callback:486 - Triggered transition abort (complete=0) :
Action lost
Mar 1 11:17:20 wapgw1-2 crmd: [5749]: WARN: cib_action_update: rsc_op
33: p-drbd-mdirect1-1:0_monitor_0 on wapgw1-log timed out
...
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op:
Processing failed op ilo-wapgw1-log:0_monitor_0 on wapgw1-log: unknown
error (1)
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op:
Processing failed op p-drbd-mproxy1-2:0_monitor_0 on wapgw1-log: unknown
error (1)
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op:
Processing failed op p-drbd-mdirect1-1:0_monitor_0 on wapgw1-log:
unknown error (1)
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op:
Processing failed op p-drbd-mrouter1-1:0_monitor_0 on wapgw1-log:
unknown error (1)
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op:
Processing failed op vm-mdirect1-1_monitor_0 on wapgw1-log: unknown
error (1)
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op:
Processing failed op vm-mproxy1-1_monitor_0 on wapgw1-log: unknown error (1)
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op:
Processing failed op vm-mdirect1-2_monitor_0 on wapgw1-log: unknown
error (1)
...
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: ERROR: native_add_running:
Resource ocf::VirtualDomain:vm-mdirect1-1 appears to be active on 2 nodes.
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: WARN: See
http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: ERROR: native_add_running:
Resource ocf::VirtualDomain:vm-mproxy1-1 appears to be active on 2 nodes.
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: WARN: See
http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: ERROR: native_add_running:
Resource ocf::VirtualDomain:vm-mrouter1-1 appears to be active on 2 nodes.
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: WARN: See
http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
...
So it tried to stop all resources. It succeeded on the real nodes, but
not on the quorum node:
Mar 1 11:17:21 wapgw1-log VirtualDomain[24851]: ERROR: Setup problem:
Couldn't find utility virsh
Mar 1 11:17:21 wapgw1-log VirtualDomain[24852]: ERROR: Setup problem:
Couldn't find utility virsh
Mar 1 11:17:21 wapgw1-log VirtualDomain[24853]: ERROR: Setup problem:
Couldn't find utility virsh
Mar 1 11:17:21 wapgw1-log VirtualDomain[24850]: ERROR: Setup problem:
Couldn't find utility virsh
Mar 1 11:17:21 wapgw1-log crmd: [24547]: info: process_lrm_event: LRM
operation vm-mdirect1-2_stop_0 (call=44, rc=5, cib-update=188,
confirmed=true) not installed
Mar 1 11:17:21 wapgw1-log crmd: [24547]: info: process_lrm_event: LRM
operation vm-mproxy1-1_stop_0 (call=45, rc=5, cib-update=189,
confirmed=true) not installed
Mar 1 11:17:21 wapgw1-log crmd: [24547]: info: process_lrm_event: LRM
operation vm-mproxy1-2_stop_0 (call=46, rc=5, cib-update=190,
confirmed=true) not installed
Mar 1 11:17:21 wapgw1-log crmd: [24547]: info: process_lrm_event: LRM
operation vm-mdirect1-1_stop_0 (call=43, rc=5, cib-update=191,
confirmed=true) not installed
Then the cluster is stuck. Resources cannot be stopped on the quorum node
because no such resources exist there. All resources on the real nodes are
down. Pengine believes the resources are started on the quorum node.
Mar 1 11:17:50 wapgw1-2 pengine: [5748]: notice: native_print:
vm-mdirect1-1 (ocf::heartbeat:VirtualDomain): Started wapgw1-log
FAILED
Mar 1 11:17:50 wapgw1-2 pengine: [5748]: notice: native_print:
vm-mdirect1-2 (ocf::heartbeat:VirtualDomain): Started wapgw1-log
Mar 1 11:17:50 wapgw1-2 pengine: [5748]: notice: native_print:
vm-mproxy1-1 (ocf::heartbeat:VirtualDomain): Started wapgw1-log
Mar 1 11:17:50 wapgw1-2 pengine: [5748]: notice: native_print:
vm-mproxy1-2 (ocf::heartbeat:VirtualDomain): Started wapgw1-log
End of story. I have a full hb_report, so let me know if it would help.
I understand that the quorum node could be fenced to resolve this situation.
But fencing by reset is not a very good thing, particularly for the
non-clustered services running on this node. Moreover, fencing was not
really needed, as those resources can never run on that node. The
cluster would have continued fine if the quorum node could confirm that no
resources were running on it.
Was there something that Pacemaker could have done better? For example:
1) Retry the failed monitor actions somehow;
2) Maybe accept "not installed" on stop as a confirmation. I think that
would be dangerous, though, if the node has failed in some screwed-up way;
3) Or could there be a configuration knob saying "this resource does not
exist on this node", so the cluster does not try to probe or run nonexistent
resources where that is impossible? (A purely illustrative sketch follows
this list.)
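To be concrete about suggestion 3, I imagine something like the sketch below;
the per-constraint option is purely hypothetical and nothing like it exists
in Pacemaker 1.0.10:

  # hypothetical syntax, for illustration only
  location l-vm-mdirect1-1-real-only vm-mdirect1-1 -inf: wapgw1-log
  # imagined knob: also skip the initial probe (monitor_0) on that node,
  # e.g. a per-constraint option like probe=false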
What is the right way to avoid a cluster split when bonding switches links? I
have adjusted the 'token' directive in corosync.conf to 5000; is this OK?
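For reference, that is the totem token timeout in milliseconds, so the
relevant part of corosync.conf now looks roughly like this (other totem and
interface options unchanged):

  totem {
          version: 2
          # time in ms to wait for a token before declaring a processor failed;
          # raised from the 1000 ms default to ride out the bonding failover
          token: 5000
  }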
Thank you for your time.
--
Pavel Levshin //flicker