[Pacemaker] Failure after intermittent network outage
Pavel Levshin
pavel at levshin.spb.ru
Sun Mar 6 21:05:04 UTC 2011
Hi everyone.
We have a three-node cluster using Pacemaker 1.0.10 on RHEL5.5. Two nodes
(wapgw1-1 and wapgw1-2) are configured for DRBD and run virtual
machines over it. The third node (wapgw1-log) is mostly a quorum server,
i.e. it has neither libvirtd nor DRBD installed. There are location
constraints which allow resources to run on the "real" nodes only.
All three nodes are connected to the network over bonded links in
active-backup mode.
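To illustrate, the constraints are roughly of this form (a minimal crm shell
sketch; the actual constraint names and scores in our CIB differ, and only
one VM is shown):

  # illustrative only: keep a clustered resource off the quorum node
  location l-vm-mdirect1-1-real-only vm-mdirect1-1 -inf: wapgw1-log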
STONITH had been configured but was unavailable at the moment. That's bad, I know.
The problem came when one of the two interfaces on the quorum node
(wapgw1-log) went down. It was not the first time this had happened, and
previously it had not caused any harm.
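For the record, the bond itself is an ordinary RHEL5 active-backup setup,
roughly like the sketch below (option values such as miimon are illustrative
and may not match the real hosts):

  # /etc/sysconfig/network-scripts/ifcfg-bond0 (sketch)
  DEVICE=bond0
  ONBOOT=yes
  BOOTPROTO=static
  BONDING_OPTS="mode=active-backup miimon=100"
  # eth0 and eth1 have MASTER=bond0 and SLAVE=yes in their own ifcfg files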
Corosync lost connectivity, and the cluster split into partitions.
Mar 1 11:15:58 wapgw1-log corosync[24536]: [TOTEM ] A processor
failed, forming new configuration.
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] notice:
pcmk_peer_update: Transitional membership event on ring 3500: memb=1,
new=0, lost=2
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
pcmk_peer_update: memb: wapgw1-log 813454090
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
pcmk_peer_update: lost: wapgw1-1 1098666762
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
pcmk_peer_update: lost: wapgw1-2 1115443978
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] notice:
pcmk_peer_update: Stable membership event on ring 3500: memb=1, new=0,
lost=0
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
pcmk_peer_update: MEMB: wapgw1-log 813454090
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
ais_mark_unseen_peer_dead: Node wapgw1-2 was not seen in the previous
transition
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
update_member: Node 1115443978/wapgw1-2 is now: lost
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
ais_mark_unseen_peer_dead: Node wapgw1-1 was not seen in the previous
transition
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
update_member: Node 1098666762/wapgw1-1 is now: lost
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
send_member_notification: Sending membership update 3500 to 2 children
Mar 1 11:15:59 wapgw1-log crmd: [24547]: notice: ais_dispatch:
Membership 3500: quorum lost
Mar 1 11:15:59 wapgw1-log corosync[24536]: [TOTEM ] A processor
joined or left the membership and a new membership was formed.
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: crm_update_peer: Node
wapgw1-2: id=1115443978 state=lost (new) addr=r(0) ip(10.83.124.66)
votes=1 born=3400 seen=3496 proc=00000000000000000000000000013312
Mar 1 11:15:59 wapgw1-log cib: [24543]: notice: ais_dispatch:
Membership 3500: quorum lost
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: crm_update_peer: Node
wapgw1-1: id=1098666762 state=lost (new) addr=r(0) ip(10.83.124.65)
votes=1 born=3404 seen=3496 proc=00000000000000000000000000013312
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: crm_update_peer: Node
wapgw1-2: id=1115443978 state=lost (new) addr=r(0) ip(10.83.124.66)
votes=1 born=3400 seen=3496 proc=00000000000000000000000000013312
Mar 1 11:15:59 wapgw1-log crmd: [24547]: WARN: check_dead_member: Our
DC node (wapgw1-2) left the cluster
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: do_state_transition:
State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION
cause=C_FSA_INTERNAL origin=check_dead_member ]
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: crm_update_peer: Node
wapgw1-1: id=1098666762 state=lost (new) addr=r(0) ip(10.83.124.65)
votes=1 born=3404 seen=3496 proc=00000000000000000000000000013312
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: update_dc: Unset DC wapgw1-2
Mar 1 11:15:59 wapgw1-log corosync[24536]: [MAIN ] Completed service
synchronization, ready to provide service.
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: do_state_transition:
State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
cause=C_FSA_INTERNAL origin=do_election_check ]
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: do_te_control:
Registering TE UUID: 1be865f6-557d-45c4-b549-c10dbab5acc4
Mar 1 11:15:59 wapgw1-log crmd: [24547]: WARN:
cib_client_add_notify_callback: Callback already present
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: set_graph_functions:
Setting custom graph functions
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: unpack_graph: Unpacked
transition -1: 0 actions in 0 synapses
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: do_dc_takeover: Taking
over DC status for this partition
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_readwrite: We
are now in R/W mode
The DC node noticed the member loss:
Mar 1 11:15:59 wapgw1-2 pengine: [5748]: WARN: pe_fence_node: Node
wapgw1-log will be fenced because it is un-expectedly down
Mar 1 11:15:59 wapgw1-2 pengine: [5748]: info:
determine_online_status_fencing: ha_state=active, ccm_state=false,
crm_state=online, join_state=member, expected=member
Mar 1 11:15:59 wapgw1-2 pengine: [5748]: WARN: determine_online_status:
Node wapgw1-log is unclean
But fencing was unavailable, so the sad story continues.
Bonding switched to the backup link; it took a second. Back on the quorum node:
Mar 1 11:15:59 wapgw1-log kernel: bonding: bond0: link status
definitely down for interface eth0, disabling it
Mar 1 11:15:59 wapgw1-log kernel: bonding: bond0: making interface eth1
the new active one.
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_request:
Operation complete: op cib_modify for section crm_config
(origin=local/crmd/169, version=0.797.24): ok (rc=0)
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: crm_ais_dispatch:
Setting expected votes to 3
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: do_state_transition:
State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
cause=C_FSA_INTERNAL origin=check_join_state ]
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: do_state_transition: All
1 cluster nodes responded to the join offer.
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: do_dc_join_finalize:
join-5: Syncing the CIB from wapgw1-log to the rest of the cluster
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: te_connect_stonith:
Attempting connection to fencing daemon...
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_request:
Operation complete: op cib_modify for section crm_config
(origin=local/crmd/172, version=0.797.24): ok (rc=0)
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_request:
Operation complete: op cib_sync for section 'all'
(origin=local/crmd/173, version=0.797.24): ok (rc=0)
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] notice:
pcmk_peer_update: Transitional membership event on ring 3504: memb=1,
new=0, lost=0
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
pcmk_peer_update: memb: wapgw1-log 813454090
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] notice:
pcmk_peer_update: Stable membership event on ring 3504: memb=3, new=2,
lost=0
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
update_member: Node 1098666762/wapgw1-1 is now: member
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
pcmk_peer_update: NEW: wapgw1-1 1098666762
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
update_member: Node 1115443978/wapgw1-2 is now: member
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
pcmk_peer_update: NEW: wapgw1-2 1115443978
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
pcmk_peer_update: MEMB: wapgw1-log 813454090
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
pcmk_peer_update: MEMB: wapgw1-1 1098666762
Mar 1 11:15:59 wapgw1-log cib: [24543]: notice: ais_dispatch:
Membership 3504: quorum acquired
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
pcmk_peer_update: MEMB: wapgw1-2 1115443978
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info:
send_member_notification: Sending membership update 3504 to 2 children
Mar 1 11:15:59 wapgw1-log corosync[24536]: [TOTEM ] A processor
joined or left the membership and a new membership was formed.
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: crm_update_peer: Node
wapgw1-2: id=1115443978 state=member (new) addr=r(0) ip(10.83.124.66)
votes=1 born=3400 seen=3504 proc=00000000000000000000000000013312
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: crm_update_peer: Node
wapgw1-1: id=1098666762 state=member (new) addr=r(0) ip(10.83.124.65)
votes=1 born=3404 seen=3504 proc=00000000000000000000000000013312
Mar 1 11:15:59 wapgw1-log corosync[24536]: [MAIN ] Completed service
synchronization, ready to provide service.
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: crm_update_peer: Node
wapgw1-2: id=1115443978 state=member (new) addr=r(0) ip(10.83.124.66)
votes=1 born=3400 seen=3504 proc=00000000000000000000000000013312
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: crm_update_peer: Node
wapgw1-1: id=1098666762 state=member (new) addr=r(0) ip(10.83.124.65)
votes=1 born=3404 seen=3504 proc=00000000000000000000000000013312
Mar 1 11:15:59 wapgw1-log corosync[24536]: [MAIN ] Completed service
synchronization, ready to provide service.
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_diff: Diff
0.797.25 -> 0.797.26 not applied to 0.797.24: current "num_updates" is
less than required
Mar 1 11:15:59 wapgw1-log cib: [24543]: WARN: cib_server_process_diff:
Not requesting full refresh in slave mode.
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_diff: Diff
0.797.26 -> 0.797.27 not applied to 0.797.24: current "num_updates" is
less than required
Mar 1 11:15:59 wapgw1-log cib: [24543]: WARN: cib_server_process_diff:
Not requesting full refresh in slave mode.
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_diff: Diff
0.797.27 -> 0.797.28 not applied to 0.797.24: current "num_updates" is
less than required
Mar 1 11:15:59 wapgw1-log cib: [24543]: WARN: cib_server_process_diff:
Not requesting full refresh in slave mode.
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: te_connect_stonith:
Connected
Mar 1 11:16:00 wapgw1-log crmd: [24547]: notice: ais_dispatch:
Membership 3504: quorum acquired
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: ais_status_callback:
status: wapgw1-2 is now member (was lost)
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: crm_update_peer: Node
wapgw1-2: id=1115443978 state=member (new) addr=r(0) ip(10.83.124.66)
votes=1 born=3400 seen=3504 proc=00000000000000000000000000013312
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: ais_status_callback:
status: wapgw1-1 is now member (was lost)
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: crm_update_peer: Node
wapgw1-1: id=1098666762 state=member (new) addr=r(0) ip(10.83.124.65)
votes=1 born=3404 seen=3504 proc=00000000000000000000000000013312
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: crm_update_quorum:
Updating quorum status to true (call=181)
Mar 1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request:
Operation complete: op cib_modify for section nodes
(origin=local/crmd/174, version=0.797.24): ok (rc=0)
Mar 1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request:
Operation complete: op cib_delete for section
//node_state[@uname='wapgw1-2']/lrm (origin=local/crmd/175,
version=0.797.25): ok (rc=0)
Mar 1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request:
Operation complete: op cib_delete for section
//node_state[@uname='wapgw1-2']/transient_attributes
(origin=local/crmd/176, version=0.797.26): ok (rc=0)
Mar 1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request:
Operation complete: op cib_delete for section
//node_state[@uname='wapgw1-1']/lrm (origin=local/crmd/177,
version=0.797.27): ok (rc=0)
Mar 1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request:
Operation complete: op cib_delete for section
//node_state[@uname='wapgw1-1']/transient_attributes
(origin=local/crmd/178, version=0.797.28): ok (rc=0)
Mar 1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request:
Operation complete: op cib_modify for section nodes
(origin=local/crmd/179, version=0.797.28): ok (rc=0)
Mar 1 11:16:00 wapgw1-log cib: [24543]: info: log_data_element:
cib:diff: - <cib dc-uuid="wapgw1-2" admin_epoch="0" epoch="797"
num_updates="28" />
Mar 1 11:16:00 wapgw1-log cib: [24543]: info: log_data_element:
cib:diff: + <cib dc-uuid="wapgw1-log" admin_epoch="0" epoch="798"
num_updates="1" />
Mar 1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_request:
Operation complete: op cib_modify for section cib
(origin=local/crmd/181, version=0.798.1): ok (rc=0)
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: crm_ais_dispatch:
Setting expected votes to 3
Mar 1 11:16:00 wapgw1-log cib: [24721]: info: write_cib_contents:
Archived previous version as /var/lib/heartbeat/crm/cib-57.raw
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: erase_xpath_callback:
Deletion of "//node_state[@uname='wapgw1-2']/lrm": ok (rc=0)
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: erase_xpath_callback:
Deletion of "//node_state[@uname='wapgw1-2']/transient_attributes": ok
(rc=0)
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: erase_xpath_callback:
Deletion of "//node_state[@uname='wapgw1-1']/lrm": ok (rc=0)
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: erase_xpath_callback:
Deletion of "//node_state[@uname='wapgw1-1']/transient_attributes": ok
(rc=0)
Mar 1 11:16:00 wapgw1-log cib: [24721]: info: write_cib_contents: Wrote
version 0.798.0 of the CIB to disk (digest:
2cb8c5e90f8f3d227335340ddeb1914a)
DC node:
Mar 1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] notice:
pcmk_peer_update: Transitional membership event on ring 3504: memb=2,
new=0, lost=0
Mar 1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info:
pcmk_peer_update: memb: wapgw1-1 1098666762
Mar 1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info:
pcmk_peer_update: memb: wapgw1-2 1115443978
Mar 1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] notice:
pcmk_peer_update: Stable membership event on ring 3504: memb=3, new=1,
lost=0
Mar 1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info: update_member:
Node 813454090/wapgw1-log is now: member
Mar 1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info:
pcmk_peer_update: NEW: wapgw1-log 813454090
Mar 1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info:
pcmk_peer_update: MEMB: wapgw1-log 813454090
Mar 1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info:
pcmk_peer_update: MEMB: wapgw1-1 1098666762
Mar 1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info:
pcmk_peer_update: MEMB: wapgw1-2 1115443978
Mar 1 11:15:59 wapgw1-2 corosync[5738]: [pcmk ] info:
send_member_notification: Sending membership update 3504 to 2 children
Mar 1 11:15:59 wapgw1-2 corosync[5738]: [TOTEM ] A processor joined
or left the membership and a new membership was formed.
Mar 1 11:15:59 wapgw1-2 cib: [5745]: info: ais_dispatch: Membership
3504: quorum retained
Mar 1 11:15:59 wapgw1-2 crmd: [5749]: info: ais_dispatch: Membership
3504: quorum retained
Mar 1 11:15:59 wapgw1-2 cib: [5745]: info: crm_update_peer: Node
wapgw1-log: id=813454090 state=member (new) addr=r(0) ip(10.83.124.48)
votes=1 born=3464 seen=3504 proc=00000000000000000000000000013312
Mar 1 11:15:59 wapgw1-2 crmd: [5749]: info: ais_status_callback:
status: wapgw1-log is now member (was lost)
Mar 1 11:15:59 wapgw1-2 crmd: [5749]: info: crm_update_peer: Node
wapgw1-log: id=813454090 state=member (new) addr=r(0) ip(10.83.124.48)
votes=1 born=3464 seen=3504 proc=00000000000000000000000000013312
Mar 1 11:15:59 wapgw1-2 corosync[5738]: [MAIN ] Completed service
synchronization, ready to provide service.
This time, the DC decided to check all resources on the rejoined node.
Mar 1 11:16:00 wapgw1-2 crmd: [5749]: info: do_state_transition: State
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=handle_response ]
Mar 1 11:16:00 wapgw1-2 crmd: [5749]: info: unpack_graph: Unpacked
transition 1353: 37 actions in 37 synapses
Mar 1 11:16:00 wapgw1-2 crmd: [5749]: info: do_te_invoke: Processing
graph 1353 (ref=pe_calc-dc-1298967360-1592) derived from
/var/lib/pengine/pe-input-1525.bz2
Mar 1 11:16:00 wapgw1-2 crmd: [5749]: info: te_rsc_command: Initiating
action 30: monitor ilo-wapgw1-1:0_monitor_0 on wapgw1-log
Mar 1 11:16:00 wapgw1-2 crmd: [5749]: info: te_rsc_command: Initiating
action 31: monitor ilo-wapgw1-2:0_monitor_0 on wapgw1-log
Mar 1 11:16:00 wapgw1-2 crmd: [5749]: info: te_rsc_command: Initiating
action 32: monitor ilo-wapgw1-log:0_monitor_0 on wapgw1-log
Mar 1 11:16:00 wapgw1-2 crmd: [5749]: info: te_rsc_command: Initiating
action 33: monitor p-drbd-mdirect1-1:0_monitor_0 on wapgw1-log
Mar 1 11:16:00 wapgw1-2 crmd: [5749]: info: te_rsc_command: Initiating
action 34: monitor p-drbd-mdirect1-2:0_monitor_0 on wapgw1-log
And then... 'Suddenly' another DC appeared.
Mar 1 11:16:00 wapgw1-2 crmd: [5749]: ERROR: crmd_ha_msg_filter:
Another DC detected: wapgw1-log (op=noop)
Mar 1 11:16:00 wapgw1-2 crmd: [5749]: info: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_ELECTION [ input=I_ELECTION
cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
Mar 1 11:16:00 wapgw1-2 crmd: [5749]: info: update_dc: Unset DC wapgw1-2
Mar 1 11:16:00 wapgw1-log crmd: [24547]: ERROR: crmd_ha_msg_filter:
Another DC detected: wapgw1-2 (op=noop)
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: do_state_transition:
State transition S_FINALIZE_JOIN -> S_ELECTION [ input=I_ELECTION
cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: update_dc: Unset DC
wapgw1-log
All the monitor actions failed like this:
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: do_lrm_rsc_op:
Performing key=33:1353:7:22dc5497-478f-49ff-b07f-9fcd6da325cd
op=p-drbd-mdirect1-1:0_monitor_0 )
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: do_lrm_rsc_op:
Discarding attempt to perform action monitor on p-drbd-mdirect1-1:0 in
state S_ELECTION
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: send_direct_ack: ACK'ing
resource op p-drbd-mdirect1-1:0_monitor_0 from
33:1353:7:22dc5497-478f-49ff-b07f-9fcd6da325cd:
lrm_invoke-lrmd-1298967360-58
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: process_te_message:
Processing (N)ACK lrm_invoke-lrmd-1298967360-58 from wapgw1-log
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: process_graph_event:
Action p-drbd-mdirect1-1:0_monitor_0/33
(4:99;33:1353:7:22dc5497-478f-49ff-b07f-9fcd6da325cd) initiated by a
different transitioner
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: abort_transition_graph:
process_graph_event:456 - Triggered transition abort (complete=1,
tag=lrm_rsc_op, id=p-drbd-mdirect1-1:0_monitor_0,
magic=4:99;33:1353:7:22dc5497-478f-49ff-b07f-9fcd6da325cd) : Foreign event
The old DC won the election.
Mar 1 11:16:00 wapgw1-2 crmd: [5749]: info: do_dc_takeover: Taking over
DC status for this partition
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: do_state_transition:
State transition S_ELECTION -> S_RELEASE_DC [ input=I_RELEASE_DC
cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: do_dc_release: DC role
released
Mar 1 11:16:00 wapgw1-log cib: [24543]: info: cib_process_readwrite: We
are now in R/O mode
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: pe_connection_destroy:
Connection to the Policy Engine released
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: do_te_control:
Transitioner is now inactive
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: do_te_control:
Disconnecting STONITH...
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info:
tengine_stonith_connection_destroy: Fencing daemon disconnected
Mar 1 11:16:00 wapgw1-log crmd: [24547]: notice: Not currently connected.
Mar 1 11:16:00 wapgw1-log crmd: [24547]: info: do_state_transition:
State transition S_RELEASE_DC -> S_PENDING [ input=I_RELEASE_SUCCESS
cause=C_FSA_INTERNAL origin=do_dc_release ]
Mar 1 11:16:01 wapgw1-log attrd: [24545]: WARN: attrd_cib_callback:
Update 2849 for probe_complete=true failed: The object/attribute does
not exist
Mar 1 11:16:01 wapgw1-log attrd: [24545]: WARN: attrd_cib_callback:
Update 2919 for last-failure-ilo-wapgw1-2:0=1298967173 failed: The
object/attribute does not exist
Mar 1 11:16:01 wapgw1-log attrd: [24545]: WARN: attrd_cib_callback:
Update 2924 for last-failure-ilo-wapgw1-1:0=1298967173 failed: The
object/attribute does not exist
Mar 1 11:16:01 wapgw1-log attrd: [24545]: WARN: attrd_cib_callback:
Update 2944 for fail-count-ilo-wapgw1-2:0=INFINITY failed: The
object/attribute does not exist
Mar 1 11:16:01 wapgw1-log attrd: [24545]: WARN: attrd_cib_callback:
Update 2947 for fail-count-ilo-wapgw1-1:0=INFINITY failed: The
object/attribute does not exist
Mar 1 11:16:01 wapgw1-log crmd: [24547]: info: update_dc: Set DC to
wapgw1-2 (3.0.1)
After a while, the DC remembered all those actions:
Mar 1 11:17:20 wapgw1-2 crmd: [5749]: WARN: action_timer_callback:
Timer popped (timeout=20000, abort_level=1000000, complete=false)
Mar 1 11:17:20 wapgw1-2 crmd: [5749]: ERROR: print_elem: Aborting
transition, action lost: [Action 33]: In-flight (id:
p-drbd-mdirect1-1:0_monitor_0, loc: wapgw1-log, priority: 0)
Mar 1 11:17:20 wapgw1-2 crmd: [5749]: info: abort_transition_graph:
action_timer_callback:486 - Triggered transition abort (complete=0) :
Action lost
Mar 1 11:17:20 wapgw1-2 crmd: [5749]: WARN: cib_action_update: rsc_op
33: p-drbd-mdirect1-1:0_monitor_0 on wapgw1-log timed out
...
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op:
Processing failed op ilo-wapgw1-log:0_monitor_0 on wapgw1-log: unknown
error (1)
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op:
Processing failed op p-drbd-mproxy1-2:0_monitor_0 on wapgw1-log: unknown
error (1)
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op:
Processing failed op p-drbd-mdirect1-1:0_monitor_0 on wapgw1-log:
unknown error (1)
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op:
Processing failed op p-drbd-mrouter1-1:0_monitor_0 on wapgw1-log:
unknown error (1)
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op:
Processing failed op vm-mdirect1-1_monitor_0 on wapgw1-log: unknown
error (1)
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op:
Processing failed op vm-mproxy1-1_monitor_0 on wapgw1-log: unknown error (1)
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op:
Processing failed op vm-mdirect1-2_monitor_0 on wapgw1-log: unknown
error (1)
...
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: ERROR: native_add_running:
Resource ocf::VirtualDomain:vm-mdirect1-1 appears to be active on 2 nodes.
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: WARN: See
http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: ERROR: native_add_running:
Resource ocf::VirtualDomain:vm-mproxy1-1 appears to be active on 2 nodes.
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: WARN: See
http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: ERROR: native_add_running:
Resource ocf::VirtualDomain:vm-mrouter1-1 appears to be active on 2 nodes.
Mar 1 11:17:21 wapgw1-2 pengine: [5748]: WARN: See
http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
...
So it tried to stop all resources. It succeeded on the real nodes, but
not on the quorum node:
Mar 1 11:17:21 wapgw1-log VirtualDomain[24851]: ERROR: Setup problem:
Couldn't find utility virsh
Mar 1 11:17:21 wapgw1-log VirtualDomain[24852]: ERROR: Setup problem:
Couldn't find utility virsh
Mar 1 11:17:21 wapgw1-log VirtualDomain[24853]: ERROR: Setup problem:
Couldn't find utility virsh
Mar 1 11:17:21 wapgw1-log VirtualDomain[24850]: ERROR: Setup problem:
Couldn't find utility virsh
Mar 1 11:17:21 wapgw1-log crmd: [24547]: info: process_lrm_event: LRM
operation vm-mdirect1-2_stop_0 (call=44, rc=5, cib-update=188,
confirmed=true) not installed
Mar 1 11:17:21 wapgw1-log crmd: [24547]: info: process_lrm_event: LRM
operation vm-mproxy1-1_stop_0 (call=45, rc=5, cib-update=189,
confirmed=true) not installed
Mar 1 11:17:21 wapgw1-log crmd: [24547]: info: process_lrm_event: LRM
operation vm-mproxy1-2_stop_0 (call=46, rc=5, cib-update=190,
confirmed=true) not installed
Mar 1 11:17:21 wapgw1-log crmd: [24547]: info: process_lrm_event: LRM
operation vm-mdirect1-1_stop_0 (call=43, rc=5, cib-update=191,
confirmed=true) not installed
Then the cluster is stuck. Resources cannot be stopped on the quorum node
because no such resources exist there. All resources on the real nodes are
down. Pengine believes the resources are started on the quorum node.
Mar 1 11:17:50 wapgw1-2 pengine: [5748]: notice: native_print:
vm-mdirect1-1 (ocf::heartbeat:VirtualDomain): Started wapgw1-log
FAILED
Mar 1 11:17:50 wapgw1-2 pengine: [5748]: notice: native_print:
vm-mdirect1-2 (ocf::heartbeat:VirtualDomain): Started wapgw1-log
Mar 1 11:17:50 wapgw1-2 pengine: [5748]: notice: native_print:
vm-mproxy1-1 (ocf::heartbeat:VirtualDomain): Started wapgw1-log
Mar 1 11:17:50 wapgw1-2 pengine: [5748]: notice: native_print:
vm-mproxy1-2 (ocf::heartbeat:VirtualDomain): Started wapgw1-log
End of story. I have a full hb_report, so let me know if it would help.
I understand that the quorum node could be fenced to resolve this situation.
But fencing by reset is not a very good thing, particularly for the
non-clustered services running on this node. Moreover, fencing was not
really needed, as those resources can never run on that node. The
cluster would have continued fine if the quorum node could confirm that no
resources were running on it.
Was there something that Pacemaker could have done better? For example:
1) Retry the failed monitor actions somehow;
2) Maybe accept "not installed" on stop as a confirmation. I think that
would be dangerous, though, if the node has failed in some screwed-up way;
3) Or could there be a configuration knob saying "this resource does not
exist on this node", so the cluster does not try to probe or run nonexistent
resources where that is impossible? (A purely illustrative sketch follows
this list.)
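To be concrete about suggestion 3, I imagine something like the sketch below;
the per-constraint option is purely hypothetical and nothing like it exists
in Pacemaker 1.0.10:

  # hypothetical syntax, for illustration only
  location l-vm-mdirect1-1-real-only vm-mdirect1-1 -inf: wapgw1-log
  # imagined knob: also skip the initial probe (monitor_0) on that node,
  # e.g. a per-constraint option like probe=false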
What is the right way to avoid a cluster split when bonding switches links? I
have adjusted the 'token' directive in corosync.conf to 5000; is this OK?
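For reference, that is the totem token timeout in milliseconds, so the
relevant part of corosync.conf now looks roughly like this (other totem and
interface options unchanged):

  totem {
          version: 2
          # time in ms to wait for a token before declaring a processor failed;
          # raised from the 1000 ms default to ride out the bonding failover
          token: 5000
  }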
Thank you for your time.
--
Pavel Levshin //flicker