[Pacemaker] Recovery after lost quorum

Denis Witt denis.witt at concepts-and-training.de
Tue Jun 4 19:22:25 EDT 2013


On 05.06.2013, at 00:52, Andrew Beekhof <andrew at beekhof.net> wrote:

>> been restored the resources aren't restarted. Running crm_resource -P
>> brings everything up, but of course it would be nice if this happened
>> automatically. Is there any way to achieve this?
> 
> It should happen automatically.
> Logs?

Hi Andrew,

Thanks for your reply.

Here are the logs:

Jun  5 01:11:03 test4 pengine: [18625]: WARN: cluster_status: We do not have quorum - fencing and resource management disabled
Jun  5 01:11:03 test4 pengine: [18625]: notice: LogActions: Start   pingtest:0#011(test4 - blocked)
Jun  5 01:11:03 test4 pengine: [18625]: notice: LogActions: Start   drbd:0#011(test4 - blocked)
Jun  5 01:11:03 test4 crmd: [18626]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jun  5 01:11:03 test4 crmd: [18626]: info: do_te_invoke: Processing graph 9 (ref=pe_calc-dc-1370387463-81) derived from /var/lib/pengine/pe-input-108.bz2
Jun  5 01:11:03 test4 crmd: [18626]: notice: run_graph: ==== Transition 9 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-108.bz2): Complete
Jun  5 01:11:03 test4 crmd: [18626]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun  5 01:11:03 test4 pengine: [18625]: notice: process_pe_message: Transition 9: PEngine Input stored in: /var/lib/pengine/pe-input-108.bz2
Jun  5 01:11:06 test4 attrd: [18624]: notice: attrd_trigger_update: Sending flush op to all hosts for: pingd (<null>)
Jun  5 01:11:06 test4 attrd: [18624]: notice: attrd_perform_update: Sent delete 33: node=test4, attr=pingd, id=<n/a>, set=(null), section=status
Jun  5 01:11:06 test4 crmd: [18626]: info: abort_transition_graph: te_update_diff:194 - Triggered transition abort (complete=1, tag=transient_attributes, id=test4, magic=NA, cib=0.43.98) : Transient attribute: removal
Jun  5 01:11:06 test4 crmd: [18626]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jun  5 01:11:06 test4 pengine: [18625]: WARN: cluster_status: We do not have quorum - fencing and resource management disabled
Jun  5 01:11:06 test4 pengine: [18625]: notice: LogActions: Start   pingtest:0#011(test4 - blocked)
Jun  5 01:11:06 test4 pengine: [18625]: notice: LogActions: Start   drbd:0#011(test4 - blocked)
Jun  5 01:11:06 test4 crmd: [18626]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jun  5 01:11:06 test4 crmd: [18626]: info: do_te_invoke: Processing graph 10 (ref=pe_calc-dc-1370387466-82) derived from /var/lib/pengine/pe-input-109.bz2
Jun  5 01:11:06 test4 crmd: [18626]: notice: run_graph: ==== Transition 10 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-109.bz2): Complete
Jun  5 01:11:06 test4 crmd: [18626]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun  5 01:11:06 test4 pengine: [18625]: notice: process_pe_message: Transition 10: PEngine Input stored in: /var/lib/pengine/pe-input-109.bz2
Jun  5 01:11:11 test4 corosync[18598]:   [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 128: memb=1, new=0, lost=0
Jun  5 01:11:11 test4 corosync[18598]:   [pcmk  ] info: pcmk_peer_update: memb: test4 1835313344
Jun  5 01:11:11 test4 corosync[18598]:   [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 128: memb=2, new=1, lost=0
Jun  5 01:11:11 test4 corosync[18598]:   [pcmk  ] info: update_member: Node 526690496/backup3 is now: member
Jun  5 01:11:11 test4 corosync[18598]:   [pcmk  ] info: pcmk_peer_update: NEW:  backup3 526690496
Jun  5 01:11:11 test4 corosync[18598]:   [pcmk  ] info: pcmk_peer_update: MEMB: backup3 526690496
Jun  5 01:11:11 test4 corosync[18598]:   [pcmk  ] info: pcmk_peer_update: MEMB: test4 1835313344
Jun  5 01:11:11 test4 corosync[18598]:   [pcmk  ] info: send_member_notification: Sending membership update 128 to 2 children
Jun  5 01:11:11 test4 corosync[18598]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun  5 01:11:11 test4 crmd: [18626]: notice: ais_dispatch_message: Membership 128: quorum acquired
Jun  5 01:11:11 test4 crmd: [18626]: info: ais_status_callback: status: backup3 is now member (was lost)
Jun  5 01:11:11 test4 crmd: [18626]: WARN: match_down_event: No match for shutdown action on backup3
Jun  5 01:11:11 test4 cib: [18621]: notice: ais_dispatch_message: Membership 128: quorum acquired
Jun  5 01:11:11 test4 crmd: [18626]: info: erase_status_tag: Deleting xpath: //node_state[@uname='backup3']/lrm
Jun  5 01:11:11 test4 cib: [18621]: info: crm_update_peer: Node backup3: id=526690496 state=member (new) addr=r(0) ip(192.168.100.31)  votes=1 born=120 seen=128 proc=00000000000000000000000000111312
Jun  5 01:11:11 test4 crmd: [18626]: info: erase_status_tag: Deleting xpath: //node_state[@uname='backup3']/transient_attributes
Jun  5 01:11:11 test4 cib: [18621]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='backup3']/lrm (origin=local/crmd/113, version=0.43.99): ok (rc=0)
Jun  5 01:11:11 test4 crmd: [18626]: info: crm_update_peer: Node backup3: id=526690496 state=member (new) addr=r(0) ip(192.168.100.31)  votes=1 born=120 seen=128 proc=00000000000000000000000000111312
Jun  5 01:11:11 test4 cib: [18621]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='backup3']/transient_attributes (origin=local/crmd/114, version=0.43.100): ok (rc=0)
Jun  5 01:11:11 test4 cib: [18621]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/115, version=0.43.101): ok (rc=0)
Jun  5 01:11:11 test4 cib: [18621]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/117, version=0.43.103): ok (rc=0)
Jun  5 01:11:11 test4 crmd: [18626]: info: crmd_ais_dispatch: Setting expected votes to 3
Jun  5 01:11:11 test4 cib: [18621]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/119, version=0.43.104): ok (rc=0)
Jun  5 01:11:11 test4 crmd: [18626]: info: abort_transition_graph: te_update_diff:320 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=drbd:0_last_failure_0, magic=0:5;6:2:7:ca6fd6dd-bf3c-4b75-9214-26e3a36b2f1c, cib=0.43.99) : Resource op removal
Jun  5 01:11:11 test4 crmd: [18626]: info: abort_transition_graph: te_update_diff:194 - Triggered transition abort (complete=1, tag=transient_attributes, id=backup3, magic=NA, cib=0.43.100) : Transient attribute: removal
Jun  5 01:11:11 test4 corosync[18598]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.100.109) ; members(old:1 left:0)
Jun  5 01:11:11 test4 crmd: [18626]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jun  5 01:11:11 test4 crmd: [18626]: info: do_state_transition: Membership changed: 120 -> 128 - join restart
Jun  5 01:11:11 test4 crmd: [18626]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=do_state_transition ]
Jun  5 01:11:11 test4 corosync[18598]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun  5 01:11:11 test4 crmd: [18626]: info: join_make_offer: Making join offers based on membership 128
Jun  5 01:11:11 test4 crmd: [18626]: info: do_dc_join_offer_all: join-4: Waiting on 2 outstanding join acks
Jun  5 01:11:11 test4 crmd: [18626]: info: update_dc: Set DC to test4 (3.0.6)

Please note that at the moment only two of the three nodes are online, but quorum is established, as expected. Both nodes are running corosync and pacemaker, but the second node doesn't have any of the configured resources installed (so I got "not installed" errors there; usually pacemaker is disabled on that node). The resources also aren't started when pacemaker is disabled on that node and only corosync is running.
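
For completeness, this is roughly what I run to check the state and to bring the resources up by hand (a sketch of my manual procedure on the DC test4; the re-probe is the crm_resource call mentioned above):

# overall status: membership, DC, whether quorum is reported, resource states
crm_mon -1

# manual workaround: cluster-wide re-probe of all resources; after this
# the PE schedules the starts and everything comes up
crm_resource -P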

The analysis.txt from hb_report states:

Log patterns:
Jun  5 01:14:11 test4 crmd: [18626]: ERROR: crm_timer_popped: Integration Timer (I_INTEGRATED) just popped in state S_INTEGRATION! (180000ms)

My config:

node backup3 \
	attributes standby="off"
node test3
node test4
primitive apache lsb:apache2 \
	op monitor interval="10" timeout="20" \
	meta target-role="Started"
primitive drbd ocf:linbit:drbd \
	params drbd_resource="www_r0" \
	op monitor interval="10"
primitive fs_drbd ocf:heartbeat:Filesystem \
	params device="/dev/drbd0" directory="/var/www" fstype="ext4" \
	op monitor interval="5" \
	meta target-role="Started"
primitive pingtest ocf:pacemaker:ping \
	params multiplier="1000" host_list="192.168.100.19" \
	op monitor interval="5"
primitive sip ocf:heartbeat:IPaddr2 \
	params ip="192.168.100.30" nic="eth0" \
	op monitor interval="10" timeout="20" \
	meta target-role="Started"
group grp_all sip fs_drbd apache
ms ms_drbd drbd \
	meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
clone clone_pingtest pingtest
location loc_all_on_best_ping grp_all \
	rule $id="loc_all_on_best_ping-rule" -inf: not_defined pingd or pingd lt 1000
colocation coloc_all_on_drbd inf: grp_all ms_drbd:Master
order order_all_after_drbd inf: ms_drbd:promote grp_all:start
property $id="cib-bootstrap-options" \
	dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
	cluster-infrastructure="openais" \
	expected-quorum-votes="3" \
	no-quorum-policy="stop" \
	stonith-enabled="false" \
	last-lrm-refresh="1370360692" \
	default-resource-stickiness="100" \
	maintenance-mode="false"
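
(As a side note, and only as a sketch: this is how I'd double-check the quorum policy from the shell. The "ignore" value at the end is just an example of what could be set, not what this cluster is running, since the config above uses no-quorum-policy="stop".)

# query the current quorum policy from the CIB
crm_attribute -t crm_config -n no-quorum-policy -G

# example only: keep resources running even without quorum
# crm configure property no-quorum-policy=ignore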

Best regards,
Denis Witt


