[Pacemaker] Recovery after lost quorum

Wed Jun 5 00:15:54 UTC 2013

On 05/06/2013, at 9:22 AM, Denis Witt <denis.witt at concepts-and-training.de> wrote:

> 
> Am 05.06.2013 um 00:52 schrieb Andrew Beekhof <andrew at beekhof.net>:
> 
>>> been restored the resources aren't restarted. Running crm_resource -P
>>> brings everything up, but of course it would be nice if this happened
>>> automatically. Is there any way to achieve this?
>> 
>> It should happen automatically.
>> Logs?
> 
> Hi Andrew,
> 
> thanks for your reply.
> 
> Here are the logs:
> 

[snip]

> Jun  5 01:11:06 test4 pengine: [18625]: WARN: cluster_status: We do not have quorum - fencing and resource management disabled
> Jun  5 01:11:06 test4 pengine: [18625]: notice: LogActions: Start   pingtest:0 (test4 - blocked)
> Jun  5 01:11:06 test4 pengine: [18625]: notice: LogActions: Start   drbd:0 (test4 - blocked)

Here's your reason.  We didn't get quorum until:

> Jun  5 01:11:11 test4 crmd: [18626]: notice: ais_dispatch_message: Membership 128: quorum acquired
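
To make the ordering explicit, here is a minimal sketch in Python that pulls the timestamps out of the two log lines quoted above and reports the gap between them. The helper names (stamp, quorum_gap) are made up for illustration, and the year is an assumption passed in by hand because syslog does not record it.

```python
import re
from datetime import datetime

# The two syslog lines quoted in this thread (leading "> " removed).
LINES = [
    "Jun  5 01:11:06 test4 pengine: [18625]: WARN: cluster_status: "
    "We do not have quorum - fencing and resource management disabled",
    "Jun  5 01:11:11 test4 crmd: [18626]: notice: ais_dispatch_message: "
    "Membership 128: quorum acquired",
]

# Month name, day of month, HH:MM:SS at the start of a syslog line.
STAMP = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s")


def stamp(line, year=2013):
    """Parse the syslog timestamp; the year is assumed, syslog omits it."""
    m = STAMP.match(line)
    if m is None:
        raise ValueError("no syslog timestamp in: %r" % line)
    month, day, clock = m.groups()
    text = "%d %s %s %s" % (year, month, day, clock)
    return datetime.strptime(text, "%Y %b %d %H:%M:%S")


def quorum_gap(lines):
    """Seconds between the 'no quorum' warning and 'quorum acquired'."""
    lost = next(l for l in lines if "We do not have quorum" in l)
    gained = next(l for l in lines if "quorum acquired" in l)
    return (stamp(gained) - stamp(lost)).total_seconds()


print(quorum_gap(LINES))  # 5.0
```

So the policy engine computed the "blocked" actions five seconds before membership 128 brought quorum.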

[snip]

> 
> Please note that at the moment only two of the three nodes are online, but quorum is established,

Actually not.
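
For reference, with expected-quorum-votes="3" a simple majority needs two votes, so two nodes can be quorate, but only once both are counted as members. The log above shows that had not happened yet at 01:11:06. A rough sketch of that arithmetic in Python, assuming the usual strict-majority rule and one vote per node (has_quorum is a hypothetical helper, not a Pacemaker function):

```python
def has_quorum(votes_present, expected_votes):
    """Strict majority: more than half of the expected votes are present."""
    return 2 * votes_present > expected_votes


# expected-quorum-votes="3" as in the configuration quoted in this thread.
for present in range(4):
    print(present, has_quorum(present, 3))
```

Under that assumption one node alone is never quorate in this cluster, and two are.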

> as expected. Both nodes are running corosync and pacemaker, but the second node doesn't have any of the configured resources installed (so I got "not installed" errors there; usually pacemaker is disabled on this node). The resources aren't started either if pacemaker is disabled on this node (only corosync running).
> 
> analysis.txt from hb_report states:
> 
> Log patterns:
> Jun  5 01:14:11 test4 crmd: [18626]: ERROR: crm_timer_popped: Integration Timer (I_INTEGRATED) just popped in state S_INTEGRATION! (180000ms)
> 
> My config:
> 
> node backup3 \
> 	attributes standby="off"
> node test3
> node test4
> primitive apache lsb:apache2 \
> 	op monitor interval="10" timeout="20" \
> 	meta target-role="Started"
> primitive drbd ocf:linbit:drbd \
> 	params drbd_resource="www_r0" \
> 	op monitor interval="10"
> primitive fs_drbd ocf:heartbeat:Filesystem \
> 	params device="/dev/drbd0" directory="/var/www" fstype="ext4" \
> 	op monitor interval="5" \
> 	meta target-role="Started"
> primitive pingtest ocf:pacemaker:ping \
> 	params multiplier="1000" host_list="192.168.100.19" \
> 	op monitor interval="5"
> primitive sip ocf:heartbeat:IPaddr2 \
> 	params ip="192.168.100.30" nic="eth0" \
> 	op monitor interval="10" timeout="20" \
> 	meta target-role="Started"
> group grp_all sip fs_drbd apache
> ms ms_drbd drbd \
> 	meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> clone clone_pingtest pingtest
> location loc_all_on_best_ping grp_all \
> 	rule $id="loc_all_on_best_ping-rule" -inf: not_defined pingd or pingd lt 1000
> colocation coloc_all_on_drbd inf: grp_all ms_drbd:Master
> order order_all_after_drbd inf: ms_drbd:promote grp_all:start
> property $id="cib-bootstrap-options" \
> 	dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
> 	cluster-infrastructure="openais" \
> 	expected-quorum-votes="3" \
> 	no-quorum-policy="stop" \
> 	stonith-enabled="false" \
> 	last-lrm-refresh="1370360692" \
> 	default-resource-stickiness="100" \
> 	maintenance-mode="false"
> 
> Best regards,
> Denis Witt
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org