[Pacemaker] node offline after fencing (pacemakerd hangs)
Ulrich Leodolter
ulrich.leodolter at obvsg.at
Tue Jul 17 13:24:29 UTC 2012
Hi,
I have set up a very basic 2-node cluster on RHEL 6.3.
The first thing I tried was to set up a stonith/fence_ipmilan
resource.
Fencing seems to work: if I kill corosync on one node,
it is rebooted (via IPMI) by the other node.
But after the restart the cluster doesn't come back to normal
operation; it looks like pacemakerd hangs and the
node status stays offline.
I found only one way to fix the problem:
killall -9 pacemakerd
service pacemaker start
After that both nodes are online. Below you can see my
cluster configuration and the corosync.log messages, which
repeat forever while pacemakerd hangs.
I am new to Pacemaker and followed the "Clusters from Scratch"
guide for the first setup; the fence_ipmilan information
is from Google :-)
Can you give me any tips? What is wrong with this basic cluster
config? I don't want to add more resources (KVM virtual
machines) until fencing is configured correctly.
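In case it is useful: the IPMI side itself can be checked by hand with the fence agent, using a status query only (no reboot). Something like this, assuming the same address and credentials as in my config below:

```shell
# Query pcmk1's BMC over IPMI lanplus and print the chassis power state.
# -a: BMC address, -l: login, -p: password, -P: use lanplus, -o: action, -v: verbose
fence_ipmilan -a 192.168.120.171 -l pcmk -p xxx -P -o status -v
```

If this status query fails, the cluster's fence device can't work either, so it is worth confirming before debugging Pacemaker itself.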
Thanks,
Ulrich
[root at pcmk1 ~]# crm configure show
node pcmk1 \
attributes standby="off"
node pcmk2 \
attributes standby="off"
primitive p_stonith_pcmk1 stonith:fence_ipmilan \
params auth="password" ipaddr="192.168.120.171" passwd="xxx" lanplus="true" login="pcmk" timeout="20s" power_wait="5s" verbose="true" pcmk_host_check="static-list" pcmk_host_list="pcmk1" \
meta target-role="started"
primitive p_stonith_pcmk2 stonith:fence_ipmilan \
params auth="password" ipaddr="192.168.120.172" passwd="xxx" lanplus="true" login="pcmk" timeout="20s" power_wait="5s" verbose="true" pcmk_host_check="static-list" pcmk_host_list="pcmk2" \
meta target-role="started"
location loc_p_stonith_pcmk1_pcmk1 p_stonith_pcmk1 -inf: pcmk1
location loc_p_stonith_pcmk2_pcmk2 p_stonith_pcmk2 -inf: pcmk2
property $id="cib-bootstrap-options" \
expected-quorum-votes="2" \
dc-version="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" \
no-quorum-policy="ignore" \
cluster-infrastructure="openais"
rsc_defaults $id="rsc-options" \
resource-stickiness="200"
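One thing I notice when comparing with "Clusters from Scratch": my property section has no explicit stonith-enabled setting. If I understand the guide correctly, it should be switched on once the fence devices exist; a sketch in crm shell syntax:

```shell
# Enable fencing cluster-wide (per the "Clusters from Scratch" guide);
# without this, configured stonith resources may not be used as expected.
crm configure property stonith-enabled="true"
```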
/var/log/cluster/corosync.log:
Jul 13 11:29:41 [1859] pcmk2 crmd: info: do_dc_release: DC role released
Jul 13 11:29:41 [1859] pcmk2 crmd: info: do_te_control: Transitioner is now inactive
Jul 13 11:29:41 [1854] pcmk2 cib: info: set_crm_log_level: New log level: 3 0
Jul 13 11:30:01 [1859] pcmk2 crmd: info: crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped (20000ms)
Jul 13 11:30:01 [1859] pcmk2 crmd: warning: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Jul 13 11:30:01 [1859] pcmk2 crmd: notice: do_state_transition: State transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jul 13 11:30:01 [1859] pcmk2 crmd: info: do_election_count_vote: Election 8 (owner: pcmk1) lost: vote from pcmk1 (Uptime)
Jul 13 11:30:01 [1859] pcmk2 crmd: notice: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
--
Ulrich Leodolter <ulrich.leodolter at obvsg.at>
OBVSG