[Pacemaker] node offline after fencing (pacemakerd hangs)
Ulrich Leodolter
ulrich.leodolter at obvsg.at
Wed Jul 18 13:57:43 UTC 2012
Hi,

After adding a second ring to corosync.conf the problem seems to be gone:
after killing corosync the node is fenced by the other node, and after the
reboot the cluster is fully operational.

Is it essential to have at least two rings? Maybe there is a network timing
problem (but I can't see any error messages).

The interface on ring 0 (192.168.20.171) is a bridge, the interface on
ring 1 (10.10.10.171) is a normal Ethernet interface.
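
For reference, the totem section now carries one interface block per ring,
roughly like this (only a sketch, not the literal file: the bindnetaddr,
mcastaddr and mcastport values are just examples for my two networks, but
rrp_mode has to be changed away from its default of "none"):

totem {
        version: 2
        # "passive" or "active"; the default "none" disables redundant rings
        rrp_mode: passive

        interface {
                ringnumber: 0
                # bridged LAN, assuming 192.168.20.0/24
                bindnetaddr: 192.168.20.0
                mcastaddr: 239.255.1.1
                mcastport: 5405
        }
        interface {
                ringnumber: 1
                # direct Ethernet link, assuming 10.10.10.0/24
                bindnetaddr: 10.10.10.0
                mcastaddr: 239.255.2.1
                mcastport: 5407
        }
}
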
Regards,
Ulrich
[root@pcmk1 ~]# corosync-cfgtool -s
Printing ring status.
Local node ID -1424709440
RING ID 0
id = 192.168.20.171
status = ring 0 active with no faults
RING ID 1
id = 10.10.10.171
status = ring 1 active with no faults
On Tue, 2012-07-17 at 15:24 +0200, Ulrich Leodolter wrote:
> Hi,
>
> I have set up a very basic 2-node cluster on RHEL 6.3. The first thing
> I tried was to set up a stonith/fence_ipmilan resource.
>
> Fencing seems to work: if I kill corosync on one node, it is restarted
> (IPMI reboot) by the other node.
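>
> (The fence agent itself can also be exercised by hand, outside the
> cluster; a rough sketch using the same parameters as in the config
> below, assuming a current fence-agents package:
>
>   # query the power status of pcmk2 via its IPMI interface
>   fence_ipmilan -a 192.168.120.172 -l pcmk -p xxx -P -o status
>
> If that already fails, the problem is on the IPMI/network side rather
> than in Pacemaker.)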
>
> But after the restart the cluster doesn't come back to normal operation;
> it looks like pacemakerd hangs and the node status is offline.
>
> I found only one way to fix the problem:
>
> killall -9 pacemakerd
> service pacemakerd start
>
> After that both nodes are online. Below you can see my cluster
> configuration and the corosync.log messages, which repeat forever while
> pacemakerd hangs.
>
> I am new to Pacemaker and followed the "Clusters from Scratch" guide for
> the first setup. The information about fence_ipmilan is from Google :-)
>
> Can you give me some tips? What is wrong with this basic cluster config?
> I don't want to add more resources (KVM virtual machines) until fencing
> is configured correctly.
>
> Thanks,
> Ulrich
>
>
>
> [root@pcmk1 ~]# crm configure show
> node pcmk1 \
> attributes standby="off"
> node pcmk2 \
> attributes standby="off"
> primitive p_stonith_pcmk1 stonith:fence_ipmilan \
> params auth="password" ipaddr="192.168.120.171" passwd="xxx" lanplus="true" login="pcmk" timeout="20s" power_wait="5s" verbose="true" pcmk_host_check="static-list" pcmk_host_list="pcmk1" \
> meta target-role="started"
> primitive p_stonith_pcmk2 stonith:fence_ipmilan \
> params auth="password" ipaddr="192.168.120.172" passwd="xxx" lanplus="true" login="pcmk" timeout="20s" power_wait="5s" verbose="true" pcmk_host_check="static-list" pcmk_host_list="pcmk2" \
> meta target-role="started"
> location loc_p_stonith_pcmk1_pcmk1 p_stonith_pcmk1 -inf: pcmk1
> location loc_p_stonith_pcmk2_pcmk2 p_stonith_pcmk2 -inf: pcmk2
> property $id="cib-bootstrap-options" \
> expected-quorum-votes="2" \
> dc-version="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" \
> no-quorum-policy="ignore" \
> cluster-infrastructure="openais"
> rsc_defaults $id="rsc-options" \
> resource-stickiness="200"
>
>
> /var/log/cluster/corosync.log:
>
> Jul 13 11:29:41 [1859] pcmk2 crmd: info: do_dc_release: DC role released
> Jul 13 11:29:41 [1859] pcmk2 crmd: info: do_te_control: Transitioner is now inactive
> Jul 13 11:29:41 [1854] pcmk2 cib: info: set_crm_log_level: New log level: 3 0
> Jul 13 11:30:01 [1859] pcmk2 crmd: info: crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped (20000ms)
> Jul 13 11:30:01 [1859] pcmk2 crmd: warning: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
> Jul 13 11:30:01 [1859] pcmk2 crmd: notice: do_state_transition: State transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Jul 13 11:30:01 [1859] pcmk2 crmd: info: do_election_count_vote: Election 8 (owner: pcmk1) lost: vote from pcmk1 (Uptime)
> Jul 13 11:30:01 [1859] pcmk2 crmd: notice: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
>
>