[Pacemaker] Problem with stonith in rhel7 + pacemaker 1.1.10 + fence_virsh
Digimer
lists at alteeve.ca
Sat Dec 21 20:39:46 UTC 2013
Hi all,
I'm trying to learn pacemaker (still), using a pair of RHEL 7 beta
VMs. I've got stonith configured and it technically works (the crashed
node reboots), but pacemaker hangs afterwards...
Here is the config:
====
Cluster Name: rhel7-pcmk
Corosync Nodes:
rhel7-01.alteeve.ca rhel7-02.alteeve.ca
Pacemaker Nodes:
rhel7-01.alteeve.ca rhel7-02.alteeve.ca
Resources:
Stonith Devices:
Resource: fence_n01_virsh (class=stonith type=fence_virsh)
Attributes: pcmk_host_list=rhel7-01 ipaddr=lemass action=reboot
login=root passwd_script=/root/lemass.pw delay=15 port=rhel7_01
Operations: monitor interval=60s (fence_n01_virsh-monitor-interval-60s)
Resource: fence_n02_virsh (class=stonith type=fence_virsh)
Attributes: pcmk_host_list=rhel7-02 ipaddr=lemass action=reboot
login=root passwd_script=/root/lemass.pw port=rhel7_02
Operations: monitor interval=60s (fence_n02_virsh-monitor-interval-60s)
Fencing Levels:
Location Constraints:
Ordering Constraints:
Colocation Constraints:
Cluster Properties:
cluster-infrastructure: corosync
dc-version: 1.1.10-19.el7-368c726
no-quorum-policy: ignore
stonith-enabled: true
====
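(For reference, I built these with pcs along these lines; a sketch from
memory, but the attributes match the dump above:)
====
pcs stonith create fence_n01_virsh fence_virsh \
    pcmk_host_list="rhel7-01" ipaddr="lemass" action="reboot" \
    login="root" passwd_script="/root/lemass.pw" delay="15" \
    port="rhel7_01" op monitor interval=60s
pcs stonith create fence_n02_virsh fence_virsh \
    pcmk_host_list="rhel7-02" ipaddr="lemass" action="reboot" \
    login="root" passwd_script="/root/lemass.pw" port="rhel7_02" \
    op monitor interval=60s
====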
Here are the logs:
====
Dec 21 14:36:07 rhel7-01 corosync[1709]: [TOTEM ] A processor failed,
forming new configuration.
Dec 21 14:36:09 rhel7-01 corosync[1709]: [TOTEM ] A new membership
(192.168.122.101:24) was formed. Members left: 2
Dec 21 14:36:09 rhel7-01 corosync[1709]: [QUORUM] Members[1]: 1
Dec 21 14:36:09 rhel7-01 corosync[1709]: [MAIN ] Completed service
synchronization, ready to provide service.
Dec 21 14:36:09 rhel7-01 crmd[1730]: notice: crm_update_peer_state:
pcmk_quorum_notification: Node rhel7-02.alteeve.ca[2] - state is now
lost (was member)
Dec 21 14:36:09 rhel7-01 crmd[1730]: warning: reap_dead_nodes: Our DC
node (rhel7-02.alteeve.ca) left the cluster
Dec 21 14:36:09 rhel7-01 crmd[1730]: notice: do_state_transition: State
transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION
cause=C_FSA_INTERNAL origin=reap_dead_nodes ]
Dec 21 14:36:09 rhel7-01 pacemakerd[1724]: notice:
crm_update_peer_state: pcmk_quorum_notification: Node
rhel7-02.alteeve.ca[2] - state is now lost (was member)
Dec 21 14:36:09 rhel7-01 crmd[1730]: notice: do_state_transition: State
transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
cause=C_FSA_INTERNAL origin=do_election_check ]
Dec 21 14:36:10 rhel7-01 attrd[1728]: notice: attrd_local_callback:
Sending full refresh (origin=crmd)
Dec 21 14:36:10 rhel7-01 attrd[1728]: notice: attrd_trigger_update:
Sending flush op to all hosts for: probe_complete (true)
Dec 21 14:36:11 rhel7-01 pengine[1729]: notice: unpack_config: On loss
of CCM Quorum: Ignore
Dec 21 14:36:11 rhel7-01 pengine[1729]: warning: pe_fence_node: Node
rhel7-02.alteeve.ca will be fenced because fence_n02_virsh is thought to
be active there
Dec 21 14:36:11 rhel7-01 pengine[1729]: warning: custom_action: Action
fence_n02_virsh_stop_0 on rhel7-02.alteeve.ca is unrunnable (offline)
Dec 21 14:36:11 rhel7-01 pengine[1729]: warning: stage6: Scheduling Node
rhel7-02.alteeve.ca for STONITH
Dec 21 14:36:11 rhel7-01 pengine[1729]: notice: LogActions: Move
fence_n02_virsh (Started rhel7-02.alteeve.ca -> rhel7-01.alteeve.ca)
Dec 21 14:36:11 rhel7-01 pengine[1729]: warning: process_pe_message:
Calculated Transition 0: /var/lib/pacemaker/pengine/pe-warn-2.bz2
Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: te_fence_node: Executing
reboot fencing operation (11) on rhel7-02.alteeve.ca (timeout=60000)
Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: notice: handle_request:
Client crmd.1730.4f6ea9e1 wants to fence (reboot) 'rhel7-02.alteeve.ca'
with device '(any)'
Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: notice:
initiate_remote_stonith_op: Initiating remote operation reboot for
rhel7-02.alteeve.ca: ea720bbf-aeab-43bb-a196-3a4c091dea75 (0)
Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: notice:
can_fence_host_with_device: fence_n01_virsh can not fence
rhel7-02.alteeve.ca: static-list
Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: notice:
can_fence_host_with_device: fence_n02_virsh can not fence
rhel7-02.alteeve.ca: static-list
Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: error: remote_op_done:
Operation reboot of rhel7-02.alteeve.ca by rhel7-01.alteeve.ca for
crmd.1730@rhel7-01.alteeve.ca.ea720bbf: No such device
Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: tengine_stonith_callback:
Stonith operation 2/11:0:0:52e1fdf2-0b3a-42be-b7df-4d9dadb8d98b: No such
device (-19)
Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: tengine_stonith_callback:
Stonith operation 2 for rhel7-02.alteeve.ca failed (No such device):
aborting transition.
Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: tengine_stonith_notify:
Peer rhel7-02.alteeve.ca was not terminated (reboot) by
rhel7-01.alteeve.ca for rhel7-01.alteeve.ca: No such device
(ref=ea720bbf-aeab-43bb-a196-3a4c091dea75) by client crmd.1730
Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: run_graph: Transition 0
(Complete=1, Pending=0, Fired=0, Skipped=5, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: too_many_st_failures: No
devices found in cluster to fence rhel7-02.alteeve.ca, giving up
Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
====
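(To rule out the agent itself, fence_virsh can be run by hand with the
same credentials; a sketch using the standard fence-agent options:)
====
# Ask the hypervisor 'lemass' for the state of guest 'rhel7_02',
# reading the root password from the same script the resource uses.
fence_virsh -a lemass -l root -S /root/lemass.pw -n rhel7_02 -o status
====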
I've tried both the full host names and the short host names in
'pcmk_host_list=' (the 'static-list' rejections above suggest a name
mismatch), but got the same result both times.
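(Concretely, the two variants I tested looked like this; same idea for
fence_n01_virsh:)
====
pcs stonith update fence_n02_virsh pcmk_host_list="rhel7-02"
pcs stonith update fence_n02_virsh pcmk_host_list="rhel7-02.alteeve.ca"
====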
Versions:
====
pacemaker-1.1.10-19.el7.x86_64
pcs-0.9.99-2.el7.x86_64
====
Can someone hit me with a clustick?
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?