[Pacemaker] Problem with stonith in rhel7 + pacemaker 1.1.10 + fence_virsh

Sat Dec 21 20:39:46 UTC 2013

Hi all,

I'm (still) trying to learn Pacemaker, using a pair of RHEL 7 beta 
VMs. I've got stonith configured and it technically works (a crashed 
node gets rebooted), but Pacemaker then hangs.

Here is the config:

====
Cluster Name: rhel7-pcmk
Corosync Nodes:
  rhel7-01.alteeve.ca rhel7-02.alteeve.ca
Pacemaker Nodes:
  rhel7-01.alteeve.ca rhel7-02.alteeve.ca

Resources:

Stonith Devices:
  Resource: fence_n01_virsh (class=stonith type=fence_virsh)
   Attributes: pcmk_host_list=rhel7-01 ipaddr=lemass action=reboot 
login=root passwd_script=/root/lemass.pw delay=15 port=rhel7_01
   Operations: monitor interval=60s (fence_n01_virsh-monitor-interval-60s)
  Resource: fence_n02_virsh (class=stonith type=fence_virsh)
   Attributes: pcmk_host_list=rhel7-02 ipaddr=lemass action=reboot 
login=root passwd_script=/root/lemass.pw port=rhel7_02
   Operations: monitor interval=60s (fence_n02_virsh-monitor-interval-60s)
Fencing Levels:

Location Constraints:
Ordering Constraints:
Colocation Constraints:

Cluster Properties:
  cluster-infrastructure: corosync
  dc-version: 1.1.10-19.el7-368c726
  no-quorum-policy: ignore
  stonith-enabled: true
====

Here are the logs:

====
Dec 21 14:36:07 rhel7-01 corosync[1709]: [TOTEM ] A processor failed, 
forming new configuration.
Dec 21 14:36:09 rhel7-01 corosync[1709]: [TOTEM ] A new membership 
(192.168.122.101:24) was formed. Members left: 2
Dec 21 14:36:09 rhel7-01 corosync[1709]: [QUORUM] Members[1]: 1
Dec 21 14:36:09 rhel7-01 corosync[1709]: [MAIN  ] Completed service 
synchronization, ready to provide service.
Dec 21 14:36:09 rhel7-01 crmd[1730]: notice: crm_update_peer_state: 
pcmk_quorum_notification: Node rhel7-02.alteeve.ca[2] - state is now 
lost (was member)
Dec 21 14:36:09 rhel7-01 crmd[1730]: warning: reap_dead_nodes: Our DC 
node (rhel7-02.alteeve.ca) left the cluster
Dec 21 14:36:09 rhel7-01 crmd[1730]: notice: do_state_transition: State 
transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION 
cause=C_FSA_INTERNAL origin=reap_dead_nodes ]
Dec 21 14:36:09 rhel7-01 pacemakerd[1724]: notice: 
crm_update_peer_state: pcmk_quorum_notification: Node 
rhel7-02.alteeve.ca[2] - state is now lost (was member)
Dec 21 14:36:09 rhel7-01 crmd[1730]: notice: do_state_transition: State 
transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC 
cause=C_FSA_INTERNAL origin=do_election_check ]
Dec 21 14:36:10 rhel7-01 attrd[1728]: notice: attrd_local_callback: 
Sending full refresh (origin=crmd)
Dec 21 14:36:10 rhel7-01 attrd[1728]: notice: attrd_trigger_update: 
Sending flush op to all hosts for: probe_complete (true)
Dec 21 14:36:11 rhel7-01 pengine[1729]: notice: unpack_config: On loss 
of CCM Quorum: Ignore
Dec 21 14:36:11 rhel7-01 pengine[1729]: warning: pe_fence_node: Node 
rhel7-02.alteeve.ca will be fenced because fence_n02_virsh is thought to 
be active there
Dec 21 14:36:11 rhel7-01 pengine[1729]: warning: custom_action: Action 
fence_n02_virsh_stop_0 on rhel7-02.alteeve.ca is unrunnable (offline)
Dec 21 14:36:11 rhel7-01 pengine[1729]: warning: stage6: Scheduling Node 
rhel7-02.alteeve.ca for STONITH
Dec 21 14:36:11 rhel7-01 pengine[1729]: notice: LogActions: Move 
fence_n02_virsh	(Started rhel7-02.alteeve.ca -> rhel7-01.alteeve.ca)
Dec 21 14:36:11 rhel7-01 pengine[1729]: warning: process_pe_message: 
Calculated Transition 0: /var/lib/pacemaker/pengine/pe-warn-2.bz2
Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: te_fence_node: Executing 
reboot fencing operation (11) on rhel7-02.alteeve.ca (timeout=60000)
Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: notice: handle_request: 
Client crmd.1730.4f6ea9e1 wants to fence (reboot) 'rhel7-02.alteeve.ca' 
with device '(any)'
Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: notice: 
initiate_remote_stonith_op: Initiating remote operation reboot for 
rhel7-02.alteeve.ca: ea720bbf-aeab-43bb-a196-3a4c091dea75 (0)
Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: notice: 
can_fence_host_with_device: fence_n01_virsh can not fence 
rhel7-02.alteeve.ca: static-list
Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: notice: 
can_fence_host_with_device: fence_n02_virsh can not fence 
rhel7-02.alteeve.ca: static-list
Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: error: remote_op_done: 
Operation reboot of rhel7-02.alteeve.ca by rhel7-01.alteeve.ca for 
crmd.1730 at rhel7-01.alteeve.ca.ea720bbf: No such device
Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: tengine_stonith_callback: 
Stonith operation 2/11:0:0:52e1fdf2-0b3a-42be-b7df-4d9dadb8d98b: No such 
device (-19)
Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: tengine_stonith_callback: 
Stonith operation 2 for rhel7-02.alteeve.ca failed (No such device): 
aborting transition.
Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: tengine_stonith_notify: 
Peer rhel7-02.alteeve.ca was not terminated (reboot) by 
rhel7-01.alteeve.ca for rhel7-01.alteeve.ca: No such device 
(ref=ea720bbf-aeab-43bb-a196-3a4c091dea75) by client crmd.1730
Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: run_graph: Transition 0 
(Complete=1, Pending=0, Fired=0, Skipped=5, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: too_many_st_failures: No 
devices found in cluster to fence rhel7-02.alteeve.ca, giving up
Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: do_state_transition: State 
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]
====
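
The lines that matter in the log above are the stonith-ng 
'can_fence_host_with_device' verdicts. As a rough aid for pulling them 
out of a longer log, here is a minimal Python sketch. The function name 
'fence_verdicts' is made up for illustration, and the positive 
"can fence" form of the message is an assumption, since only the 
"can not fence" form appears in this log.

```python
import re

# Matches stonith-ng verdicts such as:
#   can_fence_host_with_device: fence_n02_virsh can not fence
#   rhel7-02.alteeve.ca: static-list
# The "can fence" (positive) wording is assumed; it is not in the log above.
VERDICT = re.compile(
    r"can_fence_host_with_device:\s+(?P<device>\S+)\s+"
    r"(?P<verdict>can not|can)\s+fence\s+(?P<target>\S+):\s+(?P<check>\S+)"
)


def fence_verdicts(log_text):
    """Return (device, can_fence, target, check) tuples found in log_text."""
    # Mail clients hard-wrap long log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return [
        (m["device"], m["verdict"] == "can", m["target"], m["check"])
        for m in VERDICT.finditer(flat)
    ]


# Two entries copied from the log above, including the mail line wrapping.
sample = """\
Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: notice:
can_fence_host_with_device: fence_n01_virsh can not fence
rhel7-02.alteeve.ca: static-list
Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: notice:
can_fence_host_with_device: fence_n02_virsh can not fence
rhel7-02.alteeve.ca: static-list
"""

for device, ok, target, check in fence_verdicts(sample):
    print(f"{device}: {'can' if ok else 'cannot'} fence {target} ({check})")
```

Run against the sample, this reports that neither device is considered 
able to fence rhel7-02.alteeve.ca, and that both were rejected by the 
static-list check, which is why the operation ends in "No such device".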

I've tried both the full host names and the short host names in 
'pcmk_host_list=', but I get the same result both times.
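
To make the short-name versus full-name question concrete, here is a 
simplified Python model of a static host list check. It is only a 
sketch: 'static_list_allows' is a hypothetical helper, and both the 
separator handling and the exact, case-insensitive comparison are 
assumptions, not a description of what stonith-ng actually does.

```python
import re


def static_list_allows(pcmk_host_list, target):
    """Hypothetical model: True if target appears verbatim in the host list."""
    # Assumption: entries are separated by whitespace, commas or semicolons.
    names = [n for n in re.split(r"[\s,;]+", pcmk_host_list) if n]
    # Assumption: names are compared whole, ignoring case.
    return target.lower() in (n.lower() for n in names)


target = "rhel7-02.alteeve.ca"  # node name as it appears in the logs
for host_list in ("rhel7-02", "rhel7-02.alteeve.ca"):
    print(f"{host_list!r}: {static_list_allows(host_list, target)}")
```

Under these assumptions the short name is rejected and the full name is 
accepted. That would account for the failure with 
'pcmk_host_list=rhel7-02', but not for the same failure with the full 
host name, so name matching alone does not explain the log.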

Versions:
====
pacemaker-1.1.10-19.el7.x86_64
pcs-0.9.99-2.el7.x86_64
====

Can someone hit me with a clustick?

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?