[Pacemaker] Problem with stonith in rhel7 + pacemaker 1.1.10 + fence_virsh
David Vossel
dvossel at redhat.com
Mon Dec 23 18:30:25 UTC 2013
----- Original Message -----
> From: "Digimer" <lists at alteeve.ca>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Saturday, December 21, 2013 2:39:46 PM
> Subject: [Pacemaker] Problem with stonith in rhel7 + pacemaker 1.1.10 + fence_virsh
>
> Hi all,
>
> I'm trying to learn pacemaker (still) using a pair of RHEL 7 beta
> VMs. I've got stonith configured and it technically works (crashed node
> reboots), but pacemaker hangs...
>
> Here is the config:
>
> ====
> Cluster Name: rhel7-pcmk
> Corosync Nodes:
> rhel7-01.alteeve.ca rhel7-02.alteeve.ca
> Pacemaker Nodes:
> rhel7-01.alteeve.ca rhel7-02.alteeve.ca
>
> Resources:
>
> Stonith Devices:
> Resource: fence_n01_virsh (class=stonith type=fence_virsh)
> Attributes: pcmk_host_list=rhel7-01 ipaddr=lemass action=reboot
> login=root passwd_script=/root/lemass.pw delay=15 port=rhel7_01
> Operations: monitor interval=60s (fence_n01_virsh-monitor-interval-60s)
> Resource: fence_n02_virsh (class=stonith type=fence_virsh)
> Attributes: pcmk_host_list=rhel7-02 ipaddr=lemass action=reboot
When using fence_virt, the easiest way to configure everything is to name the actual virtual machines the same as their corosync node names.
If you run this command inside one of the virtual machines, you can see the names fence_virt thinks the nodes have.
fence_xvm -o list
node1 c4dbe904-f51a-d53f-7ef0-2b03361c6401 on
node2 c4dbe904-f51a-d53f-7ef0-2b03361c6402 on
node3 c4dbe904-f51a-d53f-7ef0-2b03361c6403 on
If you name the VM the same as the node name, you don't even have to set a static host list; stonith will do all that magic behind the scenes. If the node names don't match the VM names, try the 'pcmk_host_map' option. I believe you can use it to map each corosync node name to its VM's name.
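For example (untested sketch based on the config you posted; adjust names and paths to match your setup), a device using pcmk_host_map instead of pcmk_host_list might look something like:

# untested sketch: maps the corosync node name to the virsh domain name from your config
pcs stonith create fence_n01_virsh fence_virsh \
    pcmk_host_map="rhel7-01.alteeve.ca:rhel7_01" \
    ipaddr=lemass login=root passwd_script=/root/lemass.pw delay=15 \
    op monitor interval=60s

And if the VM were simply named rhel7-01.alteeve.ca to match the node name, you could drop pcmk_host_map and pcmk_host_list entirely and let stonith work it out from the agent's own list.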
Hope that helps :)
-- Vossel
> login=root passwd_script=/root/lemass.pw port=rhel7_02
> Operations: monitor interval=60s (fence_n02_virsh-monitor-interval-60s)
> Fencing Levels:
>
> Location Constraints:
> Ordering Constraints:
> Colocation Constraints:
>
> Cluster Properties:
> cluster-infrastructure: corosync
> dc-version: 1.1.10-19.el7-368c726
> no-quorum-policy: ignore
> stonith-enabled: true
> ====
>
> Here are the logs:
>
> ====
> Dec 21 14:36:07 rhel7-01 corosync[1709]: [TOTEM ] A processor failed,
> forming new configuration.
> Dec 21 14:36:09 rhel7-01 corosync[1709]: [TOTEM ] A new membership
> (192.168.122.101:24) was formed. Members left: 2
> Dec 21 14:36:09 rhel7-01 corosync[1709]: [QUORUM] Members[1]: 1
> Dec 21 14:36:09 rhel7-01 corosync[1709]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Dec 21 14:36:09 rhel7-01 crmd[1730]: notice: crm_update_peer_state:
> pcmk_quorum_notification: Node rhel7-02.alteeve.ca[2] - state is now
> lost (was member)
> Dec 21 14:36:09 rhel7-01 crmd[1730]: warning: reap_dead_nodes: Our DC
> node (rhel7-02.alteeve.ca) left the cluster
> Dec 21 14:36:09 rhel7-01 crmd[1730]: notice: do_state_transition: State
> transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION
> cause=C_FSA_INTERNAL origin=reap_dead_nodes ]
> Dec 21 14:36:09 rhel7-01 pacemakerd[1724]: notice:
> crm_update_peer_state: pcmk_quorum_notification: Node
> rhel7-02.alteeve.ca[2] - state is now lost (was member)
> Dec 21 14:36:09 rhel7-01 crmd[1730]: notice: do_state_transition: State
> transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
> cause=C_FSA_INTERNAL origin=do_election_check ]
> Dec 21 14:36:10 rhel7-01 attrd[1728]: notice: attrd_local_callback:
> Sending full refresh (origin=crmd)
> Dec 21 14:36:10 rhel7-01 attrd[1728]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: probe_complete (true)
> Dec 21 14:36:11 rhel7-01 pengine[1729]: notice: unpack_config: On loss
> of CCM Quorum: Ignore
> Dec 21 14:36:11 rhel7-01 pengine[1729]: warning: pe_fence_node: Node
> rhel7-02.alteeve.ca will be fenced because fence_n02_virsh is thought to
> be active there
> Dec 21 14:36:11 rhel7-01 pengine[1729]: warning: custom_action: Action
> fence_n02_virsh_stop_0 on rhel7-02.alteeve.ca is unrunnable (offline)
> Dec 21 14:36:11 rhel7-01 pengine[1729]: warning: stage6: Scheduling Node
> rhel7-02.alteeve.ca for STONITH
> Dec 21 14:36:11 rhel7-01 pengine[1729]: notice: LogActions: Move
> fence_n02_virsh (Started rhel7-02.alteeve.ca -> rhel7-01.alteeve.ca)
> Dec 21 14:36:11 rhel7-01 pengine[1729]: warning: process_pe_message:
> Calculated Transition 0: /var/lib/pacemaker/pengine/pe-warn-2.bz2
> Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: te_fence_node: Executing
> reboot fencing operation (11) on rhel7-02.alteeve.ca (timeout=60000)
> Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: notice: handle_request:
> Client crmd.1730.4f6ea9e1 wants to fence (reboot) 'rhel7-02.alteeve.ca'
> with device '(any)'
> Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: notice:
> initiate_remote_stonith_op: Initiating remote operation reboot for
> rhel7-02.alteeve.ca: ea720bbf-aeab-43bb-a196-3a4c091dea75 (0)
> Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: notice:
> can_fence_host_with_device: fence_n01_virsh can not fence
> rhel7-02.alteeve.ca: static-list
> Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: notice:
> can_fence_host_with_device: fence_n02_virsh can not fence
> rhel7-02.alteeve.ca: static-list
> Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: error: remote_op_done:
> Operation reboot of rhel7-02.alteeve.ca by rhel7-01.alteeve.ca for
> crmd.1730 at rhel7-01.alteeve.ca.ea720bbf: No such device
> Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: tengine_stonith_callback:
> Stonith operation 2/11:0:0:52e1fdf2-0b3a-42be-b7df-4d9dadb8d98b: No such
> device (-19)
> Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: tengine_stonith_callback:
> Stonith operation 2 for rhel7-02.alteeve.ca failed (No such device):
> aborting transition.
> Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: tengine_stonith_notify:
> Peer rhel7-02.alteeve.ca was not terminated (reboot) by
> rhel7-01.alteeve.ca for rhel7-01.alteeve.ca: No such device
> (ref=ea720bbf-aeab-43bb-a196-3a4c091dea75) by client crmd.1730
> Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: run_graph: Transition 0
> (Complete=1, Pending=0, Fired=0, Skipped=5, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
> Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: too_many_st_failures: No
> devices found in cluster to fence rhel7-02.alteeve.ca, giving up
> Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> ====
>
> I've tried with the full host names and with the short host names in
> 'pcmk_host_list=', but got the same result both times.
>
> Versions:
> ====
> pacemaker-1.1.10-19.el7.x86_64
> pcs-0.9.99-2.el7.x86_64
> ====
>
> Can someone hit me with a clustick?
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>