[Pacemaker] Problem with stonith in rhel7 + pacemaker 1.1.10 + fence_virsh
Digimer
lists at alteeve.ca
Tue Dec 24 00:57:53 UTC 2013
On 23/12/13 04:31 PM, Digimer wrote:
> On 23/12/13 02:31 PM, David Vossel wrote:
>>
>> ----- Original Message -----
>>> From: "Digimer" <lists at alteeve.ca>
>>> To: "The Pacemaker cluster resource manager"
>>> <pacemaker at oss.clusterlabs.org>
>>> Sent: Monday, December 23, 2013 12:42:23 PM
>>> Subject: Re: [Pacemaker] Problem with stonith in rhel7 + pacemaker
>>> 1.1.10 + fence_virsh
>>>
>>> On 23/12/13 01:30 PM, David Vossel wrote:
>>>> ----- Original Message -----
>>>>> From: "Digimer" <lists at alteeve.ca>
>>>>> To: "The Pacemaker cluster resource manager"
>>>>> <pacemaker at oss.clusterlabs.org>
>>>>> Sent: Saturday, December 21, 2013 2:39:46 PM
>>>>> Subject: [Pacemaker] Problem with stonith in rhel7 + pacemaker
>>>>> 1.1.10 +
>>>>> fence_virsh
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm trying to learn pacemaker (still) using a pair of RHEL 7 beta
>>>>> VMs. I've got stonith configured and it technically works (crashed
>>>>> node
>>>>> reboots), but pacemaker hangs...
>>>>>
>>>>> Here is the config:
>>>>>
>>>>> ====
>>>>> Cluster Name: rhel7-pcmk
>>>>> Corosync Nodes:
>>>>> rhel7-01.alteeve.ca rhel7-02.alteeve.ca
>>>>> Pacemaker Nodes:
>>>>> rhel7-01.alteeve.ca rhel7-02.alteeve.ca
>>>>>
>>>>> Resources:
>>>>>
>>>>> Stonith Devices:
>>>>> Resource: fence_n01_virsh (class=stonith type=fence_virsh)
>>>>> Attributes: pcmk_host_list=rhel7-01 ipaddr=lemass action=reboot
>>>>> login=root passwd_script=/root/lemass.pw delay=15 port=rhel7_01
>>>>> Operations: monitor interval=60s
>>>>> (fence_n01_virsh-monitor-interval-60s)
>>>>> Resource: fence_n02_virsh (class=stonith type=fence_virsh)
>>>>> Attributes: pcmk_host_list=rhel7-02 ipaddr=lemass action=reboot
>>>>
>>>>
>>>> When using fence_virt, the easiest way to configure everything is to
>>>> name the actual virtual machines the same as their corosync node
>>>> names.
>>>>
>>>> If you run this command in a virtual machine, you can see the names
>>>> fence_virt thinks all the nodes are.
>>>> fence_xvm -o list
>>>> node1 c4dbe904-f51a-d53f-7ef0-2b03361c6401 on
>>>> node2 c4dbe904-f51a-d53f-7ef0-2b03361c6402 on
>>>> node3 c4dbe904-f51a-d53f-7ef0-2b03361c6403 on
>>>>
>>>> If you name the VM the same as the node name, you don't even have to
>>>> provide a static host list; stonith will do all that magic behind the
>>>> scenes. If the node names do not match, try the 'pcmk_host_map'
>>>> option. I believe you should be able to map the corosync node name to
>>>> the VM's name using that option.
>>>>
>>>> Hope that helps :)
>>>>
>>>> -- Vossel
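As a sketch of that pcmk_host_map suggestion, applied to the fence_virsh setup from the config above, a pcs command might look something like this (the device name, ipaddr, login, and port values are borrowed from that config purely for illustration; this is not a tested command):

```shell
# Hypothetical sketch: use pcmk_host_map instead of pcmk_host_list to
# map a corosync node name (left of the colon) to the virsh domain name
# (right of the colon) when the two don't match.
pcs stonith create fence_n01_virsh fence_virsh \
    ipaddr=lemass login=root passwd_script=/root/lemass.pw \
    pcmk_host_map="rhel7-01.alteeve.ca:rhel7_01" \
    op monitor interval=60s
```

With matching VM and node names, the pcmk_host_map line could simply be dropped.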
>>>
>>> Hi David,
>>>
>>> I'm using fence_virsh,
>>
>> ah sorry, missed that.
>>
>>> not fence_virtd/fence_xvm. For reasons I've
>>> not been able to resolve, fence_xvm has been unreliable on Fedora for
>>> some time now.
>>
>> the multicast bug :(
>
> That's the one.
>
> I'm rebuilding the nodes now with VM/virsh names that match the host
> names. We'll see whether that makes a difference.
>
This looks a little better:
====
Dec 23 19:53:33 an-c03n02 corosync[1652]: [TOTEM ] A processor failed,
forming new configuration.
Dec 23 19:53:34 an-c03n02 corosync[1652]: [TOTEM ] A new membership
(192.168.122.102:24) was formed. Members left: 1
Dec 23 19:53:34 an-c03n02 corosync[1652]: [QUORUM] Members[1]: 2
Dec 23 19:53:34 an-c03n02 corosync[1652]: [MAIN ] Completed service
synchronization, ready to provide service.
Dec 23 19:53:34 an-c03n02 pacemakerd[1667]: notice:
crm_update_peer_state: pcmk_quorum_notification: Node
an-c03n01.alteeve.ca[1] - state is now lost (was member)
Dec 23 19:53:34 an-c03n02 crmd[1673]: notice: crm_update_peer_state:
pcmk_quorum_notification: Node an-c03n01.alteeve.ca[1] - state is now
lost (was member)
Dec 23 19:53:34 an-c03n02 crmd[1673]: warning: match_down_event: No
match for shutdown action on 1
Dec 23 19:53:34 an-c03n02 crmd[1673]: notice: peer_update_callback:
Stonith/shutdown of an-c03n01.alteeve.ca not matched
Dec 23 19:53:34 an-c03n02 crmd[1673]: notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Dec 23 19:53:34 an-c03n02 crmd[1673]: warning: match_down_event: No
match for shutdown action on 1
Dec 23 19:53:34 an-c03n02 crmd[1673]: notice: peer_update_callback:
Stonith/shutdown of an-c03n01.alteeve.ca not matched
Dec 23 19:53:34 an-c03n02 attrd[1671]: notice: attrd_local_callback:
Sending full refresh (origin=crmd)
Dec 23 19:53:34 an-c03n02 attrd[1671]: notice: attrd_trigger_update:
Sending flush op to all hosts for: probe_complete (true)
Dec 23 19:53:35 an-c03n02 pengine[1672]: notice: unpack_config: On loss
of CCM Quorum: Ignore
Dec 23 19:53:35 an-c03n02 pengine[1672]: warning: pe_fence_node: Node
an-c03n01.alteeve.ca will be fenced because the node is no longer part
of the cluster
Dec 23 19:53:35 an-c03n02 pengine[1672]: warning:
determine_online_status: Node an-c03n01.alteeve.ca is unclean
Dec 23 19:53:35 an-c03n02 pengine[1672]: warning: custom_action: Action
fence_n01_virsh_stop_0 on an-c03n01.alteeve.ca is unrunnable (offline)
Dec 23 19:53:35 an-c03n02 pengine[1672]: warning: stage6: Scheduling
Node an-c03n01.alteeve.ca for STONITH
Dec 23 19:53:35 an-c03n02 pengine[1672]: notice: LogActions: Move
fence_n01_virsh (Started an-c03n01.alteeve.ca -> an-c03n02.alteeve.ca)
Dec 23 19:53:35 an-c03n02 pengine[1672]: warning: process_pe_message:
Calculated Transition 1: /var/lib/pacemaker/pengine/pe-warn-0.bz2
Dec 23 19:53:35 an-c03n02 crmd[1673]: notice: te_fence_node: Executing
reboot fencing operation (11) on an-c03n01.alteeve.ca (timeout=60000)
Dec 23 19:53:35 an-c03n02 stonith-ng[1669]: notice: handle_request:
Client crmd.1673.ebd55f11 wants to fence (reboot) 'an-c03n01.alteeve.ca'
with device '(any)'
Dec 23 19:53:35 an-c03n02 stonith-ng[1669]: notice:
initiate_remote_stonith_op: Initiating remote operation reboot for
an-c03n01.alteeve.ca: 12d11de0-ba58-4b28-b0ce-90069b49a177 (0)
Dec 23 19:53:35 an-c03n02 stonith-ng[1669]: notice:
can_fence_host_with_device: fence_n01_virsh can fence
an-c03n01.alteeve.ca: static-list
Dec 23 19:53:35 an-c03n02 stonith-ng[1669]: notice:
can_fence_host_with_device: fence_n02_virsh can not fence
an-c03n01.alteeve.ca: static-list
Dec 23 19:53:35 an-c03n02 stonith-ng[1669]: notice:
can_fence_host_with_device: fence_n01_virsh can fence
an-c03n01.alteeve.ca: static-list
Dec 23 19:53:35 an-c03n02 stonith-ng[1669]: notice:
can_fence_host_with_device: fence_n02_virsh can not fence
an-c03n01.alteeve.ca: static-list
Dec 23 19:53:35 an-c03n02 fence_virsh: Parse error: Ignoring unknown
option 'nodename=an-c03n01.alteeve.ca
Dec 23 19:53:52 an-c03n02 stonith-ng[1669]: notice: log_operation:
Operation 'reboot' [1767] (call 2 from crmd.1673) for host
'an-c03n01.alteeve.ca' with device 'fence_n01_virsh' returned: 0 (OK)
Dec 23 19:53:52 an-c03n02 stonith-ng[1669]: notice: remote_op_done:
Operation reboot of an-c03n01.alteeve.ca by an-c03n02.alteeve.ca for
crmd.1673 at an-c03n02.alteeve.ca.12d11de0: OK
Dec 23 19:53:52 an-c03n02 crmd[1673]: notice: tengine_stonith_callback:
Stonith operation 2/11:1:0:e2533a5d-933a-4c0b-bbba-ca59493a09bd: OK (0)
Dec 23 19:53:52 an-c03n02 crmd[1673]: notice: tengine_stonith_notify:
Peer an-c03n01.alteeve.ca was terminated (reboot) by
an-c03n02.alteeve.ca for an-c03n02.alteeve.ca: OK
(ref=12d11de0-ba58-4b28-b0ce-90069b49a177) by client crmd.1673
Dec 23 19:53:52 an-c03n02 crmd[1673]: notice: te_rsc_command: Initiating
action 6: start fence_n01_virsh_start_0 on an-c03n02.alteeve.ca (local)
Dec 23 19:53:52 an-c03n02 stonith-ng[1669]: notice:
stonith_device_register: Device 'fence_n01_virsh' already existed in
device list (2 active devices)
Dec 23 19:53:54 an-c03n02 crmd[1673]: notice: process_lrm_event: LRM
operation fence_n01_virsh_start_0 (call=12, rc=0, cib-update=46,
confirmed=true) ok
Dec 23 19:53:54 an-c03n02 crmd[1673]: notice: run_graph: Transition 1
(Complete=5, Pending=0, Fired=0, Skipped=1, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-warn-0.bz2): Stopped
Dec 23 19:53:54 an-c03n02 pengine[1672]: notice: unpack_config: On loss
of CCM Quorum: Ignore
Dec 23 19:53:54 an-c03n02 pengine[1672]: notice: process_pe_message:
Calculated Transition 2: /var/lib/pacemaker/pengine/pe-input-2.bz2
Dec 23 19:53:54 an-c03n02 crmd[1673]: notice: te_rsc_command: Initiating
action 7: monitor fence_n01_virsh_monitor_60000 on an-c03n02.alteeve.ca
(local)
Dec 23 19:53:55 an-c03n02 crmd[1673]: notice: process_lrm_event: LRM
operation fence_n01_virsh_monitor_60000 (call=13, rc=0, cib-update=48,
confirmed=false) ok
Dec 23 19:53:55 an-c03n02 crmd[1673]: notice: run_graph: Transition 2
(Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-2.bz2): Complete
Dec 23 19:53:55 an-c03n02 crmd[1673]: notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
====
Once the node booted back up, it was able to rejoin the surviving peer.
I've not tested much more yet, but so far as I can tell, that's already
an improvement.
So if the failure was caused by the VM name (as seen by virsh) not
matching the node's hostname, would that be a pacemaker or fence_virsh bug?
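For anyone following along, pcmk_host_map entries take the form node:port, with multiple entries separated by semicolons. A rough shell sketch of how such a string splits into pairs (illustrative only, not Pacemaker's actual parser; the hostnames and ports here are made up to match this thread):

```shell
# Split a pcmk_host_map-style string into node -> port pairs.
# Format assumed: "node1:port1;node2:port2"
host_map="an-c03n01.alteeve.ca:rhel7_01;an-c03n02.alteeve.ca:rhel7_02"

IFS=';' read -ra entries <<< "$host_map"
for entry in "${entries[@]}"; do
    node="${entry%%:*}"   # everything before the first colon
    port="${entry#*:}"    # everything after the first colon
    echo "$node -> $port"
done
# prints:
#   an-c03n01.alteeve.ca -> rhel7_01
#   an-c03n02.alteeve.ca -> rhel7_02
```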
Thanks for the help, fellow "what's a holiday?"er!
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?