[ClusterLabs] Antw: Re: Three VM's in cluster, running on multiple libvirt hosts, stonith not working

Mon Aug 3 22:17:07 EDT 2015

> On 3 Jun 2015, at 4:20 pm, Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de> wrote:
> 
>>>> Steve Dainard <sdainard at spd1.com> wrote on 02.06.2015 at 21:40 in message
> <CAEMJtDs3vq4UZtb1DJioGP3w-JaedqWW5vHPhMvf3Tj7mHB9ew at mail.gmail.com>:
>> Hi Ken,
>> 
>> I've tried configuring without pcmk_host_list as well, with the same result.
> 
> I can't help here, sorry. But: Is there a mechanism to trigger fencing of a specific node through the cluster manually? That would help testing, I guess.
> 
> What would be the command line to run a "fencing RA"?

You saw some below; they’re self-contained executables :-)
Just replace ‘list’ with ‘off’ or ‘reboot’ and name the guest with -H, e.g.

   fence_xvm -a 225.0.0.12 -k /etc/cluster/fence_xvm_ceph1.key -o reboot -H NFS1
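
For the other half of the question (fencing a node through the cluster rather than by calling the agent directly), here is a minimal sketch. It assumes the stock Pacemaker stonith_admin tool is installed and uses node1 from this thread as the target. The run wrapper and the DRY_RUN variable are hypothetical helpers added for illustration; by default nothing is executed and the commands are only printed.

```shell
#!/bin/sh
# Hypothetical helper: print the command unless DRY_RUN=0 is set explicitly.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Ask the cluster which registered fence devices claim they can fence a node,
# then request a reboot of that node through the fencing daemon.
fence_via_cluster() {
    node="$1"
    run stonith_admin --list "$node"
    run stonith_admin --reboot "$node"
}

fence_via_cluster node1
```

A request made this way should go through the same device lookup as the failing log further down, so it is a reasonable way to re-test after each configuration change.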

> 
> Regards,
> Ulrich
> 
>> 
>> Stonith Devices:
>> Resource: NFS1 (class=stonith type=fence_xvm)
>>  Attributes: key_file=/etc/cluster/fence_xvm_ceph1.key
>> multicast_address=225.0.0.12 port=NFS1
>>  Operations: monitor interval=20s (NFS1-monitor-interval-20s)
>> Resource: NFS2 (class=stonith type=fence_xvm)
>>  Attributes: key_file=/etc/cluster/fence_xvm_ceph2.key
>> multicast_address=225.0.1.12 port=NFS2
>>  Operations: monitor interval=20s (NFS2-monitor-interval-20s)
>> Resource: NFS3 (class=stonith type=fence_xvm)
>>  Attributes: key_file=/etc/cluster/fence_xvm_ceph3.key
>> multicast_address=225.0.2.12 port=NFS3
>>  Operations: monitor interval=20s (NFS3-monitor-interval-20s)
>> 
>> I can get the list of VMs from any of the 3 cluster nodes using the
>> multicast address:
>> 
>> # fence_xvm -a 225.0.0.12 -k /etc/cluster/fence_xvm_ceph1.key -o list
>> NFS1                 1814d93d-3e40-797f-a3c6-102aaa6a3d01 on
>> 
>> # fence_xvm -a 225.0.1.12 -k /etc/cluster/fence_xvm_ceph2.key -o list
>> NFS2                 75ab85fc-40e9-45ae-8b0a-c346d59b24e8 on
>> 
>> # fence_xvm -a 225.0.2.12 -k /etc/cluster/fence_xvm_ceph3.key -o list
>> NFS3                 f23cca5d-d50b-46d2-85dd-d8357337fd22 on
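
One thing these listings make visible: the hypervisors know the guests as NFS1..NFS3, while the cluster asks to fence node1..node3. Below is a rough sketch of that comparison, assuming the list output carries the domain name in its first column as shown above. domain_known is a hypothetical helper, and the sample text is copied from this thread rather than produced by running fence_xvm.

```shell
#!/bin/sh
# Sample `fence_xvm -o list` output, copied from the message above.
list_output='NFS1                 1814d93d-3e40-797f-a3c6-102aaa6a3d01 on'

# Hypothetical helper: succeed if NAME ($1) appears in the first column
# of the list output ($2).
domain_known() {
    printf '%s\n' "$2" | awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

for name in node1 NFS1; do
    if domain_known "$name" "$list_output"; then
        echo "$name: known to fence_virtd"
    else
        echo "$name: not in list output"
    fi
done
```

If the cluster node name is not among the listed domains, the device will probably need an explicit mapping between the two names.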
>> 
>> On Tue, Jun 2, 2015 at 10:07 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
>> 
>>> On 06/02/2015 11:40 AM, Steve Dainard wrote:
>>>> Hello,
>>>> 
>>>> I have 3 CentOS7 guests running on 3 CentOS7 hypervisors and I can't get
>>>> stonith operations to work.
>>>> 
>>>> Config:
>>>> 
>>>> Cluster Name: nfs
>>>> Corosync Nodes:
>>>> node1 node2 node3
>>>> Pacemaker Nodes:
>>>> node1 node2 node3
>>>> 
>>>> Resources:
>>>> Group: group_rbd_fs_nfs_vip
>>>>  Resource: rbd_nfs-ha (class=ocf provider=ceph type=rbd.in)
>>>>   Attributes: user=admin pool=rbd name=nfs-ha cephconf=/etc/ceph/ceph.conf
>>>>   Operations: start interval=0s timeout=20 (rbd_nfs-ha-start-timeout-20)
>>>>               stop interval=0s timeout=20 (rbd_nfs-ha-stop-timeout-20)
>>>>               monitor interval=10s timeout=20s
>>>> (rbd_nfs-ha-monitor-interval-10s)
>>>>  Resource: rbd_home (class=ocf provider=ceph type=rbd.in)
>>>>   Attributes: user=admin pool=rbd name=home cephconf=/etc/ceph/ceph.conf
>>>>   Operations: start interval=0s timeout=20 (rbd_home-start-timeout-20)
>>>>               stop interval=0s timeout=20 (rbd_home-stop-timeout-20)
>>>>               monitor interval=10s timeout=20s
>>>> (rbd_home-monitor-interval-10s)
>>>>  Resource: fs_nfs-ha (class=ocf provider=heartbeat type=Filesystem)
>>>>   Attributes: directory=/mnt/nfs-ha fstype=btrfs
>>>> device=/dev/rbd/rbd/nfs-ha fast_stop=no
>>>>   Operations: monitor interval=20s timeout=40s
>>>> (fs_nfs-ha-monitor-interval-20s)
>>>>               start interval=0 timeout=60s (fs_nfs-ha-start-interval-0)
>>>>               stop interval=0 timeout=60s (fs_nfs-ha-stop-interval-0)
>>>>  Resource: FS_home (class=ocf provider=heartbeat type=Filesystem)
>>>>   Attributes: directory=/mnt/home fstype=btrfs device=/dev/rbd/rbd/home
>>>> options=rw,compress-force=lzo fast_stop=no
>>>>   Operations: monitor interval=20s timeout=40s
>>>> (FS_home-monitor-interval-20s)
>>>>               start interval=0 timeout=60s (FS_home-start-interval-0)
>>>>               stop interval=0 timeout=60s (FS_home-stop-interval-0)
>>>>  Resource: nfsserver (class=ocf provider=heartbeat type=nfsserver)
>>>>   Attributes: nfs_shared_infodir=/mnt/nfs-ha
>>>>   Operations: stop interval=0s timeout=20s (nfsserver-stop-timeout-20s)
>>>>               monitor interval=10s timeout=20s
>>>> (nfsserver-monitor-interval-10s)
>>>>               start interval=0 timeout=40s (nfsserver-start-interval-0)
>>>>  Resource: vip_nfs_private (class=ocf provider=heartbeat type=IPaddr)
>>>>   Attributes: ip=10.0.231.49 cidr_netmask=24
>>>>   Operations: start interval=0s timeout=20s
>>>> (vip_nfs_private-start-timeout-20s)
>>>>               stop interval=0s timeout=20s
>>>> (vip_nfs_private-stop-timeout-20s)
>>>>               monitor interval=5 (vip_nfs_private-monitor-interval-5)
>>>> 
>>>> Stonith Devices:
>>>> Resource: NFS1 (class=stonith type=fence_xvm)
>>>>  Attributes: pcmk_host_list=10.0.231.50
>>>> key_file=/etc/cluster/fence_xvm_ceph1.key multicast_address=225.0.0.12
>>>> port=NFS1
>>>>  Operations: monitor interval=20s (NFS1-monitor-interval-20s)
>>>> Resource: NFS2 (class=stonith type=fence_xvm)
>>>>  Attributes: pcmk_host_list=10.0.231.51
>>>> key_file=/etc/cluster/fence_xvm_ceph2.key multicast_address=225.0.1.12
>>>> port=NFS2
>>>>  Operations: monitor interval=20s (NFS2-monitor-interval-20s)
>>>> Resource: NFS3 (class=stonith type=fence_xvm)
>>>>  Attributes: pcmk_host_list=10.0.231.52
>>>> key_file=/etc/cluster/fence_xvm_ceph3.key multicast_address=225.0.2.12
>>>> port=NFS3
>>> 
>>> I think pcmk_host_list should have the node name rather than the IP
>>> address. If fence_xvm -o list -a whatever shows the right nodes to
>>> fence, you don't even need to set pcmk_host_list.
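
Building on the point about node names: the cluster fences node1..node3, while the fence_xvm devices act on domains NFS1..NFS3. Below is a sketch of one way to tie them together, assuming those names match the config above. pcmk_host_map is the Pacemaker fencing parameter that maps a node name to the port (here, the domain) the agent should act on. The loop only prints the pcs commands so they can be reviewed first. Whether this is the right fix for this particular cluster is an assumption.

```shell
#!/bin/sh
# Print (not run) one pcs command per stonith device, mapping nodeN -> NFSN.
print_host_map_updates() {
    for i in 1 2 3; do
        echo "pcs stonith update NFS$i pcmk_host_map=node$i:NFS$i"
    done
}

print_host_map_updates
```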
>>> 
>>>>  Operations: monitor interval=20s (NFS3-monitor-interval-20s)
>>>> Fencing Levels:
>>>> 
>>>> Location Constraints:
>>>>  Resource: NFS1
>>>>    Enabled on: node1 (score:1) (id:location-NFS1-node1-1)
>>>>    Enabled on: node2 (score:1000) (id:location-NFS1-node2-1000)
>>>>    Enabled on: node3 (score:500) (id:location-NFS1-node3-500)
>>>>  Resource: NFS2
>>>>    Enabled on: node2 (score:1) (id:location-NFS2-node2-1)
>>>>    Enabled on: node3 (score:1000) (id:location-NFS2-node3-1000)
>>>>    Enabled on: node1 (score:500) (id:location-NFS2-node1-500)
>>>>  Resource: NFS3
>>>>    Enabled on: node3 (score:1) (id:location-NFS3-node3-1)
>>>>    Enabled on: node1 (score:1000) (id:location-NFS3-node1-1000)
>>>>    Enabled on: node2 (score:500) (id:location-NFS3-node2-500)
>>>> Ordering Constraints:
>>>> Colocation Constraints:
>>>> 
>>>> Cluster Properties:
>>>> cluster-infrastructure: corosync
>>>> cluster-name: nfs
>>>> dc-version: 1.1.12-a14efad
>>>> have-watchdog: false
>>>> stonith-enabled: true
>>>> 
>>>> When I stop networking services on node1 (stonith resource NFS1) I see logs
>>>> on the other two cluster nodes attempting to reboot the VM NFS1 without
>>>> success.
>>>> 
>>>> Logs:
>>>> 
>>>> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca    pengine:   notice: LogActions:     Move    rbd_nfs-ha      (Started node1 -> node2)
>>>> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca    pengine:   notice: LogActions:     Move    rbd_home        (Started node1 -> node2)
>>>> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca    pengine:   notice: LogActions:     Move    fs_nfs-ha       (Started node1 -> node2)
>>>> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca    pengine:   notice: LogActions:     Move    FS_home (Started node1 -> node2)
>>>> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca    pengine:   notice: LogActions:     Move    nfsserver       (Started node1 -> node2)
>>>> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca    pengine:   notice: LogActions:     Move    vip_nfs_private (Started node1 -> node2)
>>>> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca    pengine:     info: LogActions:     Leave   NFS1    (Started node2)
>>>> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca    pengine:     info: LogActions:     Leave   NFS2    (Started node3)
>>>> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca    pengine:   notice: LogActions:     Move    NFS3    (Started node1 -> node2)
>>>> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca    pengine:  warning: process_pe_message:      Calculated Transition 8: /var/lib/pacemaker/pengine/pe-warn-0.bz2
>>>> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca       crmd:     info: do_state_transition:     State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
>>>> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca       crmd:     info: do_te_invoke:    Processing graph 8 (ref=pe_calc-dc-1433198297-78) derived from /var/lib/pacemaker/pengine/pe-warn-0.bz2
>>>> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca       crmd:   notice: te_fence_node:   Executing reboot fencing operation (37) on node1 (timeout=60000)
>>>> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng:   notice: handle_request:  Client crmd.2131.f7e79b61 wants to fence (reboot) 'node1' with device '(any)'
>>>> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng:   notice: initiate_remote_stonith_op:      Initiating remote operation reboot for node1: a22a16f3-b699-453e-a090-43a640dd0e3f (0)
>>>> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng:   notice: can_fence_host_with_device:      NFS1 can not fence (reboot) node1: static-list
>>>> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng:   notice: can_fence_host_with_device:      NFS2 can not fence (reboot) node1: static-list
>>>> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng:   notice: can_fence_host_with_device:      NFS3 can not fence (reboot) node1: static-list
>>>> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng:     info: process_remote_stonith_query:    All queries have arrived, continuing (2, 2, 2, a22a16f3-b699-453e-a090-43a640dd0e3f)
>>>> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng:   notice: stonith_choose_peer:     Couldn't find anyone to fence node1 with <any>
>>>> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng:     info: call_remote_stonith:     Total remote op timeout set to 60 for fencing of node node1 for crmd.2131.a22a16f3
>>>> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng:     info: call_remote_stonith:     None of the 2 peers have devices capable of terminating node1 for crmd.2131 (0)
>>>> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng:    error: remote_op_done:  Operation reboot of node1 by <no-one> for crmd.2131@node3.a22a16f3: No such device
>>>> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca       crmd:   notice: tengine_stonith_callback:        Stonith operation 2/37:8:0:241ee032-f3a1-4c2b-8427-63af83b54343: No such device (-19)
>>>> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca       crmd:   notice: tengine_stonith_callback:        Stonith operation 2 for node1 failed (No such device): aborting transition.
>>>> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca       crmd:   notice: abort_transition_graph:  Transition aborted: Stonith failed (source=tengine_stonith_callback:697, 0)
>>>> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca       crmd:   notice: tengine_stonith_notify:  Peer node1 was not terminated (reboot) by <anyone> for node3: No such device (ref=a22a16f3-b699-453e-a090-43a640dd0e3f) by client crmd.2131
>>>> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca       crmd:   notice: run_graph:    Transition 8 (Complete=1, Pending=0, Fired=0, Skipped=27, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-0.bz2): Stopped
>>>> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca       crmd:   notice: too_many_st_failures:    No devices found in cluster to fence node1, giving up
>>>> 
>>>> I can manually fence a guest without any issue:
>>>> # fence_xvm -a 225.0.0.12 -k /etc/cluster/fence_xvm_ceph1.key -o reboot -H NFS1
>>>> 
>>>> But the cluster doesn't recover resources to another host:
>>> 
>>> The cluster doesn't know that the manual fencing succeeded, so it plays
>>> it safe by not moving resources. If you fix the cluster fencing issue,
>>> I'd expect this to work.
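
If a node really has been powered off by hand, Pacemaker can be told so. Here is a cautious sketch, assuming stonith_admin --confirm (or the equivalent pcs stonith confirm) is available in this version. The confirm_manual_fence wrapper and its CONFIRMED guard are hypothetical. Acknowledging a node that is in fact still running risks data corruption, so the sketch refuses unless the operator states explicitly that the node is down, and even then it only prints the command.

```shell
#!/bin/sh
# Hypothetical wrapper: print the acknowledgement command only when the
# operator has set CONFIRMED=yes after verifying the node is powered off.
confirm_manual_fence() {
    node="$1"
    if [ "${CONFIRMED:-no}" != "yes" ]; then
        echo "refusing: set CONFIRMED=yes once $node is verified to be down" >&2
        return 1
    fi
    echo "stonith_admin --confirm $node"
}

confirm_manual_fence node1 || echo "nothing sent to the cluster"
```

This is a workaround for a one-off manual intervention, not a substitute for working cluster fencing.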
>>> 
>>>> # pcs status    <-- after manual fencing
>>>> Cluster name: nfs
>>>> Last updated: Tue Jun  2 08:34:18 2015
>>>> Last change: Mon Jun  1 16:02:58 2015
>>>> Stack: corosync
>>>> Current DC: node3 (3) - partition with quorum
>>>> Version: 1.1.12-a14efad
>>>> 3 Nodes configured
>>>> 9 Resources configured
>>>> 
>>>> 
>>>> Node node1 (1): UNCLEAN (offline)
>>>> Online: [ node2 node3 ]
>>>> 
>>>> Full list of resources:
>>>> 
>>>> Resource Group: group_rbd_fs_nfs_vip
>>>>     rbd_nfs-ha (ocf::ceph:rbd.in):     Started node1
>>>>     rbd_home   (ocf::ceph:rbd.in):     Started node1
>>>>     fs_nfs-ha  (ocf::heartbeat:Filesystem):    Started node1
>>>>     FS_home    (ocf::heartbeat:Filesystem):    Started node1
>>>>     nfsserver  (ocf::heartbeat:nfsserver):     Started node1
>>>>     vip_nfs_private    (ocf::heartbeat:IPaddr):        Started node1
>>>> NFS1   (stonith:fence_xvm):    Started node2
>>>> NFS2   (stonith:fence_xvm):    Started node3
>>>> NFS3   (stonith:fence_xvm):    Started node1
>>>> 
>>>> PCSD Status:
>>>>  node1: Online
>>>>  node2: Online
>>>>  node3: Online
>>>> 
>>>> Daemon Status:
>>>>  corosync: active/disabled
>>>>  pacemaker: active/disabled
>>>>  pcsd: active/enabled
>>>> 
>>>> Fence_virtd config on one of the hypervisors:
>>>> # cat fence_virt.conf
>>>> backends {
>>>>        libvirt {
>>>>                uri = "qemu:///system";
>>>>        }
>>>> 
>>>> }
>>>> 
>>>> listeners {
>>>>        multicast {
>>>>                port = "1229";
>>>>                family = "ipv4";
>>>>                interface = "br1";
>>>>                address = "225.0.0.12";
>>>>                key_file = "/etc/cluster/fence_xvm_ceph1.key";
>>>>        }
>>>> 
>>>> }
>>>> 
>>>> fence_virtd {
>>>>        module_path = "/usr/lib64/fence-virt";
>>>>        backend = "libvirt";
>>>>        listener = "multicast";
>>>> }
>>> 
>>> 
>>> _______________________________________________
>>> Users mailing list: Users at clusterlabs.org 
>>> http://clusterlabs.org/mailman/listinfo/users 
>>> 
>>> Project Home: http://www.clusterlabs.org 
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>>> Bugs: http://bugs.clusterlabs.org 
>>> 