[Pacemaker] STONITH is not performed after stonithd reboots
Andrew Beekhof
andrew at beekhof.net
Wed May 9 07:11:55 UTC 2012
On Mon, May 7, 2012 at 7:52 PM, Kazunori INOUE
<inouekazu at intellilink.co.jp> wrote:
> Hi,
>
> On the Pacemaker-1.1 + Corosync stack, after stonithd is restarted
> following an abnormal termination, STONITH is no longer performed.
>
> I am using the latest devel code:
> - pacemaker : db5e16736cc2682fbf37f81cd47be7d17d5a2364
> - corosync : 88dd3e1eeacd64701d665f10acbc40f3795dd32f
> - glue : 2686:66d5f0c135c9
>
>
> * 0. cluster's state.
>
> [root at vm1 ~]# crm_mon -r1
> ============
> Last updated: Wed May 2 16:07:29 2012
> Last change: Wed May 2 16:06:35 2012 via cibadmin on vm1
> Stack: corosync
> Current DC: vm1 (1) - partition WITHOUT quorum
> Version: 1.1.7-db5e167
> 2 Nodes configured, unknown expected votes
> 3 Resources configured.
> ============
>
> Online: [ vm1 vm2 ]
>
> Full list of resources:
>
> prmDummy (ocf::pacemaker:Dummy): Started vm2
> prmStonith1 (stonith:external/libvirt): Started vm2
> prmStonith2 (stonith:external/libvirt): Started vm1
>
> [root at vm1 ~]# crm configure show
> node $id="1" vm1
> node $id="2" vm2
> primitive prmDummy ocf:pacemaker:Dummy \
> op start interval="0s" timeout="60s" on-fail="restart" \
> op monitor interval="10s" timeout="60s" on-fail="fence" \
> op stop interval="0s" timeout="60s" on-fail="stop"
> primitive prmStonith1 stonith:external/libvirt \
> params hostlist="vm1" hypervisor_uri="qemu+ssh://f/system" \
> op start interval="0s" timeout="60s" \
> op monitor interval="3600s" timeout="60s" \
> op stop interval="0s" timeout="60s"
> primitive prmStonith2 stonith:external/libvirt \
> params hostlist="vm2" hypervisor_uri="qemu+ssh://g/system" \
> op start interval="0s" timeout="60s" \
> op monitor interval="3600s" timeout="60s" \
> op stop interval="0s" timeout="60s"
> location rsc_location-prmDummy prmDummy \
> rule $id="rsc_location-prmDummy-rule" 200: #uname eq vm2
> location rsc_location-prmStonith1 prmStonith1 \
> rule $id="rsc_location-prmStonith1-rule" 200: #uname eq vm2 \
> rule $id="rsc_location-prmStonith1-rule-0" -inf: #uname eq vm1
> location rsc_location-prmStonith2 prmStonith2 \
> rule $id="rsc_location-prmStonith2-rule" 200: #uname eq vm1 \
> rule $id="rsc_location-prmStonith2-rule-0" -inf: #uname eq vm2
> property $id="cib-bootstrap-options" \
> dc-version="1.1.7-db5e167" \
> cluster-infrastructure="corosync" \
> no-quorum-policy="ignore" \
> stonith-enabled="true" \
> startup-fencing="false" \
> stonith-timeout="120s"
> rsc_defaults $id="rsc-options" \
> resource-stickiness="INFINITY" \
> migration-threshold="1"
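
Side note on the configuration: each fencing device is pinned to the peer of
the node it fences, so prmStonith2 (hostlist="vm2") runs on vm1. Before
breaking anything, a baseline check along these lines should presumably show
it as able to fence vm2 (assuming this build's stonith_admin supports
-l/--list):

[root at vm1 ~]# stonith_admin -l vm2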
>
>
> * 1. terminate stonithd forcibly.
>
> [root at vm1 ~]# pkill -9 stonithd
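
For reference, pacemakerd should respawn the daemon automatically; something
like the following confirms a new stonith-ng instance is running (its PID,
15115, also shows up in the log excerpt below):

[root at vm1 ~]# ps -ef | grep [s]tonith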
>
>
> * 2. I trigger STONITH, but stonithd reports that no device is found and
> does not perform the fencing.
>
> [root at vm1 ~]# ssh vm2 'rm /var/run/Dummy-prmDummy.state'
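
For context, removing the state file makes the next recurring monitor of
prmDummy on vm2 fail, and since that monitor has on-fail="fence" the DC
should then attempt to fence vm2. The monitor failure itself can be
confirmed with something along the lines of:

[root at vm1 ~]# crm_mon -1 -f

which should report a non-zero fail count for prmDummy on vm2.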
> [root at vm1 ~]# grep Found /var/log/ha-debug
> May 2 16:13:07 vm1 stonith-ng[15115]: debug: stonith_query: Found 0 matching devices for 'vm2'
> May 2 16:13:19 vm1 stonith-ng[15115]: debug: stonith_query: Found 0 matching devices for 'vm2'
> May 2 16:13:31 vm1 stonith-ng[15115]: debug: stonith_query: Found 0 matching devices for 'vm2'
> May 2 16:13:43 vm1 stonith-ng[15115]: debug: stonith_query: Found 0 matching devices for 'vm2'
> (snip)
> [root at vm1 ~]#
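
That log suggests the respawned stonith-ng on vm1 has an empty device list,
which is why the query for a device able to fence vm2 finds nothing. If so,
listing the devices registered with the local daemon should also come back
empty (assuming this stonith_admin has -L/--list-registered):

[root at vm1 ~]# stonith_admin -L

whereas the same command before the pkill would presumably have shown
prmStonith2.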
>
>
> After stonithd restarts, it seems that the STONITH resource or lrmd needs
> to be restarted. Is this the designed behavior?
No, that sounds like a bug.
>
> # crm resource restart <STONITH resource (prmStonith2)>
> or
> # /usr/lib64/heartbeat/lrmd -r (on the node where stonithd was restarted)
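
As far as I understand it, both of those work because they re-run the
stonith resource's start action, and it is the start action that registers
the device with the local stonith-ng; after the respawn nothing re-registers
it, since the resource is still considered started. After applying either
workaround, something like

[root at vm1 ~]# stonith_admin -L

on vm1 should list prmStonith2 again.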
>
> ----
> Best regards,
> Kazunori INOUE
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org