[Pacemaker] STONITH is not performed after stonithd reboots
Andrew Beekhof
andrew at beekhof.net
Tue Aug 7 11:30:02 UTC 2012
On Wed, Aug 1, 2012 at 7:26 PM, Kazunori INOUE
<inouekazu at intellilink.co.jp> wrote:
> Hi,
>
> This problem has not been fixed yet. (2012 Jul 29, 33119da31c)
> When stonithd terminates abnormally, shouldn't crmd restart, just as it
> does when lrmd terminates?
>
> The following patch restarts crmd if the connection to stonithd breaks.
> I confirmed that it fixes the problem, but I cannot grasp the full
> extent of its impact...
It's quite severe :-)
I'd like to see if we can come up with something a little less brutal.
Could you file a bugzilla for me please?
>
> [root at dev1 pacemaker]# git diff
> diff --git a/crmd/te_utils.c b/crmd/te_utils.c
> index f6a7550..deb4513 100644
> --- a/crmd/te_utils.c
> +++ b/crmd/te_utils.c
> @@ -83,6 +83,7 @@ tengine_stonith_connection_destroy(stonith_t * st, stonith_event_t *e)
>  {
>      if (is_set(fsa_input_register, R_ST_REQUIRED)) {
>          crm_crit("Fencing daemon connection failed");
> +        register_fsa_input(C_FSA_INTERNAL, I_ERROR, NULL);
>          mainloop_set_trigger(stonith_reconnect);
>
>      } else {
> [root at dev1 pacemaker]#
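>
> (For context: the added register_fsa_input() line makes crmd treat the
> lost stonithd connection as an internal error, which is what triggers
> the crmd restart mentioned above. If anyone wants to try it, saving the
> diff above as, say, crmd-stonith-reconnect.patch and rebuilding should
> be enough; the steps below are just the usual Pacemaker build procedure,
> adjust as needed for your environment:)
>
>   git apply crmd-stonith-reconnect.patch
>   ./autogen.sh && ./configure && make
>   make install        # on the test node, as root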
>
> Best regards,
> Kazunori INOUE
>
>
> (12.05.09 16:11), Andrew Beekhof wrote:
>>
>> On Mon, May 7, 2012 at 7:52 PM, Kazunori INOUE
>> <inouekazu at intellilink.co.jp> wrote:
>>>
>>> Hi,
>>>
>>> On the Pacemaker-1.1 + Corosync stack, although stonithd restarts
>>> after terminating abnormally, STONITH is not performed afterwards.
>>>
>>> I am using the latest development code:
>>> - pacemaker : db5e16736cc2682fbf37f81cd47be7d17d5a2364
>>> - corosync : 88dd3e1eeacd64701d665f10acbc40f3795dd32f
>>> - glue : 2686:66d5f0c135c9
>>>
>>>
>>> * 0. cluster's state.
>>>
>>> [root at vm1 ~]# crm_mon -r1
>>> ============
>>> Last updated: Wed May 2 16:07:29 2012
>>> Last change: Wed May 2 16:06:35 2012 via cibadmin on vm1
>>> Stack: corosync
>>> Current DC: vm1 (1) - partition WITHOUT quorum
>>> Version: 1.1.7-db5e167
>>> 2 Nodes configured, unknown expected votes
>>> 3 Resources configured.
>>> ============
>>>
>>> Online: [ vm1 vm2 ]
>>>
>>> Full list of resources:
>>>
>>> prmDummy (ocf::pacemaker:Dummy): Started vm2
>>> prmStonith1 (stonith:external/libvirt): Started vm2
>>> prmStonith2 (stonith:external/libvirt): Started vm1
>>>
>>> [root at vm1 ~]# crm configure show
>>> node $id="1" vm1
>>> node $id="2" vm2
>>> primitive prmDummy ocf:pacemaker:Dummy \
>>> op start interval="0s" timeout="60s" on-fail="restart" \
>>> op monitor interval="10s" timeout="60s" on-fail="fence" \
>>> op stop interval="0s" timeout="60s" on-fail="stop"
>>> primitive prmStonith1 stonith:external/libvirt \
>>> params hostlist="vm1" hypervisor_uri="qemu+ssh://f/system" \
>>> op start interval="0s" timeout="60s" \
>>> op monitor interval="3600s" timeout="60s" \
>>> op stop interval="0s" timeout="60s"
>>> primitive prmStonith2 stonith:external/libvirt \
>>> params hostlist="vm2" hypervisor_uri="qemu+ssh://g/system" \
>>> op start interval="0s" timeout="60s" \
>>> op monitor interval="3600s" timeout="60s" \
>>> op stop interval="0s" timeout="60s"
>>> location rsc_location-prmDummy prmDummy \
>>> rule $id="rsc_location-prmDummy-rule" 200: #uname eq vm2
>>> location rsc_location-prmStonith1 prmStonith1 \
>>> rule $id="rsc_location-prmStonith1-rule" 200: #uname eq vm2 \
>>> rule $id="rsc_location-prmStonith1-rule-0" -inf: #uname eq vm1
>>> location rsc_location-prmStonith2 prmStonith2 \
>>> rule $id="rsc_location-prmStonith2-rule" 200: #uname eq vm1 \
>>> rule $id="rsc_location-prmStonith2-rule-0" -inf: #uname eq vm2
>>> property $id="cib-bootstrap-options" \
>>> dc-version="1.1.7-db5e167" \
>>> cluster-infrastructure="corosync" \
>>> no-quorum-policy="ignore" \
>>> stonith-enabled="true" \
>>> startup-fencing="false" \
>>> stonith-timeout="120s"
>>> rsc_defaults $id="rsc-options" \
>>> resource-stickiness="INFINITY" \
>>> migration-threshold="1"
>>>
>>>
>>> * 1. terminate stonithd forcibly.
>>>
>>> [root at vm1 ~]# pkill -9 stonithd
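>>>
>>> (To confirm that the daemon was respawned before going further, something
>>> like the following can be used; depending on the build the process may
>>> show up as stonithd or stonith-ng:)
>>>
>>> [root at vm1 ~]# pgrep -lf stonith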
>>>
>>>
>>> * 2. I trigger STONITH, but stonithd reports that no matching device is
>>> found and the node is not fenced.
>>>
>>> [root at vm1 ~]# ssh vm2 'rm /var/run/Dummy-prmDummy.state'
>>> [root at vm1 ~]# grep Found /var/log/ha-debug
>>> May 2 16:13:07 vm1 stonith-ng[15115]: debug: stonith_query: Found 0
>>> matching devices for 'vm2'
>>> May 2 16:13:19 vm1 stonith-ng[15115]: debug: stonith_query: Found 0
>>> matching devices for 'vm2'
>>> May 2 16:13:31 vm1 stonith-ng[15115]: debug: stonith_query: Found 0
>>> matching devices for 'vm2'
>>> May 2 16:13:43 vm1 stonith-ng[15115]: debug: stonith_query: Found 0
>>> matching devices for 'vm2'
>>> (snip)
>>> [root at vm1 ~]#
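>>>
>>> (A quick way to see which devices stonith-ng knows about after the
>>> respawn is stonith_admin, assuming this build supports these options;
>>> after the respawn I would expect no registered devices here, matching
>>> the "Found 0 matching devices" messages above:)
>>>
>>> [root at vm1 ~]# stonith_admin --list-registered
>>> [root at vm1 ~]# stonith_admin -l vm2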
>>>
>>>
>>> After stonithd restarts, it seems that the STONITH resource or lrmd
>>> needs to be restarted... is this the designed behavior?
>>
>>
>> No, that sounds like a bug.
>>
>>>
>>> # crm resource restart <STONITH resource (prmStonith2)>
>>> or
>>> # /usr/lib64/heartbeat/lrmd -r (on the node where stonithd was restarted)
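>>>
>>> (After either of the above, stonith_admin --list-registered on vm1
>>> should show prmStonith2 again, assuming that option is available.)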
>>>
>>> ----
>>> Best regards,
>>> Kazunori INOUE
>>>
>>
>>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org