[Pacemaker] STONITH is not performed after stonithd reboots

Kazunori INOUE inouekazu at intellilink.co.jp
Wed Aug 1 05:26:19 EDT 2012


Hi,

This problem has not been fixed yet (as of 2012 Jul 29, commit 33119da31c).
When stonithd is terminated abnormally, shouldn't crmd restart, just as
it does when lrmd is terminated?

The following patch restarts crmd if the connection to stonithd breaks.
I confirmed that it fixes this problem, but I cannot fully grasp the
extent of its impact...

[root@dev1 pacemaker]# git diff
diff --git a/crmd/te_utils.c b/crmd/te_utils.c
index f6a7550..deb4513 100644
--- a/crmd/te_utils.c
+++ b/crmd/te_utils.c
@@ -83,6 +83,7 @@ tengine_stonith_connection_destroy(stonith_t * st, stonith_event_t *e)
  {
      if (is_set(fsa_input_register, R_ST_REQUIRED)) {
          crm_crit("Fencing daemon connection failed");
+        register_fsa_input(C_FSA_INTERNAL, I_ERROR, NULL);
          mainloop_set_trigger(stonith_reconnect);

      } else {
[root@dev1 pacemaker]#
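
For reference, here is a minimal, self-contained sketch (not Pacemaker
source) of the pattern the one-line patch relies on: when the connection
to a still-required daemon is destroyed, the destroy callback queues an
internal error input, and the main loop escalates by restarting the
daemon. The names fsa_input_t, queue_fsa_input and fencing_required are
invented for illustration; the real code uses
register_fsa_input(C_FSA_INTERNAL, I_ERROR, NULL) and the R_ST_REQUIRED
flag shown in the diff above.

/* Illustrative sketch only -- not Pacemaker source. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { I_NULL, I_ERROR } fsa_input_t;   /* simplified FSA inputs */

static bool fencing_required = true;            /* stands in for R_ST_REQUIRED */
static fsa_input_t pending_input = I_NULL;

/* Stand-in for register_fsa_input(): remember the input for the main loop. */
static void queue_fsa_input(fsa_input_t input)
{
    pending_input = input;
}

/* Analogue of tengine_stonith_connection_destroy(): called when the
 * connection to the fencing daemon drops. */
static void stonith_connection_destroy(void)
{
    if (fencing_required) {
        fprintf(stderr, "crit: Fencing daemon connection failed\n");
        queue_fsa_input(I_ERROR);   /* the line the patch adds */
        /* ... the existing code also re-arms the reconnect trigger here */
    } else {
        fprintf(stderr, "info: Fencing daemon connection failed (not required)\n");
    }
}

int main(void)
{
    stonith_connection_destroy();

    /* crmd's main loop would notice I_ERROR and escalate, e.g. restart
     * itself so stonithd is reconnected and devices get re-registered. */
    if (pending_input == I_ERROR) {
        printf("FSA saw I_ERROR: escalate and restart the daemon\n");
    }
    return 0;
}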

Best regards,
Kazunori INOUE

(12.05.09 16:11), Andrew Beekhof wrote:
> On Mon, May 7, 2012 at 7:52 PM, Kazunori INOUE
> <inouekazu at intellilink.co.jp> wrote:
>> Hi,
>>
>> On the Pacemaker-1.1 + Corosync stack, although stonithd restarts
>> after an abnormal termination, STONITH is not performed afterwards.
>>
>> I am using the latest devel code:
>> - pacemaker : db5e16736cc2682fbf37f81cd47be7d17d5a2364
>> - corosync  : 88dd3e1eeacd64701d665f10acbc40f3795dd32f
>> - glue      : 2686:66d5f0c135c9
>>
>>
>> * 0. cluster's state.
>>
>>   [root@vm1 ~]# crm_mon -r1
>>   ============
>>   Last updated: Wed May  2 16:07:29 2012
>>   Last change: Wed May  2 16:06:35 2012 via cibadmin on vm1
>>   Stack: corosync
>>   Current DC: vm1 (1) - partition WITHOUT quorum
>>   Version: 1.1.7-db5e167
>>   2 Nodes configured, unknown expected votes
>>   3 Resources configured.
>>   ============
>>
>>   Online: [ vm1 vm2 ]
>>
>>   Full list of resources:
>>
>>   prmDummy       (ocf::pacemaker:Dummy): Started vm2
>>   prmStonith1    (stonith:external/libvirt):     Started vm2
>>   prmStonith2    (stonith:external/libvirt):     Started vm1
>>
>>   [root@vm1 ~]# crm configure show
>>   node $id="1" vm1
>>   node $id="2" vm2
>>   primitive prmDummy ocf:pacemaker:Dummy \
>>          op start interval="0s" timeout="60s" on-fail="restart" \
>>          op monitor interval="10s" timeout="60s" on-fail="fence" \
>>          op stop interval="0s" timeout="60s" on-fail="stop"
>>   primitive prmStonith1 stonith:external/libvirt \
>>          params hostlist="vm1" hypervisor_uri="qemu+ssh://f/system" \
>>          op start interval="0s" timeout="60s" \
>>          op monitor interval="3600s" timeout="60s" \
>>          op stop interval="0s" timeout="60s"
>>   primitive prmStonith2 stonith:external/libvirt \
>>          params hostlist="vm2" hypervisor_uri="qemu+ssh://g/system" \
>>          op start interval="0s" timeout="60s" \
>>          op monitor interval="3600s" timeout="60s" \
>>          op stop interval="0s" timeout="60s"
>>   location rsc_location-prmDummy prmDummy \
>>          rule $id="rsc_location-prmDummy-rule" 200: #uname eq vm2
>>   location rsc_location-prmStonith1 prmStonith1 \
>>          rule $id="rsc_location-prmStonith1-rule" 200: #uname eq vm2 \
>>          rule $id="rsc_location-prmStonith1-rule-0" -inf: #uname eq vm1
>>   location rsc_location-prmStonith2 prmStonith2 \
>>          rule $id="rsc_location-prmStonith2-rule" 200: #uname eq vm1 \
>>          rule $id="rsc_location-prmStonith2-rule-0" -inf: #uname eq vm2
>>   property $id="cib-bootstrap-options" \
>>          dc-version="1.1.7-db5e167" \
>>          cluster-infrastructure="corosync" \
>>          no-quorum-policy="ignore" \
>>          stonith-enabled="true" \
>>          startup-fencing="false" \
>>          stonith-timeout="120s"
>>   rsc_defaults $id="rsc-options" \
>>          resource-stickiness="INFINITY" \
>>          migration-threshold="1"
>>
>>
>> * 1. terminate stonithd forcibly.
>>
>>   [root@vm1 ~]# pkill -9 stonithd
>>
>>
>> * 2. I trigger STONITH, but stonithd reports that no matching device
>>    is found and does not fence the node.
>>
>>   [root@vm1 ~]# ssh vm2 'rm /var/run/Dummy-prmDummy.state'
>>   [root@vm1 ~]# grep Found /var/log/ha-debug
>>   May  2 16:13:07 vm1 stonith-ng[15115]:    debug: stonith_query: Found 0 matching devices for 'vm2'
>>   May  2 16:13:19 vm1 stonith-ng[15115]:    debug: stonith_query: Found 0 matching devices for 'vm2'
>>   May  2 16:13:31 vm1 stonith-ng[15115]:    debug: stonith_query: Found 0 matching devices for 'vm2'
>>   May  2 16:13:43 vm1 stonith-ng[15115]:    debug: stonith_query: Found 0 matching devices for 'vm2'
>>   (snip)
>>   [root@vm1 ~]#
>>
>>
>> After stonithd restarts, it seems that the STONITH resource or lrmd
>> needs to be restarted... is this the intended behavior?
>
> No, that sounds like a bug.
>
>>
>>   # crm resource restart <STONITH resource (prmStonith2)>
>>   or
>>   # /usr/lib64/heartbeat/lrmd -r  (on the node which stonithd rebooted)
>>
>> ----
>> Best regards,
>> Kazunori INOUE
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



