[Pacemaker] STONITH is not performed after stonithd reboots
Kazunori INOUE
inouekazu at intellilink.co.jp
Wed Aug 8 07:59:42 UTC 2012
Thanks, Andrew.
I have opened a Bugzilla entry for this:
* http://bugs.clusterlabs.org/show_bug.cgi?id=5094
Best Regards,
Kazunori INOUE
(12.08.07 20:30), Andrew Beekhof wrote:
> On Wed, Aug 1, 2012 at 7:26 PM, Kazunori INOUE
> <inouekazu at intellilink.co.jp> wrote:
>> Hi,
>>
>> This problem has not been fixed yet (as of 2012 Jul 29, 33119da31c).
>> When stonithd terminates abnormally, shouldn't crmd restart, just as
>> it does when lrmd is terminated?
>>
>> The following patch restarts crmd if the connection with stonithd
>> breaks. I confirmed that it fixes this problem, but I cannot gauge
>> the full extent of its impact...
>
> It's quite severe :-)
> I'd like to see if we can come up with something a little less brutal.
>
> Could you file a bugzilla for me please?
>
>>
>> [root at dev1 pacemaker]# git diff
>> diff --git a/crmd/te_utils.c b/crmd/te_utils.c
>> index f6a7550..deb4513 100644
>> --- a/crmd/te_utils.c
>> +++ b/crmd/te_utils.c
>> @@ -83,6 +83,7 @@ tengine_stonith_connection_destroy(stonith_t * st, stonith_event_t *e)
>>  {
>>      if (is_set(fsa_input_register, R_ST_REQUIRED)) {
>>          crm_crit("Fencing daemon connection failed");
>> +        register_fsa_input(C_FSA_INTERNAL, I_ERROR, NULL);
>>          mainloop_set_trigger(stonith_reconnect);
>>
>>      } else {
>> [root at dev1 pacemaker]#
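>>
>> For reference, a minimal sketch of the whole handler with this change
>> applied. Only the register_fsa_input() line is from the patch; the
>> surrounding code is reconstructed from the hunk above, and the else
>> branch is an approximation since the diff elides it:
>>
>> static void
>> tengine_stonith_connection_destroy(stonith_t * st, stonith_event_t * e)
>> {
>>     if (is_set(fsa_input_register, R_ST_REQUIRED)) {
>>         crm_crit("Fencing daemon connection failed");
>>         /* New: feed an internal error into the FSA so that crmd
>>          * restarts, mirroring the handling of a lost lrmd connection. */
>>         register_fsa_input(C_FSA_INTERNAL, I_ERROR, NULL);
>>         /* Ask the mainloop to schedule a reconnect to stonithd. */
>>         mainloop_set_trigger(stonith_reconnect);
>>
>>     } else {
>>         /* Approximation: the connection is no longer required
>>          * (e.g. we are shutting down), so just note the disconnect. */
>>         crm_info("Fencing daemon disconnected");
>>     }
>> }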
>>
>> Best regards,
>> Kazunori INOUE
>>
>>
>> (12.05.09 16:11), Andrew Beekhof wrote:
>>>
>>> On Mon, May 7, 2012 at 7:52 PM, Kazunori INOUE
>>> <inouekazu at intellilink.co.jp> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On the Pacemaker-1.1 + Corosync stack, stonithd is respawned after
>>>> an abnormal termination, but STONITH is never performed after that.
>>>>
>>>> I am using the latest development code:
>>>> - pacemaker : db5e16736cc2682fbf37f81cd47be7d17d5a2364
>>>> - corosync : 88dd3e1eeacd64701d665f10acbc40f3795dd32f
>>>> - glue : 2686:66d5f0c135c9
>>>>
>>>>
>>>> * 0. cluster's state.
>>>>
>>>> [root at vm1 ~]# crm_mon -r1
>>>> ============
>>>> Last updated: Wed May 2 16:07:29 2012
>>>> Last change: Wed May 2 16:06:35 2012 via cibadmin on vm1
>>>> Stack: corosync
>>>> Current DC: vm1 (1) - partition WITHOUT quorum
>>>> Version: 1.1.7-db5e167
>>>> 2 Nodes configured, unknown expected votes
>>>> 3 Resources configured.
>>>> ============
>>>>
>>>> Online: [ vm1 vm2 ]
>>>>
>>>> Full list of resources:
>>>>
>>>> prmDummy (ocf::pacemaker:Dummy): Started vm2
>>>> prmStonith1 (stonith:external/libvirt): Started vm2
>>>> prmStonith2 (stonith:external/libvirt): Started vm1
>>>>
>>>> [root at vm1 ~]# crm configure show
>>>> node $id="1" vm1
>>>> node $id="2" vm2
>>>> primitive prmDummy ocf:pacemaker:Dummy \
>>>> op start interval="0s" timeout="60s" on-fail="restart" \
>>>> op monitor interval="10s" timeout="60s" on-fail="fence" \
>>>> op stop interval="0s" timeout="60s" on-fail="stop"
>>>> primitive prmStonith1 stonith:external/libvirt \
>>>> params hostlist="vm1" hypervisor_uri="qemu+ssh://f/system" \
>>>> op start interval="0s" timeout="60s" \
>>>> op monitor interval="3600s" timeout="60s" \
>>>> op stop interval="0s" timeout="60s"
>>>> primitive prmStonith2 stonith:external/libvirt \
>>>> params hostlist="vm2" hypervisor_uri="qemu+ssh://g/system" \
>>>> op start interval="0s" timeout="60s" \
>>>> op monitor interval="3600s" timeout="60s" \
>>>> op stop interval="0s" timeout="60s"
>>>> location rsc_location-prmDummy prmDummy \
>>>> rule $id="rsc_location-prmDummy-rule" 200: #uname eq vm2
>>>> location rsc_location-prmStonith1 prmStonith1 \
>>>> rule $id="rsc_location-prmStonith1-rule" 200: #uname eq vm2 \
>>>> rule $id="rsc_location-prmStonith1-rule-0" -inf: #uname eq vm1
>>>> location rsc_location-prmStonith2 prmStonith2 \
>>>> rule $id="rsc_location-prmStonith2-rule" 200: #uname eq vm1 \
>>>> rule $id="rsc_location-prmStonith2-rule-0" -inf: #uname eq vm2
>>>> property $id="cib-bootstrap-options" \
>>>> dc-version="1.1.7-db5e167" \
>>>> cluster-infrastructure="corosync" \
>>>> no-quorum-policy="ignore" \
>>>> stonith-enabled="true" \
>>>> startup-fencing="false" \
>>>> stonith-timeout="120s"
>>>> rsc_defaults $id="rsc-options" \
>>>> resource-stickiness="INFINITY" \
>>>> migration-threshold="1"
>>>>
>>>>
>>>> * 1. terminate stonithd forcibly.
>>>>
>>>> [root at vm1 ~]# pkill -9 stonithd
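>>>>
>>>> (One way to confirm that the cluster respawns stonithd, assuming
>>>> pgrep/pkill from procps, is to compare PIDs around the kill:)
>>>>
>>>> [root at vm1 ~]# pgrep stonithd # note the old PID
>>>> [root at vm1 ~]# pkill -9 stonithd
>>>> [root at vm1 ~]# pgrep stonithd # a different PID means it respawned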
>>>>
>>>>
>>>> * 2. I trigger STONITH (by making prmDummy's monitor fail on vm2),
>>>> but stonithd reports that no matching device is found and no fencing
>>>> occurs.
>>>>
>>>> [root at vm1 ~]# ssh vm2 'rm /var/run/Dummy-prmDummy.state'
>>>> [root at vm1 ~]# grep Found /var/log/ha-debug
>>>> May 2 16:13:07 vm1 stonith-ng[15115]: debug: stonith_query: Found 0
>>>> matching devices for 'vm2'
>>>> May 2 16:13:19 vm1 stonith-ng[15115]: debug: stonith_query: Found 0
>>>> matching devices for 'vm2'
>>>> May 2 16:13:31 vm1 stonith-ng[15115]: debug: stonith_query: Found 0
>>>> matching devices for 'vm2'
>>>> May 2 16:13:43 vm1 stonith-ng[15115]: debug: stonith_query: Found 0
>>>> matching devices for 'vm2'
>>>> (snip)
>>>> [root at vm1 ~]#
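>>>>
>>>> (This suggests the respawned stonith-ng has lost its device
>>>> registrations. Assuming a stonith_admin build that supports
>>>> --list-registered, this can be checked directly:)
>>>>
>>>> [root at vm1 ~]# stonith_admin --list-registered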
>>>>
>>>>
>>>> After stonithd respawns, it seems that the STONITH resource or lrmd
>>>> needs to be restarted, presumably so that the devices are registered
>>>> with the new stonithd instance... is this the designed behavior?
>>>
>>>
>>> No, that sounds like a bug.
>>>
>>>>
>>>> # crm resource restart <STONITH resource (prmStonith2)>
>>>> or
>>>> # /usr/lib64/heartbeat/lrmd -r (on the node which stonithd rebooted)
>>>>
>>>> ----
>>>> Best regards,
>>>> Kazunori INOUE
>>>>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org