[Pacemaker] STONITH is not performed after stonithd reboots
Kazunori INOUE
inouekazu at intellilink.co.jp
Wed Aug 8 07:59:42 UTC 2012
Thanks, Andrew.
I have opened a Bugzilla entry for this:
* http://bugs.clusterlabs.org/show_bug.cgi?id=5094
Best Regards,
Kazunori INOUE
(12.08.07 20:30), Andrew Beekhof wrote:
> On Wed, Aug 1, 2012 at 7:26 PM, Kazunori INOUE
> <inouekazu at intellilink.co.jp> wrote:
>> Hi,
>>
>> This problem has not been fixed yet (as of 2012 Jul 29, 33119da31c).
>> When stonithd terminates abnormally, shouldn't crmd restart, just as
>> it does when lrmd is terminated?
>>
>> The following patch restarts crmd if the connection with stonithd
>> breaks. I confirmed that it fixes this problem, but I cannot gauge
>> the full extent of its impact...
>
> It's quite severe :-)
> I'd like to see if we can come up with something a little less brutal.
>
> Could you file a bugzilla for me please?
>
>>
>> [root at dev1 pacemaker]# git diff
>> diff --git a/crmd/te_utils.c b/crmd/te_utils.c
>> index f6a7550..deb4513 100644
>> --- a/crmd/te_utils.c
>> +++ b/crmd/te_utils.c
>> @@ -83,6 +83,7 @@ tengine_stonith_connection_destroy(stonith_t * st, stonith_event_t *e)
>>  {
>>      if (is_set(fsa_input_register, R_ST_REQUIRED)) {
>>          crm_crit("Fencing daemon connection failed");
>> +        register_fsa_input(C_FSA_INTERNAL, I_ERROR, NULL);
>>          mainloop_set_trigger(stonith_reconnect);
>>
>>      } else {
>> [root at dev1 pacemaker]#
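>>
>> For reference, a minimal sketch of the whole handler with this change
>> applied. Only the register_fsa_input() line is from the patch; the
>> surrounding code is reconstructed from the hunk above, and the else
>> branch is an approximation since the diff elides it:
>>
>> static void
>> tengine_stonith_connection_destroy(stonith_t * st, stonith_event_t * e)
>> {
>>     if (is_set(fsa_input_register, R_ST_REQUIRED)) {
>>         crm_crit("Fencing daemon connection failed");
>>         /* New: feed an internal error into the FSA so that crmd
>>          * restarts, mirroring the handling of a lost lrmd connection. */
>>         register_fsa_input(C_FSA_INTERNAL, I_ERROR, NULL);
>>         /* Ask the mainloop to schedule a reconnect to stonithd. */
>>         mainloop_set_trigger(stonith_reconnect);
>>
>>     } else {
>>         /* Approximation: the connection is no longer required
>>          * (e.g. we are shutting down), so just note the disconnect. */
>>         crm_info("Fencing daemon disconnected");
>>     }
>> }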
>>
>> Best regards,
>> Kazunori INOUE
>>
>>
>> (12.05.09 16:11), Andrew Beekhof wrote:
>>>
>>> On Mon, May 7, 2012 at 7:52 PM, Kazunori INOUE
>>> <inouekazu at intellilink.co.jp> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On the Pacemaker-1.1 + Corosync stack, stonithd is respawned after
>>>> an abnormal termination, but STONITH is never performed after that.
>>>>
>>>> I am using the latest development code:
>>>> - pacemaker : db5e16736cc2682fbf37f81cd47be7d17d5a2364
>>>> - corosync : 88dd3e1eeacd64701d665f10acbc40f3795dd32f
>>>> - glue : 2686:66d5f0c135c9
>>>>
>>>>
>>>> * 0. cluster's state.
>>>>
>>>> [root at vm1 ~]# crm_mon -r1
>>>> ============
>>>> Last updated: Wed May 2 16:07:29 2012
>>>> Last change: Wed May 2 16:06:35 2012 via cibadmin on vm1
>>>> Stack: corosync
>>>> Current DC: vm1 (1) - partition WITHOUT quorum
>>>> Version: 1.1.7-db5e167
>>>> 2 Nodes configured, unknown expected votes
>>>> 3 Resources configured.
>>>> ============
>>>>
>>>> Online: [ vm1 vm2 ]
>>>>
>>>> Full list of resources:
>>>>
>>>> prmDummy (ocf::pacemaker:Dummy): Started vm2
>>>> prmStonith1 (stonith:external/libvirt): Started vm2
>>>> prmStonith2 (stonith:external/libvirt): Started vm1
>>>>
>>>> [root at vm1 ~]# crm configure show
>>>> node $id="1" vm1
>>>> node $id="2" vm2
>>>> primitive prmDummy ocf:pacemaker:Dummy \
>>>> op start interval="0s" timeout="60s" on-fail="restart" \
>>>> op monitor interval="10s" timeout="60s" on-fail="fence" \
>>>> op stop interval="0s" timeout="60s" on-fail="stop"
>>>> primitive prmStonith1 stonith:external/libvirt \
>>>> params hostlist="vm1" hypervisor_uri="qemu+ssh://f/system" \
>>>> op start interval="0s" timeout="60s" \
>>>> op monitor interval="3600s" timeout="60s" \
>>>> op stop interval="0s" timeout="60s"
>>>> primitive prmStonith2 stonith:external/libvirt \
>>>> params hostlist="vm2" hypervisor_uri="qemu+ssh://g/system" \
>>>> op start interval="0s" timeout="60s" \
>>>> op monitor interval="3600s" timeout="60s" \
>>>> op stop interval="0s" timeout="60s"
>>>> location rsc_location-prmDummy prmDummy \
>>>> rule $id="rsc_location-prmDummy-rule" 200: #uname eq vm2
>>>> location rsc_location-prmStonith1 prmStonith1 \
>>>> rule $id="rsc_location-prmStonith1-rule" 200: #uname eq vm2 \
>>>> rule $id="rsc_location-prmStonith1-rule-0" -inf: #uname eq vm1
>>>> location rsc_location-prmStonith2 prmStonith2 \
>>>> rule $id="rsc_location-prmStonith2-rule" 200: #uname eq vm1 \
>>>> rule $id="rsc_location-prmStonith2-rule-0" -inf: #uname eq vm2
>>>> property $id="cib-bootstrap-options" \
>>>> dc-version="1.1.7-db5e167" \
>>>> cluster-infrastructure="corosync" \
>>>> no-quorum-policy="ignore" \
>>>> stonith-enabled="true" \
>>>> startup-fencing="false" \
>>>> stonith-timeout="120s"
>>>> rsc_defaults $id="rsc-options" \
>>>> resource-stickiness="INFINITY" \
>>>> migration-threshold="1"
>>>>
>>>>
>>>> * 1. terminate stonithd forcibly.
>>>>
>>>> [root at vm1 ~]# pkill -9 stonithd
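>>>>
>>>> (One way to confirm that the cluster respawns stonithd, assuming
>>>> pgrep/pkill from procps, is to compare PIDs around the kill:)
>>>>
>>>> [root at vm1 ~]# pgrep stonithd # note the old PID
>>>> [root at vm1 ~]# pkill -9 stonithd
>>>> [root at vm1 ~]# pgrep stonithd # a different PID means it respawned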
>>>>
>>>>
>>>> * 2. I trigger STONITH (by making prmDummy's monitor fail on vm2),
>>>> but stonithd reports that no matching device is found and no fencing
>>>> occurs.
>>>>
>>>> [root at vm1 ~]# ssh vm2 'rm /var/run/Dummy-prmDummy.state'
>>>> [root at vm1 ~]# grep Found /var/log/ha-debug
>>>> May 2 16:13:07 vm1 stonith-ng[15115]: debug: stonith_query: Found 0
>>>> matching devices for 'vm2'
>>>> May 2 16:13:19 vm1 stonith-ng[15115]: debug: stonith_query: Found 0
>>>> matching devices for 'vm2'
>>>> May 2 16:13:31 vm1 stonith-ng[15115]: debug: stonith_query: Found 0
>>>> matching devices for 'vm2'
>>>> May 2 16:13:43 vm1 stonith-ng[15115]: debug: stonith_query: Found 0
>>>> matching devices for 'vm2'
>>>> (snip)
>>>> [root at vm1 ~]#
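>>>>
>>>> (This suggests the respawned stonith-ng has lost its device
>>>> registrations. Assuming a stonith_admin build that supports
>>>> --list-registered, this can be checked directly:)
>>>>
>>>> [root at vm1 ~]# stonith_admin --list-registered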
>>>>
>>>>
>>>> After stonithd respawns, it seems that the STONITH resource or lrmd
>>>> needs to be restarted, presumably so that the devices are registered
>>>> with the new stonithd instance... is this the designed behavior?
>>>
>>>
>>> No, that sounds like a bug.
>>>
>>>>
>>>> # crm resource restart <STONITH resource (prmStonith2)>
>>>> or
>>>> # /usr/lib64/heartbeat/lrmd -r (on the node which stonithd rebooted)
>>>>
>>>> ----
>>>> Best regards,
>>>> Kazunori INOUE
>>>>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org