[Pacemaker] Re: crm_mon shows nothing about stonith 'reset' failure
Takenaka Kazuhiro
takenaka.kazuhiro at oss.ntt.co.jp
Wed Sep 17 01:07:02 UTC 2008
Hi Andrew,
> The whole status section is periodically reconstructed - so any
> stonith failures that were recorded there could be lost at any time.
> So rather than store inconsistent and possibly incorrect data, we
> don't store anything.
Thanks for the more detailed explanation.
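Just as a note for the archive, the status section being discussed can be
dumped with, for example:

  # cibadmin -Q -o status

and, as you explain, stonith failures are never stored there.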
> STONITH is the single most critical part of the cluster.
> Without a reliable STONITH mechanism, your cluster will not be able to
> recover after some failures or, even worse, try to recover when it
> should not have and corrupt all your data.
>
>
> So if your STONITH mechanism is broken, then very clearly, _that_ is
> your biggest problem.
>
>
>>
>> b) The only way to know stonith 'reset' failures is watching
>> the logs. Do I understand right?
>
> Unless something in stonithd changes. Yes.
Hmm... If STONITH is that important, then all the more reason there
should be an intuitive way to monitor its activity.
I will post my ideas if any come to mind.
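For now I will keep watching the logs on each node, something along
the lines of (assuming the default ha-log location from my test setup):

  # tail -f /var/log/ha-log | grep -i stonith

but that is much less convenient than seeing the failure in 'crm_mon'.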
> On Tue, Sep 16, 2008 at 11:48, Takenaka Kazuhiro
> <takenaka.kazuhiro at oss.ntt.co.jp> wrote:
>> Hi, Andrew
>>
>>> Nope.
>>> This is not stored anywhere since there is nowhere it can be
>>> reconstructed from (like the lrmd for resource operations) when
>>> rebuilding the status section.
>>
>> Why does the current cib.xml definition have no room for
>> stonith 'reset' failures? Simply not implemented? Or is
>> there any other reason?
>
> I already gave you the reason.
>
>>> ... since there is nowhere it can be
>>> reconstructed from (like the lrmd for resource operations) when
>>> rebuilding the status section.
>
> The whole status section is periodically reconstructed - so any
> stonith failures that were recorded there could be lost at any time.
> So rather than store inconsistent and possibly incorrect data, we
> don't store anything.
>
>>
>>> And if your stonith resources are failing, a) you have bigger
>>> problems, and b) you'll get nice big ERROR messages in the logs.
>>
>> a) I saw 'dummy' didn't fail over. Is this one of the "bigger problems"?
>
> Depends what 'dummy' is.
> But assuming it's just a resource then no, that's the least of your problems.
>
>
> STONITH is the single most critical part of the cluster.
> Without a reliable STONITH mechanism, your cluster will not be able to
> recover after some failures or, even worse, try to recover when it
> should not have and corrupt all your data.
>
>
> So if your STONITH mechanism is broken, then very clearly, _that_ is
> your biggest problem.
>
>
>>
>> b) The only way to know stonith 'reset' failures is watching
>> the logs. Do I understand right?
>
> Unless something in stonithd changes. Yes.
>
>>
>>> On Tue, Sep 16, 2008 at 03:11, Takenaka Kazuhiro
>>> <takenaka.kazuhiro at oss.ntt.co.jp> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> I ran a test to see what would happen when stonith 'reset' failed.
>>>> Before the test, I thought 'crm_mon' should show something about the
>>>> failure.
>>>
>>> Nope.
>>> This is not stored anywhere since there is nowhere it can be
>>> reconstructed from (like the lrmd for resource operations) when
>>> rebuilding the status section.
>>>
>>> And if your stonith resources are failing, a) you have bigger
>>> problems, and b) you'll get nice big ERROR messages in the logs.
>>>
>>>> But 'crm_mon' didn't show anything.
>>>>
>>>> What I did is the following.
>>>>
>>>> 1. I started a stonith-enabled two-node cluster. The names of
>>>> the nodes were 'node01' and 'node02'. See the configuration files
>>>> in the attached 'hb_reports.tgz' for more details.
>>>>
>>>> I made a few modifications to 'ssh' for the test and renamed it
>>>> to 'sshTEST'. I also attached 'sshTEST'. The differences are
>>>> written in it.
>>>>
>>>> 2. I performed the following command.
>>>>
>>>> # iptables -A INPUT -i eth3 -p tcp --dport 22 -j REJECT
>>>>
>>>> 'eth3' is connected to the network for 'sshTEST'.
>>>>
>>>> 3. I deleted the state file of 'dummy' at 'node01'.
>>>>
>>>> # rm -f /var/run/heartbeat/rsctmp/Dummy-dummy.state
>>>>
>>>> Soon the failure of 'dummy' was logged into /var/log/ha-log
>>>> and 'crm_mon' also displayed it.
>>>>
>>>> After a while the failure of the 'reset' performed by 'sshTEST'
>>>> was also logged, but 'crm_mon' didn't display it.
>>>>
>>>> Did I make any misconfiguration or perform any operation that
>>>> made 'crm_mon' work incorrectly?
>>>>
>>>> Or does 'crm_mon' really not show anything about stonith 'reset'
>>>> failures?
>>>>
>>>> I used Heartbeat(e8154a602bf4) + Pacemaker(d4a14f276c28)
>>>> for this test.
>>>>
>>>> Best regards.
>>>> --
>>>> Takenaka Kazuhiro <takenaka.kazuhiro at oss.ntt.co.jp>
>>
>>
>> --
>> Takenaka Kazuhiro <takenaka.kazuhiro at oss.ntt.co.jp>
>>
>> _______________________________________________
>> Pacemaker mailing list
>> Pacemaker at clusterlabs.org
>> http://list.clusterlabs.org/mailman/listinfo/pacemaker
>>
--
Takenaka Kazuhiro <takenaka.kazuhiro at oss.ntt.co.jp>