[Pacemaker] Re: crm_mon shows nothing about stonith 'reset' failure

Tue Sep 16 10:25:15 UTC 2008

On Tue, Sep 16, 2008 at 11:48, Takenaka Kazuhiro
<takenaka.kazuhiro at oss.ntt.co.jp> wrote:
> Hi, Andrew
>
>> Nope.
>> This is not stored anywhere since there is nowhere it can be
>> reconstructed from (like the lrmd for resource operations) when
>> rebuilding the status section.
>
> Why does the current cib.xml definition have no room for
> stonith 'reset' failures? Is it simply not implemented, or is
> there another reason?

I already gave you the reason.

>> ... since there is nowhere it can be
>> reconstructed from (like the lrmd for resource operations) when
>> rebuilding the status section.

The whole status section is periodically reconstructed - so any
stonith failures that were recorded there could be lost at any time.
So rather than store inconsistent and possibly incorrect data, we
don't store anything.

>
>> And if your stonith resources are failing, a) you have bigger
>> problems, and b) you'll get nice big ERROR messages in the logs.
>
> a) I saw that 'dummy' didn't fail over. Is this one of the "bigger problems"?

Depends what 'dummy' is.
But assuming it's just a resource then no, that's the least of your problems.

STONITH is the single most critical part of the cluster.
Without a reliable STONITH mechanism, your cluster will not be able to
recover after some failures or, even worse, will try to recover when it
should not and corrupt all your data.

So if your STONITH mechanism is broken, then very clearly, _that_ is
your biggest problem.

>
> b) The only way to learn about stonith 'reset' failures is to watch
>   the logs. Do I understand correctly?

Unless something in stonithd changes. Yes.
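
As a rough illustration of watching the logs, here is a minimal sketch of a
shell helper that pulls out candidate stonith failures. It assumes the log
lives at /var/log/ha-log (the path mentioned elsewhere in this thread) and
that the relevant lines contain both "stonith" and "ERROR". The exact message
format is an assumption, and the function name stonith_errors is hypothetical,
so adjust the patterns to what your logs actually contain.

```shell
# Print log lines that mention "stonith" (any case) and also contain "ERROR".
# NOTE: the two patterns are assumptions, not the exact stonithd message format.
stonith_errors() {
    # Default to the log path used in this thread; pass another path as $1.
    logfile="${1:-/var/log/ha-log}"
    grep -i 'stonith' "$logfile" | grep 'ERROR'
}
```

Calling stonith_errors with no argument scans /var/log/ha-log. The same two
greps can be fed from "tail -f" if you want to watch for new failures as they
are logged.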

>
>> On Tue, Sep 16, 2008 at 03:11, Takenaka Kazuhiro
>> <takenaka.kazuhiro at oss.ntt.co.jp> wrote:
>>>
>>> > Hi All,
>>> >
>>> > I ran a test to see what would happen when stonith 'reset' failed.
>>> > Before the test, I thought 'crm_mon' should show something about the
>>> > failure.
>>
>> Nope.
>> This is not stored anywhere since there is nowhere it can be
>> reconstructed from (like the lrmd for resource operations) when
>> rebuilding the status section.
>>
>> And if your stonith resources are failing, a) you have bigger
>> problems, and b) you'll get nice big ERROR messages in the logs.
>>
>>> > But 'crm_mon' didn't show anything.
>>> >
>>> > What I did is the following.
>>> >
>>> > 1. I started the stonith-enabled two-node cluster. The names of
>>> >   the nodes were 'node01' and 'node02'.  See configuration files
>>> >   in attached 'hb_reports.tgz' for more details.
>>> >
>>> >   I made a few modifications to 'ssh' for the test and renamed it
>>> >   to 'sshTEST'. I also attached 'sshTEST'. The differences are
>>> >   written in it.
>>> >
>>> > 2. I ran the following command.
>>> >
>>> >   # iptables -A INPUT -i eth3 -p tcp --dport 22 -j REJECT
>>> >
>>> >   'eth3' is connected to the network for 'sshTEST'.
>>> >
>>> > 3. I deleted the state file of 'dummy' at 'node01'.
>>> >
>>> >   # rm -f /var/run/heartbeat/rsctmp/Dummy-dummy.state
>>> >
>>> > Soon the failure of 'dummy' was logged into /var/log/ha-log
>>> > and 'crm_mon' also displayed it.
>>> >
>>> > After a while the failure of 'reset' performed by 'sshTEST'
>>> > was also logged, but 'crm_mon' didn't display it.
>>> >
>>> > Did I make any misconfigurations or any misoperations that
>>> > made 'crm_mon' work incorrectly?
>>> >
>>> > Or does 'crm_mon' really not show anything about stonith 'reset'
>>> > failures?
>>> >
>>> > I used Heartbeat(e8154a602bf4) + Pacemaker(d4a14f276c28)
>>> > for this test.
>>> >
>>> > Best regards,
>>> > --
>>> > Takenaka Kazuhiro <takenaka.kazuhiro at oss.ntt.co.jp>
>
>
> --
> Takenaka Kazuhiro <takenaka.kazuhiro at oss.ntt.co.jp>