[Pacemaker] Re: Problems when DC node is STONITH'ed.

Thu Oct 16 10:21:13 UTC 2008

Hi Satomi-san,

On Thu, Oct 16, 2008 at 03:43:36PM +0900, Satomi TANIGUCHI wrote:
> Hi Dejan,
>
>
> Dejan Muhamedagic wrote:
>> Hi Satomi-san,
>>
>> On Tue, Oct 14, 2008 at 07:07:00PM +0900, Satomi TANIGUCHI wrote:
>>> Hi,
>>>
>>> I found that there are 2 problems when DC node is STONITH'ed.
>>> (1) STONITH operation is executed two times.
>>
>> This has been discussed at length in bugzilla, see
>>
>> http://developerbugs.linux-foundation.org/show_bug.cgi?id=1904
>>
>> which was resolved with WONTFIX. In short, it was deemed too risky
>> to implement a remedy for this problem.  Of course, if you think
>> you can add more to the discussion, please go ahead.
> Sorry, I missed it.

Well, you couldn't have known about it :)

> Thank you for pointing it out!
> I understand how it came about.
>
> Ideally, when the DC node is about to be STONITH'ed,
> a new DC node would be elected and it would STONITH the ex-DC;
> then these problems would not occur.
> But maybe that is not a good approach in an emergency,
> because the ex-DC should be STONITH'ed as soon as possible.

Yes, you're right about this.

> Anyway, I understand this is expected behavior, thanks!
> But then, it seems that the tengine still needs a timeout while waiting
> for stonithd's result, and a long cluster-delay is still required.

If I understood Andrew correctly, the tengine will wait forever,
until stonithd sends a message. Or dies which, let's hope, won't
happen.

> Because the second STONITH is requested when that transition times out.
> I'm afraid that I misunderstood the true meaning of what Andrew said.

In the bugzilla? If so, please reopen and voice your concerns.

>>> (2) The timeout that stonithd on the DC node uses while waiting for
>>>     the result of a STONITH op from another node is
>>>     always set to "stonith-timeout" in <cluster_property_set>.
>>> [...]
>>> The case (2):
>>> When this timeout occurs in stonithd on the DC
>>> while a non-DC node's stonithd is trying to reset the DC,
>>> DC-stonithd will send a request to the other node,
>>> and two or more STONITH plugins are executed in parallel.
>>> This is a troublesome problem.
>>> The most suitable value as this timeout might be
>>> the sum total of "stonith-timeout" of STONITH plugins on the node
>>> which is going to receive the STONITH request from DC node, I think.
>>
>> This would probably be very difficult for the CRM to get.
> Right, I agree with you.
> By the following sentence, "But DC node can't know that...", I meant
> "it is difficult because stonithd on the DC can't know the values of
> stonith-timeout on the other nodes."
>>
>>> But DC node can't know that...
>>> I would like to hear your opinions.
>>
>> Sorry, but I couldn't exactly follow. Could you please describe
>> it in terms of actions.
> Sorry, let me restate what I meant.
> The timeout that stonithd on the DC uses while waiting for another node's
> stonithd to reply really ought to be longer than the sum of the
> "stonith-timeout" values of the STONITH plugins on that node.
> But it is very difficult for DC-stonithd to get those values.
> So I would like to hear your opinion on what a suitable and practical
> value would be for this timeout, which is set in insert_into_executing_queue().
> I hope I have conveyed what I wanted to say.

OK, I suppose I understand now. You're talking about the timeouts
for remote fencing operations, right? And the originating
stonithd hasn't got a clue on how long the remote fencing
operation may take. Well, that could be a problem. I can't think
of anything to resolve that completely, not without "rewiring"
stonithd. stonithd broadcasts the request so there's no way for
it to know who's doing what and when and how long it can take.

The only workaround I can think of is to use the global (cluster
property) stonith-timeout which should be set to the maximum sum
of stonith timeouts for a node.
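
A rough sketch of that calculation, in Python, just to make the
arithmetic concrete. Nothing here is stonithd or CRM code: the node
names, device names, timeout values and the function name are all
made up for illustration.

```python
# Hypothetical per-node STONITH device timeouts, in seconds.
# Names and numbers are illustrative only, not taken from a real CIB.
device_timeouts = {
    "node1": {"stonithA": 60, "stonithB": 120},
    "node2": {"stonithA": 390},
    "node3": {"stonithA": 30, "stonithB": 30, "stonithC": 60},
}


def min_global_stonith_timeout(timeouts_by_node):
    """Return the largest per-node sum of device timeouts (seconds).

    This is only a lower bound for the cluster-wide stonith-timeout;
    a margin should still be added on top of it.
    """
    return max(
        (sum(devices.values()) for devices in timeouts_by_node.values()),
        default=0,
    )


print(min_global_stonith_timeout(device_timeouts))
```

With the made-up values above the per-node sums are 180, 390 and 120
seconds, so the global stonith-timeout should be at least 390s.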

Now, back to reality ;-)  Timeouts are important, of course, but
one should usually leave a generous margin on top of the expected
duration. For instance, if the normal timeout for an operation on
a device is 30 seconds, there's nothing wrong with setting it to,
say, one or two minutes. The consequences of an operation ending
prematurely are much more serious than if one waits a bit longer.
After all, if there's something really wrong, it is usually
detected early and the error reported immediately. Of course,
one shouldn't follow this advice blindly. Know your cluster!

> For reference, I attached logs from when the aforesaid timeout occurs.
> The cluster has 3 nodes.
> When the DC was going to be STONITH'ed, the DC sent a request to all of the
> non-DC nodes, and all of them tried to shut down the DC.

No, the tengine (running on DC) always talks to the local
stonithd.

> And when the timeout on DC-stonithd occurred, DC-stonithd sent the same request,
> then two or more STONITH plugins ran in parallel on every non-DC node.
> (Please see sysstats.txt.)
>
> I want to make clear whether the current behavior is expected or a bug.

That's actually wrong, but could be considered a configuration
problem:

  <cluster_property_set id="cib-bootstrap-options">
  ...
        <nvpair id="nvpair.id2000009" name="stonith-timeout" value="260s"/>
  ...
  <primitive id="prmStonithN1" class="stonith" type="external/ssh">
  ...
        <nvpair id="nvpair.id2000602" name="stonith-timeout" value="390s"/>

The stonithd initiator (the one running on the DC) times out
before the remote fencing operation. On retry a second remote
fencing operation is started. That's why you see two of them.
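
The comparison can be sketched in a few lines of Python. This is only
an illustration of the check, not something stonithd does; the
function names are made up, and the parser assumes plain-seconds
values such as the "260s" and "390s" shown above.

```python
def parse_seconds(value):
    """Parse a timeout such as "260s" or "260" into whole seconds.

    Assumption: only plain-seconds values are handled; anything else
    (for example "500ms") raises ValueError.
    """
    text = value.strip()
    if text.endswith("s"):
        text = text[:-1]
    return int(text)


def initiator_times_out_first(cluster_timeout, device_timeout):
    """True if the initiating side gives up before the device timeout."""
    return parse_seconds(cluster_timeout) < parse_seconds(device_timeout)


# Values from the configuration excerpt above.
print(initiator_times_out_first("260s", "390s"))  # True
```

With 260s on the cluster property and 390s on the device, the
initiator gives up 130 seconds before the remote operation is allowed
to finish.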

Anyway, you can open a bugzilla for this, because the stonithd on
a remote host should know that there's already one operation
running. Unfortunately, I'm busy with more urgent matters right
now, so it may take a few weeks until I take a look at it.
As usual, patches are welcome :)

Thanks,

Dejan

> But I consider that the root of every problem is that the node which sends the
> STONITH request and waits for completion of the op is the one being killed.
>
>
> Regards,
> Satomi TANIGUCHI
>
>
>>
>> Thanks,
>>
>> Dejan
>>
>>> Best Regards,
>>> Satomi TANIGUCHI
>>
>>
>>> _______________________________________________________
>>> Linux-HA-Dev: Linux-HA-Dev at lists.linux-ha.org
>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
>>> Home Page: http://linux-ha.org/
>>

> _______________________________________________
> Pacemaker mailing list
> Pacemaker at clusterlabs.org
> http://list.clusterlabs.org/mailman/listinfo/pacemaker