[Pacemaker] RFC: What part of the XML configuration do you hate the most?
Satomi Taniguchi
taniguchis at intellilink.co.jp
Tue Aug 12 11:52:05 UTC 2008
Hi Andrew,
Andrew Beekhof wrote:
>
(snip)
>
> no, i'm indicating that you've underestimated the scope of the problem
>
(snip)
Bugzilla #1601 is caused by moving healthy resources within the STONITH
ordering, isn't it?
I changed nothing about the STONITH action when I implemented on_fail="standby".
When a stop operation fails or Split-Brain occurs,
I completely agree that on_fail should be "fence".
But my concern is the failure of start or monitor operations,
and on_fail="standby" assumes that it is used only for
these operations.
Its purpose is not to move healthy resources before doing STONITH,
but to move all resources away from the node on which a resource has failed.
And for any operation, Bugzilla #1601 doesn't occur because I changed
nothing about STONITH.
STONITH doesn't require any resources to be stopped.
The following is why I attach importance to start and monitor operations.
My main concerns are:
- 1) On a resource's failure, only the failed resource
and the resources in the same group move away from
the failed node.
-> At present, to move all resources (even those that are not
in the group or have no constraints) away from
the failed node automatically, on_fail has to be set to
"fence" not only for stop but also for start and monitor,
and the failed node has to be killed by STONITH.
- 2) (Related to 1) When resources are moved away because a start
or monitor operation failed, they should be shut down normally.
-> This sounds entirely obvious, but it is impossible
if you follow the approach in 1).
-> Of course, I know that I have to kill the failed node
immediately if a stop operation fails or Split-Brain occurs.
- 3) Rebooting the failed node may destroy the evidence of
the real cause of the failure
(in effect, administrators can't analyse the failure).
-> This is as Keisuke-san wrote before.
It is a really serious matter in enterprise services.
To solve the problems above, I implemented on_fail="standby".
If you have any other ideas to solve them, please let me know.
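
To make the three points above concrete, here is a minimal Python sketch
of how the outcome differs by on_fail setting. It is only a simplified
model of the behaviour described in this mail, not Pacemaker code. The
names plan_recovery, Plan, and the resources ipA/dbA/ipB/dbB are
hypothetical, and "restart" stands in for the present default behaviour
as described in 1).

```python
from dataclasses import dataclass


@dataclass
class Plan:
    moved: list          # resources that leave the failed node
    graceful_stop: bool  # True if the resources are shut down normally
    node_rebooted: bool  # True if the node is killed by STONITH


def plan_recovery(on_fail, failed, groups):
    """Model the outcome of a start/monitor failure (hypothetical helper).

    groups maps a group name to the resources running on the failed node.
    """
    everything = sorted(r for members in groups.values() for r in members)
    if on_fail == "fence":
        # Everything moves, but only because the node is killed (points 2, 3).
        return Plan(everything, graceful_stop=False, node_rebooted=True)
    if on_fail == "standby":
        # Everything is stopped normally and moved; the node stays up.
        return Plan(everything, graceful_stop=True, node_rebooted=False)
    # Simplified present behaviour: only the failed resource's group moves.
    same_group = next(m for m in groups.values() if failed in m)
    return Plan(sorted(same_group), graceful_stop=True, node_rebooted=False)


groups = {"grpA": ["ipA", "dbA"], "grpB": ["ipB", "dbB"]}
for policy in ("restart", "fence", "standby"):
    print(policy, plan_recovery(policy, "dbA", groups))
```

Only the "standby" case moves every resource while keeping both a normal
shutdown and the state of the failed node for later analysis.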
Just for reference, there is an example in the attached files:
a resource group named "grpPostgreSQLDB", which consists of
IPaddr ("prmIpPostgreSQLDB") and pgsql ("prmApPostgreSQLDB"), is running on
node2.
(See: crm_mon_before.log)
I modified pgsql's stop function to always return $OCF_ERR_GENERIC.
When the IPaddr resource failed, with its monitor's on_fail set to "standby",
pgsql tried to stop but failed.
(See: pe-warn-0.node2.gif)
Then STONITH was executed according to the setting of pgsql's stop
operation, on_fail="fence".
(See: pe-warn-1.node2.gif and pe-warn-0.node1.gif)
STONITH killed node2 pitilessly, and both resources of the group moved
to node1 peacefully.
(See: crm_mon_after.log)
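
The sequence in this example can be sketched in Python as follows. This
is a rough model under stated assumptions, not the actual policy engine
logic: the function recover and its arguments are hypothetical, pgsql's
monitor setting is assumed (only its stop setting is given above), and
the resources are stopped in reverse group order.

```python
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1


def recover(node, resources, failed_rsc, failed_op, stop_rc):
    """Return the ordered actions taken for one failure (hypothetical model).

    resources: {name: {op_name: on_fail}} in group (start) order.
    stop_rc:   {name: return code of the stop operation}.
    Policies other than "fence" and "standby" are not modelled and
    yield an empty list.
    """
    actions = []
    policy = resources[failed_rsc][failed_op]
    if policy == "fence":
        return [f"stonith {node}"]
    if policy == "standby":
        actions.append(f"standby {node}")
        # A group stops its members in reverse start order.
        for name in reversed(list(resources)):
            actions.append(f"stop {name}")
            failed = stop_rc.get(name, OCF_SUCCESS) != OCF_SUCCESS
            if failed and resources[name]["stop"] == "fence":
                # A failed stop escalates to STONITH, as before the change.
                actions.append(f"stonith {node}")
                break
    return actions


group = {
    "prmIpPostgreSQLDB": {"monitor": "standby", "stop": "fence"},
    "prmApPostgreSQLDB": {"monitor": "standby", "stop": "fence"},  # monitor value assumed
}
steps = recover("node2", group, "prmIpPostgreSQLDB", "monitor",
                {"prmApPostgreSQLDB": OCF_ERR_GENERIC})
print(steps)  # ['standby node2', 'stop prmApPostgreSQLDB', 'stonith node2']
```

If the stop operations succeed instead, the same call produces only the
standby and stop actions, and no STONITH is executed.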
Best Regards,
Satomi Taniguchi
Andrew Beekhof wrote:
>
> On Aug 4, 2008, at 8:11 AM, Satomi Taniguchi wrote:
>
>> Hi Andrew,
>>
>> Thank you for your opinions!
>> But I'm afraid that you've misunderstood my intentions...
>
> no, i'm indicating that you've underestimated the scope of the problem
>
>>
>>
>> Andrew Beekhof wrote:
>> (snip)
>>> Two problems...
>>> The first is that standby happens after the fencing event, so it's
>>> not really doing anything to migrate the healthy resources.
>>
>> In the graph, the object "stonith-1 stop 0 rh5node1" just means
>> "a plugin named stonith-1 on rh5node1 stops",
>> not "fencing event occurs".
>>
>> For example, Node1 has two resource groups.
>> When a resource in one group fails,
>> all resources in both groups are stopped completely,
>> and the stonith plugin on Node1 is stopped.
>> After this, both resource groups run on Node2.
>> I attached a graph, cib.xml
>> and crm_mon's logs (before and after a resource broke down).
>> Please see them.
>>
>>
>>> Stop RscZ -(depends on)-> Stop RscY -(depends on)-> Stonith NodeX
>>> -(depends on)-> Stop RscZ -(depends on)-> ...
>> I just want to stop all resources without STONITH when monitor fails,
>> I don't want to change any actions when stop fails.
>> The setting on_fail="standby" is for the start or monitor operation, and
>> it is on condition that the setting of the stop operation's on_fail is
>> "fence".
>> Then, STONITH is not executed when start or monitor fails,
>> but it is executed when stop fails.
>>
>> So, if RscY's monitor operation fails,
>> its stop operation doesn't depend on "Stonith NodeX".
>> And if stopping RscY fails,
>> NodeX is turned off by STONITH, and the loop above does not occur.
>>
>>
>> Best Regards,
>> Satomi Taniguchi
>>
>>
>>
>> _______________________________________________
>> Pacemaker mailing list
>> Pacemaker at clusterlabs.org
>> http://list.clusterlabs.org/mailman/listinfo/pacemaker
-------------- next part --------------
A non-text attachment was scrubbed...
Name: failed_on_stop_op.zip
Type: application/x-zip-compressed
Size: 193036 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20080812/f06c2e4a/attachment-0002.bin>