[Pacemaker] RFC: What part of the XML configuration do you hate the most?
Satomi Taniguchi
taniguchis at intellilink.co.jp
Tue Aug 12 13:52:05 CEST 2008
Hi Andrew,
Andrew Beekhof wrote:
>
(snip)
>
> no, i'm indicating that you've underestimated the scope of the problem
>
(snip)
Bugzilla #1601 is caused by moving healthy resources within the STONITH
ordering, isn't it?
I changed nothing about the STONITH action when I implemented
on_fail="standby".
When a stop operation fails or Split-Brain occurs,
I completely agree that on_fail should be "fence".
But my concern is the failure of start or monitor operations,
and on_fail="standby" assumes that it is used only for these operations.
Its purpose is not to move healthy resources before doing STONITH,
but to move all resources away from the node on which a resource has failed.
And for any operation, Bugzilla #1601 does not occur, because I changed
nothing about STONITH.
STONITH does not require any resources to be stopped.
The following is why I attach importance to start and monitor operations.
My concerns are:
- 1) On a resource's failure, only the failed resource
and resources which are in the same group move from
the failed node.
-> At present, to move all resources (even those that are not
in the group or have no constraints) away from
the failed node automatically, on_fail has to be set to
"fence" not only for stop but also for start and monitor,
and the failed node has to be killed by STONITH.
- 2) (In connection with 1) When resources are moved away because a
start or monitor operation failed, they should be shut down normally.
-> This sounds entirely natural, but it is impossible
if you follow the workaround in 1).
-> Of course, I know that the failed node has to be killed
immediately if a stop operation fails or Split-Brain occurs.
- 3) Rebooting the failed node may destroy the evidence of
the real cause of a failure
(which amounts to administrators being unable to analyse the failure).
-> This is as Keisuke-san wrote before.
It is a really serious matter in enterprise services.
To solve the matters above, I implemented on_fail="standby".
If you have any other ideas to solve them, please let me know.
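As a rough illustration of points 1) to 3), here is a toy Python model. It is not Pacemaker code: the function name `outcome`, the groups grpA/grpB with their resources, and the label "default" (standing for any on_fail value other than "standby" or "fence") are made up for this sketch. Only the values "standby" and "fence" come from this mail.

```python
# Toy model only; this is not Pacemaker code.
# "default" is a hypothetical label for any on_fail value other than
# "standby" or "fence"; grpA/grpB and their resources are made up.

def outcome(groups, failed_group, on_fail):
    """Describe what happens to the resources on a node after a start or
    monitor failure in `failed_group`.

    groups: dict mapping group name -> list of resource names on the node.
    """
    everything = [rsc for members in groups.values() for rsc in members]
    if on_fail == "fence":
        # All resources move, but only because the node is killed (points 2, 3).
        return {"moved": everything, "clean_stop": False, "node_killed": True}
    if on_fail == "standby":
        # All resources move after a normal stop; the node stays up.
        return {"moved": everything, "clean_stop": True, "node_killed": False}
    # Otherwise only the failed group moves away (point 1).
    return {"moved": list(groups[failed_group]),
            "clean_stop": True, "node_killed": False}


groups = {"grpA": ["ipA", "dbA"], "grpB": ["ipB", "webB"]}
for policy in ("default", "fence", "standby"):
    print(policy, outcome(groups, "grpA", policy))
```

Under these assumptions, "standby" is the only setting where every resource leaves the node, each is stopped normally, and the node stays up for later analysis.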
Just for reference, there is an example in the attached files:
a resource group named "grpPostgreSQLDB", which consists of
IPaddr ("prmIpPostgreSQLDB") and pgsql ("prmApPostgreSQLDB"), is running on
node2.
(See: crm_mon_before.log)
I modified pgsql's stop function to always return $OCF_ERR_GENERIC.
When the IPaddr resource failed, with its monitor's on_fail set to "standby",
pgsql tried to stop but failed.
(See: pe-warn-0.node2.gif)
Then STONITH was executed according to the setting of pgsql's stop
operation, on_fail="fence".
(See: pe-warn-1.node2.gif and pe-warn-0.node1.gif)
STONITH killed node2 pitilessly, and both resources of the group moved
to node1 peacefully.
(See: crm_mon_after.log)
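The sequence in this example can be traced with a small Python sketch. It only models the behaviour described above and is not Pacemaker's implementation: the function `recover` and its event strings are hypothetical, and stopping the group members in reverse order is an assumption.

```python
# Toy trace of the example; not Pacemaker code.
# Assumption: group members are stopped in reverse of their listed order.

def recover(resources, failed_op, on_fail, stop_fails=()):
    """Return the list of events after `failed_op` fails on a node.

    resources:  resource names on the node, in start order.
    on_fail:    dict mapping operation name -> "standby" or "fence".
    stop_fails: resources whose stop operation is known to fail.
    """
    if on_fail[failed_op] == "fence":
        return ["fence node"]
    events = []
    if on_fail[failed_op] == "standby":
        events.append("node -> standby")
        for rsc in reversed(resources):
            if rsc in stop_fails:
                events.append(f"stop {rsc}: FAILED")
                # A failed stop escalates according to the stop setting.
                if on_fail["stop"] == "fence":
                    events.append("fence node")
                return events
            events.append(f"stop {rsc}: ok")
    return events


policy = {"start": "standby", "monitor": "standby", "stop": "fence"}
group = ["prmIpPostgreSQLDB", "prmApPostgreSQLDB"]

# The IPaddr monitor fails and pgsql's stop always returns an error.
for event in recover(group, "monitor", policy, stop_fails={"prmApPostgreSQLDB"}):
    print(event)
```

With a working stop function, the same call without `stop_fails` ends after both resources are stopped cleanly, and no fencing event appears in the trace.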
Best Regards,
Satomi Taniguchi
Andrew Beekhof wrote:
>
> On Aug 4, 2008, at 8:11 AM, Satomi Taniguchi wrote:
>
>> Hi Andrew,
>>
>> Thank you for your opinions!
>> But I'm afraid that you've misunderstood my intentions...
>
> no, i'm indicating that you've underestimated the scope of the problem
>
>>
>>
>> Andrew Beekhof wrote:
>> (snip)
>>> Two problems...
>>> The first is that standby happens after the fencing event, so it's
>>> not really doing anything to migrate the healthy resources.
>>
>> In the graph, the object "stonith-1 stop 0 rh5node1" just means
>> "a plugin named stonith-1 on rh5node1 stops",
>> not "fencing event occurs".
>>
>> For example, Node1 has two resource groups.
>> When a resource in one group fails,
>> all resources in both groups are stopped completely,
>> and the stonith plugin on Node1 is stopped.
>> After this, both resource groups run on Node2.
>> I attached a graph, cib.xml,
>> and crm_mon's logs (before and after a resource broke down).
>> Please see them.
>>
>>
>>> Stop RscZ -(depends on)-> Stop RscY -(depends on)-> Stonith NodeX
>>> -(depends on)-> Stop RscZ -(depends on)-> ...
>> I just want to stop all resources without STONITH when monitor fails;
>> I don't want to change any actions when stop fails.
>> The setting on_fail="standby" is for the start or monitor operation, and
>> it is on condition that the stop operation's on_fail is set to
>> "fence".
>> Then STONITH is not executed when start or monitor fails,
>> but it is executed when stop fails.
>>
>> So, if RscY's monitor operation fails,
>> its stop operation doesn't depend on "Stonith NodeX".
>> And if RscY fails to stop,
>> NodeX is turned off by STONITH, and the loop above does not occur.
>>
>>
>> Best Regards,
>> Satomi Taniguchi
>>
>>
>>
>> _______________________________________________
>> Pacemaker mailing list
>> Pacemaker at clusterlabs.org
>> http://list.clusterlabs.org/mailman/listinfo/pacemaker
-------------- next part --------------
A non-text attachment was scrubbed...
Name: failed_on_stop_op.zip
Type: application/x-zip-compressed
Size: 193036 bytes
Desc: not available
Url : http://list.clusterlabs.org/pipermail/pacemaker/attachments/20080812/f06c2e4a/failed_on_stop_op-0001.bin