[Pacemaker] RFC: What part of the XML configuration do you hate the most?

Satomi Taniguchi taniguchis at intellilink.co.jp
Tue Aug 12 07:52:05 EDT 2008


Hi Andrew,

Andrew Beekhof wrote:
 >
(snip)
 >
 > no, i'm indicating that you've underestimated the scope of the problem
 >
(snip)


Bugzilla #1601 is caused by moving healthy resources as part of the
STONITH ordering, isn't it?
I changed nothing about the STONITH action when I implemented on_fail="standby".

When a stop operation fails or Split-Brain occurs,
I completely agree that on_fail should be "fence".
But what I am concerned with is the failure of a start or monitor operation,
and on_fail="standby" assumes that it is used only for
these operations.
Its purpose is not to move healthy resources before doing STONITH,
but to move all resources away from the node on which a resource has failed.
And for any operation, Bugzilla #1601 doesn't occur because I changed
nothing about STONITH.
STONITH doesn't require any resources to be stopped.

The following is why I put so much weight on start and monitor operations.

What I regard as serious problems are:
   - 1) When a resource fails, only the failed resource
        and the resources in the same group move away from
        the failed node.
        -> At present, to move all resources (even those that are not
           in the group or have no constraints) away from
           the failed node automatically, on_fail has to be set to
           "fence" not only for stop but also for start and monitor,
           and the failed node has to be killed by STONITH.
   - 2) (In connection with 1) When resources are moved away because
        a start or monitor operation failed, they should be shut down
        normally.
        -> This sounds like the obvious behavior, but it is impossible
           with the "fence" setting described in 1).
        -> Of course, I know that the failed node has to be killed
           immediately if a stop operation fails or Split-Brain occurs.
   - 3) Rebooting the failed node may destroy the evidence of
        the real cause of a failure
        (which means administrators can't analyse the failure).
        -> This is as Keisuke-san wrote before.
           It is a really serious matter in enterprise services.

To solve the matters above, I implemented on_fail="standby".
If you have any other ideas to solve them, please let me know.



Just for reference, there is an example in the attached files:
a resource group named "grpPostgreSQLDB", which consists of
IPaddr ("prmIpPostgreSQLDB") and pgsql ("prmApPostgreSQLDB"), is running on
node2.
(See: crm_mon_before.log)
I modified pgsql's stop function to always return $OCF_ERR_GENERIC.
When the IPaddr resource failed, with its monitor's on_fail set to "standby",
pgsql tried to stop but failed.
(See: pe-warn-0.node2.gif)
Then STONITH was executed according to the setting of pgsql's stop 
operation, on_fail="fence".
(See: pe-warn-1.node2.gif and pe-warn-0.node1.gif)
STONITH killed node2 pitilessly, and both resources of the group moved 
to node1 peacefully.
(See: crm_mon_after.log)
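
As a rough sketch of the on_fail settings described in this example (this is
NOT the cib.xml from the attachment), the operations might be configured
along the following lines. The group and resource ids come from the text
above. The op ids, the interval and the timeout values are illustrative
placeholders, the on_fail value for IPaddr's stop operation is an assumption,
and on_fail="standby" is the value proposed in this thread, so it may not be
accepted by a released version.

```xml
<!-- Illustrative sketch only; not the attached cib.xml. -->
<!-- Op ids, interval and timeout values are assumed placeholders. -->
<group id="grpPostgreSQLDB">
  <primitive id="prmIpPostgreSQLDB" class="ocf" provider="heartbeat" type="IPaddr">
    <operations>
      <!-- monitor failure: move all resources off the node (proposed value) -->
      <op id="prmIpPostgreSQLDB-monitor" name="monitor" interval="10s" timeout="60s" on_fail="standby"/>
      <!-- stop failure: assumed to be "fence", which on_fail="standby" presupposes -->
      <op id="prmIpPostgreSQLDB-stop" name="stop" timeout="60s" on_fail="fence"/>
    </operations>
  </primitive>
  <primitive id="prmApPostgreSQLDB" class="ocf" provider="heartbeat" type="pgsql">
    <operations>
      <!-- stop failure: fence the node, as stated in the example above -->
      <op id="prmApPostgreSQLDB-stop" name="stop" timeout="60s" on_fail="fence"/>
    </operations>
  </primitive>
</group>
```

With settings like these, a monitor failure alone would only stop and move
the resources, and STONITH would come into play only if one of the stop
operations also failed, as happened with pgsql in this test.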



Best Regards,
Satomi Taniguchi








Andrew Beekhof wrote:
> 
> On Aug 4, 2008, at 8:11 AM, Satomi Taniguchi wrote:
> 
>> Hi Andrew,
>>
>> Thank you for your opinions!
>> But I'm afraid that you've misunderstood my intentions...
> 
> no, i'm indicating that you've underestimated the scope of the problem
> 
>>
>>
>> Andrew Beekhof wrote:
>> (snip)
>>> Two problems...
>>> The first is that standby happens after the fencing event, so it's 
>>> not really doing anything to migrate the healthy resources.
>>
>> In the graph, the object "stonith-1 stop 0 rh5node1" just means
>> "a plugin named stonith-1 on rh5node1 stops",
>> not "fencing event occurs".
>>
>> For example, Node1 has two resource groups.
>> When a resource in one group fails,
>> all resources in both groups are stopped completely,
>> and the stonith plugin on Node1 is stopped.
>> After this, both resource groups run on Node2.
>> I attached a graph, cib.xml,
>> and crm_mon's logs (before and after a resource broke down).
>> Please see them.
>>
>>
>>> Stop RscZ -(depends on)-> Stop RscY  -(depends on)-> Stonith NodeX  
>>> -(depends on)-> Stop RscZ  -(depends on)-> ...
>> I just want to stop all resources without STONITH when monitor fails;
>> I don't want to change any actions when stop fails.
>> The setting on_fail="standby" is for the start or monitor operation, and
>> it assumes that the stop operation's on_fail is set to
>> "fence".
>> Then, STONITH is not executed when start or monitor fails,
>> but it is executed when stop fails.
>>
>> So, if RscY's monitor operation fails,
>> its stop operation doesn't depend on "Stonith NodeX".
>> And if stopping RscY fails,
>> NodeX is turned off by STONITH, and the loop above does not occur.
>>
>>
>> Best Regards,
>> Satomi Taniguchi
>>
>>
>>
>> _______________________________________________
>> Pacemaker mailing list
>> Pacemaker at clusterlabs.org
>> http://list.clusterlabs.org/mailman/listinfo/pacemaker

-------------- next part --------------
A non-text attachment was scrubbed...
Name: failed_on_stop_op.zip
Type: application/x-zip-compressed
Size: 193036 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20080812/f06c2e4a/attachment-0001.bin>

