[Pacemaker] STONITH Deathmatch Explained

Joe Armstrong jarmstrong at postpath.com
Thu May 14 15:59:54 UTC 2009


>On Thu, May 14, 2009 at 06:32:00PM +1000, Tim Serong wrote:
>> Greetings,
>> 
>> I've written up a brief document entitled "STONITH Deathmatch Explained
>> (and Some Hints for Resource Agent Authors and Systems Engineers)":
>> 
>>   http://ourobengr.com/ha
>> 
>> It's a description of causes of STONITH deathmatch in
>> Heartbeat/Pacemaker HA clusters, where two nodes continually shoot each
>> other, thus rendering the system less available than a non-HA system
>> would be.
>> 
>> Hopefully publishing this will save at least a few people from some of
>> the pain that a couple of others and I experienced last year, in
>> particular when trying to debug resource agents that were misbehaving in
>> unexpected ways.
>> 
>> Comments, feedback, etc. welcome.
>
>Great document! A very funny illustration too :)
>
>Just a few remarks:
>
>- in "Causes ..." you forgot to mention split-brain (no
>  communication channels working) and, at the same time, to
>  stress how important it is to have redundant communications :)
>
>- even though you mention that in the title, I'd still move the
>  resource agent intricacies into another document; they are all
>  very valuable advice, but of no concern to cluster
>  administrators; it's also good to keep the focus on our little
>  problem; then you'll have to find other "Things You Didn't
>  Think Of" (or just keep the title and leave the section empty:
>  it is important; or insert another illustration)
>
>- devote more space/thought to the part on how to avoid a
>  "deathmatch"; there's only a mention of chkconfig within
>  "Debugging ..." (or one can also use the "poweroff" fencing
>  operation); also, note that this occurs only in cases where a
>  reboot doesn't fix the problem (e.g. split-brain)
>
>Thanks,
>
>Dejan


I agree - a nice read.  You might also want to add one way to avoid the situation: don't allow heartbeat to be started by the RC scripts.  Once a machine has been STONITH'd, you can consider it untrustworthy until the admin has inspected the reason for the failure and manually allowed the node back into the cluster.  This same thinking is why I hate auto-failback...
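
As a rough illustration of why keeping heartbeat out of the RC scripts breaks the loop, here is a toy Python model. It is only a sketch, not anything heartbeat or Pacemaker actually runs: the node names and the function simulate_split_brain are made up for the example, and it assumes a two-node cluster in which every communication link is down and a reboot does not bring the links back.

```python
from typing import List, Tuple


def simulate_split_brain(autostart: bool, max_rounds: int = 6) -> List[Tuple[str, str]]:
    """Toy model: return the (shooter, target) pairs for each STONITH operation.

    autostart=True  -> heartbeat is started by the RC scripts after a reboot.
    autostart=False -> a fenced node stays out of the cluster until an admin
                       starts heartbeat by hand.
    """
    # Hypothetical node names; node-a is assumed to win the first fencing race.
    shooter, target = "node-a", "node-b"
    fences = []
    for _ in range(max_rounds):
        # The shooter cannot see its peer, so it fences (reboots) it.
        fences.append((shooter, target))
        if not autostart:
            # The rebooted node never rejoins on its own, so nobody shoots back.
            break
        # The rebooted node rejoins, still cannot see its peer, and fences it
        # in turn: the roles swap and the deathmatch continues.
        shooter, target = target, shooter
    return fences


if __name__ == "__main__":
    print("autostart on :", simulate_split_brain(autostart=True))
    print("autostart off:", simulate_split_brain(autostart=False))
```

With autostart on, the model keeps alternating shooter and target until it runs out of rounds; with autostart off, there is exactly one fence and the surviving node keeps running until someone looks at the problem.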

Joe



