[Pacemaker] STONITH Deathmatch Explained
Dejan Muhamedagic
dejanmm at fastmail.fm
Thu May 14 09:32:09 UTC 2009
Hi,
On Thu, May 14, 2009 at 06:32:00PM +1000, Tim Serong wrote:
> Greetings,
>
> I've written up a brief document entitled "STONITH Deathmatch Explained
> (and Some Hints for Resource Agent Authors and Systems Engineers)":
>
> http://ourobengr.com/ha
>
> It's a description of causes of STONITH deathmatch in
> Heartbeat/Pacemaker HA clusters, where two nodes continually shoot each
> other, thus rendering the system less available than a non-HA system
> would be.
>
> Hopefully publishing this will save at least a few people from some of
> the pain myself and a couple of others experienced last year, in
> particular when trying to debug resource agents that were misbehaving in
> unexpected ways.
>
> Comments, feedback, etc. welcome.
Great document! A very funny illustration too :)
Just a few remarks:
- in "Causes ..." you didn't mention split-brain (no
communication channels working); that would also be the place to
stress how important it is to have redundant communications :)
- even though you mention that in the title, I'd still move the
resource agent intricacies into another document; they are all
very valuable advice, but of no concern to cluster
administrators; it's also good to keep the focus on our little
problem; then you'll have to find other "Things You Didn't
Think Of" (or just keep the title and leave the section empty:
it is important; or insert another illustration)
- devote more space/thought to the part on how to avoid a
"deathmatch"; there's only a mention of chkconfig within
"Debugging ..." (one can also use the "poweroff" fencing
operation); also, note that this occurs only in cases where a
reboot doesn't fix the problem (e.g. split-brain)
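To illustrate the last remark, here is a toy Python sketch of the loop. It is not Pacemaker code: the function `simulate` and its parameters (`link_up`, `fence_action`, `autostart`, `max_rounds`) are made up for illustration, and the model assumes the nodes take turns shooting, which is a simplification.

```python
def simulate(link_up, fence_action, autostart, max_rounds=5):
    """Toy two-node model; returns how many fencing events happen.

    link_up      -- True if the nodes can communicate (no split-brain)
    fence_action -- "reboot" or "poweroff" (hypothetical labels)
    autostart    -- True if the cluster stack starts at boot
    max_rounds   -- cap so an endless deathmatch still terminates here
    """
    running = {"a": True, "b": True}  # is the cluster stack running?
    shooter, target = "a", "b"
    shots = 0
    for _ in range(max_rounds):
        # A node only shoots if it runs the cluster stack and
        # cannot see its peer.
        if link_up or not running[shooter]:
            break
        shots += 1
        if fence_action == "poweroff":
            running[target] = False
            break  # the target stays down, nobody shoots back
        # Reboot: the target comes back and rejoins only if the
        # cluster stack is started at boot.
        running[target] = autostart
        # The communication problem is still there, so the rebooted
        # node now takes its turn at shooting.
        shooter, target = target, shooter
    return shots


if __name__ == "__main__":
    print(simulate(link_up=False, fence_action="reboot", autostart=True))
    print(simulate(link_up=False, fence_action="reboot", autostart=False))
    print(simulate(link_up=False, fence_action="poweroff", autostart=True))
```

In this model the shooting only continues when the link stays down, the fencing action is a reboot, and the cluster stack starts at boot. Setting `autostart=False` stands for disabling the cluster init script (e.g. with chkconfig), and `fence_action="poweroff"` stands for the "poweroff" fencing operation; either one ends the exchange after a single shot.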
Thanks,
Dejan
> Thanks,
>
> Tim
>
>
> _______________________________________________
> Pacemaker mailing list
> Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker