[Pacemaker] RFC: What part of the XML configuration do you hate the most?

Tue Jul 15 06:22:01 UTC 2008

On 2008-07-11T14:33:34, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:

> It is certainly a bug, but not in monitor. Recall that all the
> complexity is typically elsewhere, not in the stonith plugin.
> Often it's just that the device isn't robust enough and simply
> fails once every few hundred calls for whatever reason.

I'd argue that this ought to be handled within the plugin / agent, but
this might not be the most effective way, yes.

But the issue is that these spurious failures (which I've never seen in
practice ;-) are quite likely to be timeouts, which we can't simply
"ignore": we can't issue a new monitor request until the failed one has
been cleared.

So I'd think that indeed a stop-start of the agent might be the best way
to react.

If the error is transient and only occurs once every quarter or so, the
last failure will have expired by then, and if failure-stickiness /
migration-threshold is set to 2, the stonith agent will simply and
quickly be restarted.
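
As a rough illustration of that reasoning, here is a toy model in Python.
It is only a sketch and not how the CRM actually implements failure
counting: the FailureTracker class, the one-hour failure timeout and the
"restart" / "move" return values are all assumptions made up for this
example. Only the threshold of 2 and the "once every quarter" spacing
come from the paragraph above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FailureTracker:
    """Toy model of failure counting with expiry; all names are hypothetical."""
    migration_threshold: int = 2      # failures tolerated before moving away
    failure_timeout: float = 3600.0   # seconds until a failure expires (assumed value)
    failures: List[float] = field(default_factory=list)

    def record_failure(self, now: float) -> str:
        # Forget failures that are older than the timeout.
        self.failures = [t for t in self.failures
                         if now - t < self.failure_timeout]
        self.failures.append(now)
        # Below the threshold: stop-start in place. At the threshold: move away.
        if len(self.failures) >= self.migration_threshold:
            return "move"
        return "restart"


QUARTER = 90 * 24 * 3600.0  # roughly one quarter, in seconds

tracker = FailureTracker()
first = tracker.record_failure(0.0)             # isolated failure
second = tracker.record_failure(QUARTER)        # the first one has expired by now
burst = tracker.record_failure(QUARTER + 60.0)  # two failures within the timeout
```

In this model the two failures a quarter apart each result in a plain
restart, and only the third one, arriving a minute after the second,
reaches the threshold.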

Seems quite fine to me.

No cluster software I have worked with in the last 10 years has needed
functionality to ignore errors. Not even ldirectord has that. I admit
that 10 years is not that much experience, but I'm open to learning new
tricks ;-)

Regards,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde