[Pacemaker] RFC: What part of the XML configuration do you hate the most?

Tue Jun 24 14:26:14 UTC 2008

On Tue, Jun 24, 2008 at 04:02:06PM +0200, Lars Marowsky-Bree wrote:
> On 2008-06-24T15:48:12, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:
> 
> > >    More precisely, there are two scenarios we want to be able to configure:
> > >    a) monitor NG -> stop -> start on the same node
> > >       -> monitor NG (Nth time) -> stop -> failover to another node
> > >    b) monitor NG -> monitor NG (Nth time) -> stop -> failover to another node
> > > 
> > >    The current pacemaker behaves as a), I think, but b) is also
> > >    useful when you want to ignore a transient error.
> > 
> > The b) part has already been discussed on the list and it's
> > supposed to be implemented in lrmd. I still don't have the API
> > defined, but thought about something like
> > 
> > 	max-total-failures (how many times a monitor may fail)
> > 	max-consecutive-failures (how many times in a row a monitor may fail)
> > 
> > These should probably be attributes defined on the monitor
> > operation level.
> 
> The "ignore failure reports" clashes a bit with the "react to failures
> ASAP" requirement.
> 
> It is my belief that this should be handled by the RA, not in the LRM
> nor the CRM. The monitor op implementation is the place to handle this.
> 
> Beyond that, I strongly feel that "transient errors" are a bad
> foundation to build clusters on.

Of course, all that is right. However, there are some situations
where we could bend the rules. I'm not sure what Keisuke-san had
in mind, but for example one could be more forgiving when
monitoring certain stonith resources.
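
To make the two thresholds quoted above concrete, here is a minimal
Python sketch of how they might be counted. It is only an illustration:
the class name, the record() method, and the rule that a failure is
reported once either limit is exceeded are assumptions, not a defined
lrmd API.

```python
from dataclasses import dataclass


@dataclass
class MonitorFailurePolicy:
    """Hypothetical tracker for the two proposed monitor thresholds."""

    max_total_failures: int          # how many times a monitor may fail
    max_consecutive_failures: int    # how many times in a row it may fail
    total: int = 0
    consecutive: int = 0

    def record(self, ok: bool) -> bool:
        """Record one monitor result.

        Returns True when the failure should be reported upwards
        (assumed here to mean: one of the limits has been exceeded).
        """
        if ok:
            # A successful monitor resets only the consecutive counter.
            self.consecutive = 0
            return False
        self.total += 1
        self.consecutive += 1
        return (self.total > self.max_total_failures
                or self.consecutive > self.max_consecutive_failures)


policy = MonitorFailurePolicy(max_total_failures=5, max_consecutive_failures=2)
# Sequence of monitor results: fail, ok, fail, fail, fail
reports = [policy.record(ok) for ok in (False, True, False, False, False)]
print(reports)
```

With these assumed limits, the isolated first failure and the next two
failures in a row are tolerated; only the third consecutive failure is
reported, which is roughly scenario b) above.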

Thanks,

Dejan