[Pacemaker] RFC: What part of the XML configuration do you hate the most?

Tue Jun 24 13:01:19 UTC 2008

Andrew Beekhof <abeekhof at suse.de> writes:

> The changes for Pacemaker 1.0 include an overhaul of configuration
> syntax.
>
> We have a few things in mind for this already, but I'd also like to
> get people's opinion on which parts need the most attention.
>
> looking forward to hearing your answers...

This is a bit late, but I would like to add some comments
about the configuration and about a few features.
They come from questions and requests from our customers and from the field.

1) migration-threshold
   As you've already implemented it, this is one of the most
   requested features from our customers (as Ikeda-san also mentioned
   in the other mail). Thank you for that.

   More precisely, there are two scenarios we would like to be able
   to configure (NG = the monitor operation failed):
   a) monitor NG -> stop -> start on the same node
      -> monitor NG (Nth time) -> stop -> failover to another node
   b) monitor NG -> monitor NG (Nth time) -> stop -> failover to another node

   I think Pacemaker currently behaves as in a), but b) is also
   useful when you want to ignore transient errors.
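
To make the difference between a) and b) concrete, here is a minimal Python sketch of the two policies. It is only an illustration, not Pacemaker code: the function name, the `restart_on_failure` flag, and the rule that a successful monitor resets the counter in b) are assumptions made for this example.

```python
from typing import List


def recovery_actions(results: List[bool], threshold: int,
                     restart_on_failure: bool) -> List[str]:
    """Return the action taken after each monitor result (True = OK, False = NG).

    restart_on_failure=True  models scenario a): every failure below the
        threshold triggers a restart on the same node.
    restart_on_failure=False models scenario b): failures below the
        threshold are ignored, and (assumption) a success resets the count.
    In both cases the Nth failure triggers a failover and ends the sequence.
    """
    actions: List[str] = []
    failures = 0
    for ok in results:
        if ok:
            if not restart_on_failure:
                failures = 0  # assumption: b) counts consecutive failures only
            actions.append("none")
            continue
        failures += 1
        if failures >= threshold:
            actions.append("failover")
            break
        actions.append("restart" if restart_on_failure else "ignore")
    return actions


sequence = [False, True, False, False]  # NG, OK, NG, NG
print(recovery_actions(sequence, 3, restart_on_failure=True))
# ['restart', 'none', 'restart', 'failover']
print(recovery_actions(sequence, 3, restart_on_failure=False))
# ['ignore', 'none', 'ignore', 'ignore']
```

With the same monitor history, a) restarts the resource twice and then fails over, while b) leaves the resource alone because the failures were not consecutive.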

2) auto-failback = on/off
   Another question we are frequently asked is how to configure
   'auto-failback off'.
   Currently we achieve this with 'default-resource-stickiness=INFINITY',
   but it would be great to have a more purpose-oriented parameter
   (one that is easier for users to understand).
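
As a rough illustration of why an infinite stickiness acts as 'auto-failback off', the following Python sketch uses a deliberately simplified placement model: each node has a location score, and the node currently running the resource gets the stickiness added on top. The function name and the scoring rule are assumptions for this example and do not reproduce Pacemaker's actual placement algorithm.

```python
from typing import Dict

INFINITY = float("inf")


def choose_node(location_scores: Dict[str, float], current_node: str,
                stickiness: float) -> str:
    """Pick the node with the highest score (simplified, hypothetical model).

    The current node's score is increased by the stickiness value, so a
    large stickiness keeps the resource where it is.
    """
    def score(node: str) -> float:
        bonus = stickiness if node == current_node else 0
        return location_scores[node] + bonus

    # sorted() makes the choice deterministic when scores are equal
    return max(sorted(location_scores), key=score)


# node1 is preferred, but the resource has failed over to node2
scores = {"node1": 100, "node2": 0}
print(choose_node(scores, current_node="node2", stickiness=0))         # node1
print(choose_node(scores, current_node="node2", stickiness=INFINITY))  # node2
```

With zero stickiness the resource moves back to the preferred node as soon as it returns; with infinite stickiness it stays where it is.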

3) a standard location for the "initial (or bootstrap) cib.xml"
   I have seen many people confused about where to store the cib.xml
   and how to load it at the first boot. As a result, everyone does it
   differently (one may use cibadmin -U, another may place it into
   /var/lib/heartbeat/crm/ by hand, etc.), and the original cib.xml
   often gets lost.

   I think it would be good to have a standard location for
   the initial cib.xml and an official procedure for
   bootstrapping from it.

4) node fencing without the poweroff
   (this is a kind of new feature request)
   Node fencing is simple and good enough in most of our cases, but
   we hesitate to use STONITH (poweroff/reboot) as the first action
   on a failure, because:
   - we want to shut down the services gracefully whenever possible.
   - rebooting the failed node may destroy the evidence of the
     real cause of the failure. We want to preserve it as far as possible
     so that we can investigate later and ensure that all problems are resolved.

   We think that, ideally, when a resource fails the node would
   first try to go into the 'standby' state, and only if that fails
   would it escalate to STONITH to power the node off.
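
The escalation described above can be sketched in a few lines of Python. This is only a model of the requested behavior: `try_standby` and `stonith` are hypothetical callbacks standing in for whatever the cluster would actually do, and treating an exception during standby as a failed standby is an assumption of this example.

```python
from typing import Callable


def fence_node(node: str,
               try_standby: Callable[[str], bool],
               stonith: Callable[[str], None]) -> str:
    """Try a graceful standby first; escalate to STONITH only if it fails.

    Returns the name of the step that finally isolated the node.
    """
    try:
        if try_standby(node):
            return "standby"  # services stopped gracefully, evidence preserved
    except Exception:
        pass  # assumption: an error during standby counts as a failed standby
    stonith(node)  # last resort: power off or reboot the node
    return "stonith"


powered_off = []
print(fence_node("node1", lambda n: True, powered_off.append))   # standby
print(fence_node("node2", lambda n: False, powered_off.append))  # stonith
print(powered_off)                                               # ['node2']
```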

5) STONITH priority
   Another reason why we hesitate to use STONITH is the "cross counter"
   problem when split-brain occurs (both nodes fencing each other at
   the same time).
   It would be great if we could tune it so that the node with resources
   running is the most likely to survive.
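
One possible way to express such a priority is a per-node delay before fencing, so that the node running the most resources shoots first. The Python sketch below only illustrates that idea; the function name, the `step_seconds` parameter, and the tie-break by node name are assumptions for this example, not an existing Pacemaker or STONITH option.

```python
from typing import Dict


def fencing_delays(resource_counts: Dict[str, int],
                   step_seconds: float = 5.0) -> Dict[str, float]:
    """Assign each node a delay before it is allowed to fence its peers.

    The node running the most resources gets no delay; every following
    rank waits step_seconds longer. Ties are broken by node name so that
    two nodes never end up with the same delay.
    """
    ranked = sorted(resource_counts, key=lambda n: (-resource_counts[n], n))
    return {node: rank * step_seconds for rank, node in enumerate(ranked)}


print(fencing_delays({"node1": 3, "node2": 0}))
# {'node1': 0.0, 'node2': 5.0}
```

In a split-brain, node1 would fence node2 immediately, while node2 would still be waiting, so the node with resources running survives.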

6) node fencing when a connectivity failure is detected by pingd
   Currently we have to add pingd constraints for all resources.
   It would simplify the configuration and the recovery operation
   if we could configure the behavior to be the same as for a resource failure.

Regarding 1)-b), 4) and 5), my colleagues and I think that they
are important, and we are now studying how we can implement them.

I hope this helps the evolution of Pacemaker.

Thanks,

Keisuke MORI
NTT DATA Intellilink Corporation