[Pacemaker] Pacemaker on OpenAIS, RRP, and link failure
Andrew Beekhof
andrew at beekhof.net
Mon May 25 15:45:14 UTC 2009
On Mon, May 25, 2009 at 5:08 PM, Florian Haas <florian at linbit.com> wrote:
> Hello everyone,
>
> I realize this is primarily an OpenAIS issue, but let's discuss it here
> anyway to share some thoughts.
>
> In Heartbeat-based clusters, we've always advised customers to use
> redundant network communication links. Given the fact that most of the
> clusters we build are DRBD based, we practically always have a second
> network link (the dedicated DRBD replication link) available for this
> purpose. In Heartbeat, when links get interrupted it's actually somewhat
> nontrivial to notice (which sucks), but links recover automatically when
> they are re-established (which is good).
>
> Now in OpenAIS, when we configure RRP and a link breaks, OpenAIS
> complains very loudly (which is good), but eventually the link settles
> in a faulty state from which it can only be re-enabled using
> "openais-cfgtool -r". Clearly this breaks the concept of a self-healing
> system.
>
> This discussion has been had before over on the openais list
> (http://www.mail-archive.com/openais@lists.linux-foundation.org/msg01205.html),
> but AFAICS it hasn't come to any reasonable conclusion. So my question
> is, what is the best practice for redundant network setups that should
> be included in the Pacemaker docs?

SUSE currently recommends NIC bonding.
We have not been able to get satisfactory behavior from clusters using RRP.

> 1. Set rrp_problem_count_timeout and/or rrp_problem_count_threshold
> ridiculously high so the ring status never goes to faulty. (It seems
> that RRP "problem counting" can't be disabled altogether).
>
> 2. Have package maintainers include some magic that does
> "openais-cfgtool -r" every time a network link changes its status to UP
> (where the network management subsystem permits this).
>
> 3. Instruct users to install cron jobs that do "openais-cfgtool -r" at
> specified intervals, causing OpenAIS to re-check the link status
> periodically.

You could add it to the DRBD monitor action, I guess, but that does seem
sub-optimal.
I think the best solution is to work with upstream to get the feature
working properly.
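
For illustration only, the periodic re-enable described in options 2 and 3
(or a call made from a monitor action) could be sketched roughly as below.
The only thing taken from this thread is the "openais-cfgtool -r" command;
the function name, the injectable runner and the 10-second timeout are
assumptions made for the example, not part of OpenAIS or Pacemaker.

```python
import subprocess

# Command quoted in this thread for re-enabling a ring marked faulty.
RESET_CMD = ["openais-cfgtool", "-r"]


def reenable_rings(cmd=RESET_CMD, runner=subprocess.run):
    """Ask OpenAIS to re-check and re-enable redundant rings.

    Returns True if the tool exits with status 0, and False if it fails,
    times out, or is not installed. `runner` is injectable (hypothetical
    hook) so the function can be exercised without a cluster.
    """
    try:
        result = runner(list(cmd), capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, not executable, or hung: report failure, don't raise.
        return False
    return result.returncode == 0
```

A cron job or a link-up hook would simply call reenable_rings() and log a
False result. This only papers over the problem; it does not make RRP
recover on its own.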
>
> 4. Something else I haven't thought about.
>
> Thoughts? Comments?
>
> Cheers,
> Florian
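
As a rough illustration of option 1 above, the relevant part of
openais.conf might look something like the fragment below. The network
addresses and the threshold value are placeholders chosen for the example
and are untested; check openais.conf(5) for the valid ranges and for any
constraints between the RRP timers before using anything like this.

```
totem {
    version: 2
    # Redundant ring mode ("active" or "passive").
    rrp_mode: passive
    # Deliberately large, illustrative value so the ring is very slow to be
    # marked faulty. rrp_problem_count_timeout could be tuned similarly.
    rrp_problem_count_threshold: 1000

    interface {
        # Primary cluster network (placeholder address).
        ringnumber: 0
        bindnetaddr: 192.168.1.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
    interface {
        # Second ring, e.g. the DRBD replication link (placeholder address).
        ringnumber: 1
        bindnetaddr: 10.0.0.0
        mcastaddr: 226.94.1.2
        mcastport: 5405
    }
}
```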