[Pacemaker] Pacemaker on OpenAIS, RRP, and link failure

Mon May 25 11:08:56 EDT 2009

Hello everyone,

I realize this is primarily an OpenAIS issue, but let's discuss it here
anyway to share some thoughts.

In Heartbeat-based clusters, we've always advised customers to use
redundant network communication links. Since most of the clusters we
build are DRBD-based, we practically always have a second
network link (the dedicated DRBD replication link) available for this
purpose. In Heartbeat, when links get interrupted it's actually somewhat
nontrivial to notice (which sucks), but links recover automatically when
they are re-established (which is good).

Now in OpenAIS, when we configure RRP and a link breaks, OpenAIS
complains very loudly (which is good), but eventually the link settles
into a faulty state from which it can only be re-enabled using
"openais-cfgtool -r". Clearly this breaks the concept of a self-healing
system.

This discussion has been had before over on the openais list
(http://www.mail-archive.com/openais@lists.linux-foundation.org/msg01205.html),
but AFAICS it hasn't come to any reasonable conclusion. So my question
is, what is the best practice for redundant network setups that should
be included in the Pacemaker docs?

1. Set rrp_problem_count_timeout and/or rrp_problem_count_threshold
ridiculously high so the ring status never goes to faulty. (It seems
that RRP "problem counting" can't be disabled altogether.)

2. Have package maintainers include some magic that does
"openais-cfgtool -r" every time a network link changes its status to UP
(where the network management subsystem permits this).

3. Instruct users to install cron jobs that do "openais-cfgtool -r" at
specified intervals, causing OpenAIS to re-check the link status
periodically.

4. Something else I haven't thought about.
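
To make options 2 and 3 a bit more concrete, both could share a small
wrapper along the following lines. This is only a rough sketch: the
function name reenable_rings and the CFGTOOL override variable are made
up for illustration, and the only OpenAIS command it relies on is the
"openais-cfgtool -r" call mentioned above.

```shell
#!/bin/sh
# Hypothetical helper for options 2 and 3: ask OpenAIS to re-enable
# rings that have been marked faulty.
# CFGTOOL is an assumed override so the tool name or path can be changed
# (or stubbed out for a dry run); it defaults to the command used above.
CFGTOOL="${CFGTOOL:-openais-cfgtool}"

reenable_rings() {
    # Fail clearly if the tool is not installed on this node.
    if ! command -v "$CFGTOOL" >/dev/null 2>&1; then
        echo "reenable_rings: $CFGTOOL not found" >&2
        return 127
    fi
    # Reset the redundant ring state, as one would do by hand.
    "$CFGTOOL" -r
}
```

A cron job (option 3) or a link-up hook (option 2) would source this
file and call reenable_rings. How often it is safe to run, and what
happens when it resets a ring whose link is still down, would need to
be checked against the OpenAIS version in use.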

Thoughts? Comments?

Cheers,
Florian