[Pacemaker] Pacemaker on OpenAIS, RRP, and link failure

Lars Marowsky-Bree lmb at suse.de
Thu Jun 4 12:05:23 EDT 2009


On 2009-05-25T18:10:32, Florian Haas <florian.haas at linbit.com> wrote:

> I've repeatedly told customers that NIC bonding is not a valid
> substitute for redundant Heartbeat links, I will stubbornly insist it
> isn't one for OpenAIS RRP links either.

I think your stubbornness is misguided, actually. I had a similar
initial reaction when I first looked at this, but I ended up
recommending bonding, because it turned out to be preferable.

The downside of RRP, as mentioned on IRC, is that it is "only"
available to OpenAIS clients. The DLM, drbd, and other software,
however, open independent TCP connections, and then there is the
server-client connectivity; all of these only gain redundancy if
bonding is used.

> Some reasons:

These reasons are all technically valid, but I don't think they outweigh
the benefit from getting redundancy for all cluster communications.

> - You're not protected against bugs, currently known or unknown, in the
> bonding driver. If bonding itself breaks, you're screwed.

The same is true for bugs in the network stack in general.

> - Most people actually run bonding over interfaces of the same make,
> model, and chipset. That's not necessarily optimal, but it's a reality.
> Thus, if your driver breaks, you're screwed again. Granted, this is
> probably true if you ran two RRP links in that same configuration too.

Exactly.

Some of this can be offset by at least running different NICs in
different nodes, which mitigates the problem at the cluster level: a
driver bug may then take down a single node, but not all of them at
once.

> - Finally, you can't bond between a switched and a direct back-to-back
> connection, which makes bonding entirely unsuitable for the redundant
> links use case I described earlier.

Yes, bonding has a different deployment mode than the scenario you
described. On the other hand, modifying the deployment scenario would
give you more redundancy even for the replication, which has benefits
too.

> That I fully agree with. The question is what "working properly" means
> in this case -- should it be capable of auto-recovery, or should it not?

Although, as argued above, I'd nowadays design my clusters with
bonding in mind, I of course agree that RRP _should_ work.

Just like drbd/DLM/etc should work with SCTP to make use of the
redundant, un-bonded links.

But for the time being, I think bonded NICs are overall the best
solution.
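
If you do go with bonding, a failed slave is easy to miss, because the
bond itself stays up. As a rough sketch (not anything Pacemaker or
OpenAIS ships), the following parses the status text the Linux bonding
driver exposes under /proc/net/bonding/<bond> and lists slaves whose
MII status is not "up". It assumes the usual "Slave Interface:" and
"MII Status:" lines; the function names and the default "bond0" are
purely illustrative.

```python
from pathlib import Path


def parse_bond_status(text):
    """Map each slave interface name to its reported MII status.

    Assumes the status layout of the Linux bonding driver, where every
    "Slave Interface: <name>" line is followed by that slave's
    "MII Status: <state>" line.
    """
    slaves = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without a "key: value" shape
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            # The bond-level "MII Status" line appears before any slave
            # section, so it is skipped while current is still None.
            slaves[current] = value
    return slaves


def failed_slaves(text):
    """Return the sorted names of slaves whose MII status is not 'up'."""
    return sorted(
        name for name, status in parse_bond_status(text).items()
        if status != "up"
    )


def read_bond_status(bond="bond0"):
    """Read the status text for a bond (hypothetical default name)."""
    return (Path("/proc/net/bonding") / bond).read_text()
```

On a node with a bond called bond0, failed_slaves(read_bond_status())
would return the names of the slaves that have lost link, which could
then be fed into whatever monitoring is already in place.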


Regards,
    Lars

-- 
SuSE Labs, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde