[Pacemaker] Two node cluster and no hardware device for stonith.

Tue Feb 10 09:58:57 EST 2015

On Mon, Feb 09, 2015 at 04:41:19PM +0100, Lars Ellenberg wrote:
> On Fri, Feb 06, 2015 at 04:15:44PM +0100, Dejan Muhamedagic wrote:
> > Hi,
> > 
> > On Thu, Feb 05, 2015 at 09:18:50AM +0100, Digimer wrote:
> > > That is the problem that makes geo-clustering very hard to nearly
> > > impossible. You can look at the Booth option for pacemaker, but that
> > > requires two (or more) full clusters, plus an arbitrator 3rd
> > 
> > A full cluster can consist of one node only. Hence, it is
> > possible to have a kind of stretch two-node [multi-site] cluster
> > based on tickets and managed by booth.
> 
> In theory.
> 
> In practice, we rely on "proper behaviour" of "the other site",
> in case a ticket is revoked, or cannot be renewed.
> 
> Relying on a single node for "proper behaviour" does not inspire
> as much confidence as relying on a multi-node HA-cluster at each site,
> which we can expect to ensure internal fencing.
> 
> With reliable hardware watchdogs, it still should be ok to do
> "stretched two node HA clusters" in a reliable way.
> 
> Be generous with timeouts.

As always.

> And document which failure modes you expect to handle,
> and how to deal with the worst-case scenarios if you end up with some
> failure case that you are not equipped to handle properly.
> 
> There are deployments which favor
> "rather online with _potential_ split brain" over
> "rather offline just in case".

There's an arbitrator which should help in case of split brain.

> Document this, print it out on paper,
> 
>    "I am aware that this may lead to lost transactions,
>    data divergence, data corruption, or data loss.
>    I am personally willing to take the blame,
>    and live with the consequences."
> 
> Have some "boss" sign that ^^^
> in the real world using a real pen.

Well, of course running such a "stretch" cluster would be
rather different from a "normal" one.

The essential thing is that there's no fencing, unless configured
as a dead-man switch for the ticket. Given that booth has a
"sanity" program hook, maybe that could be utilized to verify if
this side of the cluster is healthy enough.

Thanks,

Dejan

> 	Lars
> 
> -- 
> : Lars Ellenberg
> : http://www.LINBIT.com | Your Way to High Availability
> : DRBD, Linux-HA  and  Pacemaker support and consulting
> 
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org