[Pacemaker] Human confirmation of dead node?

Lars Ellenberg lars.ellenberg at linbit.com
Fri Oct 16 09:26:49 EDT 2009


On Tue, Oct 13, 2009 at 06:43:53PM +0200, Dejan Muhamedagic wrote:
> Hi,
> 
> On Tue, Oct 13, 2009 at 05:57:25PM +0200, J Brack wrote:
> > On 10/13/09, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:
> > > Hi,
> > >
> > > On Tue, Oct 13, 2009 at 03:23:11PM +0200, J Brack wrote:
> > >> Hi,
> > >>
> > >> I'm currently using heartbeat. I heard that I'm meant to be using
> > >> pacemaker. I will switch in a heartbeat (sorry) if I can get pacemaker
> > >> to do what I need.
> > >
> > > http://clusterlabs.org/wiki/Project_History
> > >
> > >> I have a clustered nfs server, primary is in datacenter1 close to the
> > >> users, secondary is in datacenter2 not close to the users. There is
> > >> only an ethernet connection between the two data centers.
> > >>
> > >> In the event of a failure of the primary in datacenter1 (or of
> > >> datacenter1 itself), I would like to switch to the secondary in
> > >> datacenter2. The catch? I want a human to confirm that the primary is
> > >> really dead.
> > >>
> > >> My current heartbeat setup uses meatclient to confirm that a node has
> > >> been reset. This happens to do the same thing as confirming primary is
> > >> really dead for when primary's hardware dies - but for a network
> > >> outage I see the service bounce between the servers after the network
> > >> comes back up again. This is not ideal. I'm kind of hoping that
> > >> pacemaker can handle this more gracefully.
> > >
> > > It can't. The meatware/meatclient combination replaces a fencing
> > > operation. It is even expected that the node fenced is going to
> > > come up after a while.
> > >
> > >> Can pacemaker be configured to allow manual (human) confirmation that
> > >> the primary node is dead before ever switching services? (i.e. require
> > >> human confirmation for all cases when it cannot talk to the other
> > >> node).
> > >
> > > If your network goes yo-yo, the cluster will follow. The only
> > > way is to remove a node from the configuration or put it into
> > > standby.
> > 
> > What is the reasoning for this though?
> 
> Well, how else would you have it work? The point is that as soon
> as there is network connectivity the nodes will try to reform a
> cluster.
> 
> > Here I have pri and sec, both with meatware.
> > 
> > My expectation:
> > Network dies, pri stays primary, sec waits for confirmation that pri
> > is dead. It never gets it.
> > Network comes back, sec sees pri is primary. All is well with the world.
> > 
> > What really happens:
> > Same, but when the network comes back, sec gets pri's resources, then
> > pri gets them back again.
> > 
> > This seems wrong.
> 
> Indeed. That shouldn't happen. If it does, please file a
> bugzilla.

The reason:

 both start "meatware" stonith,
 neither gets confirmation.

 the network comes back,
 both see "dead node xyz returning after partition",
 and react by restarting the cluster software.

 the "sec", however, defers that cluster restart
 until "current resource activity" has settled down,
 because it had been in the process of taking over,
 only waiting for the confirmation.

 so the "pri" restarts itself, stopping all services along the way and
 informing the "sec" about that (which counts as confirmation),
 so the "sec" starts all services now.  Once that is done,
 the deferred restart on the "sec" takes place,
 and the services flip back to the "pri".

That, at least, is how "heartbeat" handles
a "dead node returning after partition".

And I don't think it has any choice.
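
For context, the kind of heartbeat v1 / meatware setup discussed here
would look roughly like the sketch below; the node names "pri" and "sec"
come from this thread, the IP address and service names are made up:

    # /etc/ha.d/ha.cf (excerpt)
    node pri
    node sec
    # "meatware" as the stonith device: fencing only "succeeds" once a
    # human confirms it via meatclient
    stonith_host * meatware pri sec

    # /etc/ha.d/haresources -- pri is the preferred owner
    pri IPaddr::192.168.1.10 nfs-kernel-server

    # operator confirmation, after pri really has been powered off:
    meatclient -c pri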

Pacemaker may have much more information at hand,
so it could potentially handle this better,
especially since one node does not run any resources.

I don't know, I did not try.


A hack that would even work in very old haresources clusters
would be to have a first resource that blocks indefinitely
until confirmation, and have all other resources depend on it.
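
A minimal sketch of such a blocking resource, written as an
haresources-style script; the script name, flag file and the resources
listed behind it are made up for illustration:

    #!/bin/sh
    # /etc/ha.d/resource.d/confirm  (hypothetical)
    # "start" blocks until an operator confirms the takeover by creating
    # a flag file, e.g. with:  touch /var/run/ha-takeover-confirmed
    FLAG=/var/run/ha-takeover-confirmed
    case "$1" in
      start)
        while [ ! -e "$FLAG" ]; do
          sleep 10
        done
        ;;
      stop)
        rm -f "$FLAG"
        ;;
      status)
        if [ -e "$FLAG" ]; then echo running; else echo stopped; fi
        ;;
    esac
    exit 0

In /etc/ha.d/haresources it would simply be listed first, since
resources are started left to right:

    pri confirm IPaddr::192.168.1.10 nfs-kernel-server

Note that with this hack the preferred node also blocks on boot until
someone (or something) creates the flag file there.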


A much better approach with pacemaker would be to
simply get rid of the meatware stuff
and always have one node sitting in "standby".

The "confirmation" would then be to set it active ;)
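
With the crm shell that amounts to something like the following
(node name "sec" as in your example):

    # park the secondary: it stays in the cluster but runs no resources
    crm node standby sec

    # the human "confirmation", once pri is known to be dead:
    crm node online sec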


You can probably achieve about the same level of control
by not even using any cluster software (apart from
data replication, probably),
but using good old runlevels.

Boot into runlevel 3 by default (equivalent to standby),
"init 4" to activate,
"init 3" to deactivate (for clean switchover).
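
On a SysV-init system, a rough sketch (the service name and the
chkconfig tool are just one way to do it):

    # /etc/inittab -- come up "passive" by default
    id:3:initdefault:

    # register the NFS service (and whatever it depends on) in
    # runlevel 4 only, e.g. on a chkconfig-based distribution:
    chkconfig --level 3 nfs off
    chkconfig --level 4 nfs on

    # operator commands:
    init 4    # activate:   runlevel 4 starts the services
    init 3    # deactivate: back to runlevel 3 stops them again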


-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.