[Pacemaker] Fencing order

Mon Apr 4 22:47:42 CET 2011

Hi,

On Fri, Mar 25, 2011 at 10:58:51AM +0100, Andrew Beekhof wrote:
> On Mon, Mar 21, 2011 at 4:06 PM, Pavel Levshin <pavel at levshin.spb.ru> wrote:
> > Hi.
> >
> > Today, we had a network outage. Quite a few problems suddenly arose in our
> > setup, including a crashed corosync, the known notify bug in the DRBD RA,
> > and a problem with the VirtualDomain RA timing out on stop.
> >
> > But the fencing behaviour was particularly strange.
> >
> > Initially, one node (wapgw1-1) dropped out of the cluster. When the
> > connection was restored, corosync died on that node. It was considered
> > "offline unclean" and was scheduled to be fenced. Fencing by HP iLO did not
> > work (currently, I do not know why). The second-priority fencing method is
> > meatware, and that took time.
> >
> > The second node, wapgw1-2, hit the DRBD notify bug and failed to stop some
> > resources. It was "online unclean". It was also scheduled to be fenced. HP
> > iLO was available for this node, but it was not STONITHed until I
> > manually confirmed STONITH for wapgw1-1.
> >
> > When I confirmed the first node's restart, the second node was fenced
> > automatically.

This is a very unusual case.

> > Is this ordering intended behaviour or a bug?
> 
> A little of both.
> 
> The ordering (in the PE) was added because stonithd wasn't able to
> cope with parallel fencing operations.

The only case in which stonithd may have an issue is when all of
the following hold: there are stonith resource clones, multiple
instances try to reset the same node at the same time, and the
device does not support more than one simultaneous session.
Otherwise, stonithd has no problems with multiple parallel
fencing operations.

> I don't know if this is still the case for stonithd in 1.0.  Perhaps
> Dejan can comment.
> 
> Unfortunately, as you saw, this means that we fence nodes one by one -
> and that if op N fails, we never try op > N.
> 
> Ideally the ordering would be removed, let's see what Dejan has to say.

Yes, this kind of ordering is not necessary. Multiple nodes may
be fenced in parallel.
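
To illustrate the difference, here is a minimal Python sketch. It
is not Pacemaker or stonithd code: fence() and the set of working
devices are hypothetical stand-ins, and the node names are simply
taken from the report above. It only shows why an ordered sequence
stalls on the first failure while independent operations do not.

```python
from concurrent.futures import ThreadPoolExecutor


def fence(node, working_devices):
    """Hypothetical fencing op: succeeds only if the node's device works."""
    return node in working_devices


def fence_serial(nodes, working_devices):
    """Ordered fencing: op N+1 starts only after op N has succeeded."""
    fenced = []
    for node in nodes:
        if not fence(node, working_devices):
            break  # every later node is left unfenced
        fenced.append(node)
    return fenced


def fence_parallel(nodes, working_devices):
    """Unordered fencing: each op is attempted independently."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        results = list(pool.map(lambda n: fence(n, working_devices), nodes))
    return [n for n, ok in zip(nodes, results) if ok]


nodes = ["wapgw1-1", "wapgw1-2"]
working = {"wapgw1-2"}  # assumption: iLO fails for wapgw1-1 only

serial_result = fence_serial(nodes, working)
parallel_result = fence_parallel(nodes, working)
print("serial:  ", serial_result)
print("parallel:", parallel_result)
```

With the ordering in place, the failed operation for wapgw1-1
blocks wapgw1-2 as well. Without it, wapgw1-2 is fenced even
though wapgw1-1 still waits for manual confirmation.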

Thanks,

Dejan

> >
> > It's pacemaker 1.0.10, corosync 1.2.7. Three-node cluster.
> >
> >
> > --
> > Pavel Levshin
> >
> >
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
> >