[Pacemaker] Fencing order
andrew at beekhof.net
Fri Mar 25 05:58:51 EDT 2011
On Mon, Mar 21, 2011 at 4:06 PM, Pavel Levshin <pavel at levshin.spb.ru> wrote:
> Today, we had a network outage. Quite a few problems suddenly arose in our
> setup, including a crashed corosync, the known notify bug in the DRBD RA, and a
> problem with a VirtualDomain RA timeout on stop.
> But particularly strange was fencing behaviour.
> Initially, one node (wapgw1-1) parted from the cluster. When the connection
> was restored, corosync died on that node. It was considered "offline
> unclean" and was scheduled to be fenced. Fencing by HP iLO did not work
> (currently, I do not know why). The second-priority fencing method is meatware,
> and that did take time.
> The second node, wapgw1-2, hit the DRBD notify bug and failed to stop some
> resources. It was "online unclean" and was also scheduled to be fenced. HP
> iLO was available for this node, but it was not STONITHed until I
> manually confirmed the STONITH for wapgw1-1.
> Once I confirmed the first node's restart, the second node was fenced automatically.
> Is this ordering intended behaviour or a bug?
A little of both.
The ordering (in the PE) was added because stonithd wasn't able to
cope with parallel fencing operations.
I don't know if this is still the case for stonithd in 1.0. Perhaps
Dejan can comment.
Unfortunately, as you saw, this means that we fence nodes one by one -
and that if operation N fails, we never try operation N+1 and beyond.
Ideally the ordering would be removed; let's see what Dejan has to say.
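For readers trying to reproduce the setup described above, a two-tier fencing configuration like the poster's (HP iLO first, meatware as the manual fallback) might look roughly like this with the crm shell. This is only a sketch: the hostnames come from the thread, but the iLO address, credentials, and parameter values are placeholders, not the poster's actual configuration.

```shell
# Primary fencing method: HP iLO via the cluster-glue riloe plugin.
# All ilo_* values below are placeholders.
crm configure primitive st-ilo-wapgw1-1 stonith:external/riloe \
    params hostlist="wapgw1-1" ilo_hostname="192.0.2.11" \
           ilo_user="admin" ilo_password="secret"

# Fallback fencing method: meatware, which does nothing itself and
# waits for a human operator to power-cycle the node by hand.
crm configure primitive st-meat stonith:meatware \
    params hostlist="wapgw1-1 wapgw1-2"

# After manually resetting the node, the operator confirms the fence
# with meatclient (shipped with cluster-glue), which is the manual
# confirmation step mentioned in the report:
meatclient -c wapgw1-1
```

Until that `meatclient -c` confirmation arrives, the pending meatware operation blocks, which is why the second node's (otherwise working) iLO fencing was held up by the ordering discussed above.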
> It's pacemaker 1.0.10, corosync 1.2.7. Three-node cluster.
> Pavel Levshin
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf