[Pacemaker] Fencing order
andrew at beekhof.net
Fri Mar 25 05:58:51 EDT 2011
On Mon, Mar 21, 2011 at 4:06 PM, Pavel Levshin <pavel at levshin.spb.ru> wrote:
> Today, we had a network outage. Quite a few problems suddenly arose in our
> setup, including a crashed corosync, the known notify bug in the DRBD RA, and a
> problem with a VirtualDomain RA timeout on stop.
> But particularly strange was fencing behaviour.
> Initially, one node (wapgw1-1) parted from the cluster. When the connection
> was restored, corosync died on that node. It was considered "offline
> unclean" and was scheduled to be fenced. Fencing by HP iLO did not work
> (currently, I do not know why). The second-priority fencing method is meatware,
> and that did take time.
> The second node, wapgw1-2, hit the DRBD notify bug and failed to stop some
> resources. It was "online unclean" and was also scheduled to be fenced. HP
> iLO was available for this node, but it was not STONITHed until I
> manually confirmed the STONITH for wapgw1-1.
> Once I confirmed the first node's restart, the second node was fenced automatically.
> Is this ordering intended behaviour or a bug?
A little of both.
The ordering (in the PE) was added because stonithd wasn't able to
cope with parallel fencing operations.
I don't know if this is still the case for stonithd in 1.0. Perhaps
Dejan can comment.
Unfortunately, as you saw, this means that we fence nodes one by one -
and that if operation N fails, we never try operation N+1 and beyond.
Ideally the ordering would be removed; let's see what Dejan has to say.
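For readers trying to reproduce the setup described above, a two-tier fencing configuration like the poster's (HP iLO first, meatware as the manual fallback) might look roughly like this with the crm shell. This is only a sketch: the hostnames come from the thread, but the iLO address, credentials, and parameter values are placeholders, not the poster's actual configuration.

```shell
# Primary fencing method: HP iLO via the cluster-glue riloe plugin.
# All ilo_* values below are placeholders.
crm configure primitive st-ilo-wapgw1-1 stonith:external/riloe \
    params hostlist="wapgw1-1" ilo_hostname="192.0.2.11" \
           ilo_user="admin" ilo_password="secret"

# Fallback fencing method: meatware, which does nothing itself and
# waits for a human operator to power-cycle the node by hand.
crm configure primitive st-meat stonith:meatware \
    params hostlist="wapgw1-1 wapgw1-2"

# After manually resetting the node, the operator confirms the fence
# with meatclient (shipped with cluster-glue), which is the manual
# confirmation step mentioned in the report:
meatclient -c wapgw1-1
```

Until that `meatclient -c` confirmation arrives, the pending meatware operation blocks, which is why the second node's (otherwise working) iLO fencing was held up by the ordering discussed above.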
> It's pacemaker 1.0.10, corosync 1.2.7. Three-node cluster.
> Pavel Levshin
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf