[Pacemaker] Fencing order
Pavel Levshin
pavel at levshin.spb.ru
Mon Mar 21 15:06:13 UTC 2011
Hi.
Today, we had a network outage. Quite a few problems suddenly arose in
our setup, including a crashed corosync, the known notify bug in the DRBD
RA, and a problem with the VirtualDomain RA timing out on stop.
But the fencing behaviour was particularly strange.
Initially, one node (wapgw1-1) was partitioned from the cluster. When
the connection was restored, corosync died on that node. The node was
considered "offline unclean" and was scheduled to be fenced. Fencing by
HP iLO did not work (I do not yet know why). The second-priority fencing
method is meatware, which naturally took some time.
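For reference, our two fencing methods are configured along these lines;
this is only a sketch, with illustrative resource names and the iLO login
parameters omitted, the idea being that the stonith "priority" attribute
makes iLO the preferred device and meatware the operator-confirmed fallback:

    primitive fence-ilo stonith:external/riloe \
            params hostlist="wapgw1-1" \
            meta priority="1"
    primitive fence-manual stonith:meatware \
            params hostlist="wapgw1-1 wapgw1-2" \
            meta priority="2"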
The second node, wapgw1-2, hit the DRBD notify bug and failed to stop
some resources. It was "online unclean" and was also scheduled to be
fenced. HP iLO was available for this node, but it was not STONITHed
until I manually confirmed the STONITH of wapgw1-1.
Once I confirmed the first node's restart, the second node was fenced
automatically.
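(The manual confirmation itself was the usual meatware acknowledgement,
i.e. something like:

    meatclient -c wapgw1-1

run on a cluster node, assuming the stock cluster-glue meatware client.)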
Is this ordering the intended behaviour, or a bug?
This is Pacemaker 1.0.10 with corosync 1.2.7, in a three-node cluster.
--
Pavel Levshin