[Pacemaker] resource stickiness and preventing stonith on failback
Brian J. Murrell
brian at interlinx.bc.ca
Tue Aug 23 21:56:19 CET 2011
Hi All,
I am trying to configure Pacemaker (1.0.10) to make a single filesystem
highly available across two nodes (please don't be distracted by the
dangers of multiply-mounted filesystems, cluster filesystems, etc., as I
am absolutely clear about all of that -- consider the filesystem
resource just an example if you wish). Here is my filesystem
resource description:
node foo1
node foo2 \
attributes standby="off"
primitive OST1 ocf:heartbeat:Filesystem \
meta target-role="Started" \
operations $id="BAR1-operations" \
op monitor interval="120" timeout="60" \
op start interval="0" timeout="300" \
op stop interval="0" timeout="300" \
params device="/dev/disk/by-uuid/8c500092-5de6-43d7-b59a-ef91fa9667b9" \
directory="/mnt/bar1" fstype="ext3"
primitive st-pm stonith:external/powerman \
params serverhost="192.168.122.1:10101" poweroff="0"
clone fencing st-pm
property $id="cib-bootstrap-options" \
dc-version="1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3" \
cluster-infrastructure="openais" \
expected-quorum-votes="1" \
no-quorum-policy="ignore" \
last-lrm-refresh="1306783242" \
default-resource-stickiness="1000"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"
The two problems I have run into are:
1. preventing the resource from failing back to the node it was
previously on after it has failed over and the previous node has
been restored. Basically what's documented at
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch05s03s02.html
2. preventing the active node from being STONITHed when the resource
is moved back to its failed-and-restored node after a failover.
IOW: BAR1 is available on foo1, which fails and the resource is moved
to foo2. foo1 returns and the resource is failed back to foo1, but
in doing that foo2 is STONITHed.
For #1, as you can see, I tried setting the default resource stickiness
to 100. That didn't seem to work: when I stopped corosync on the
active node, the service failed over, but it promptly failed back when I
started corosync again, contrary to the example at the referenced URL.
Subsequently I (think I) tried adding a specific resource stickiness
of 1000. That didn't seem to help either.
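For concreteness, the per-resource attempt looked roughly like this
(reconstructed from memory, so treat the exact syntax as approximate
rather than copied from my CIB):

primitive OST1 ocf:heartbeat:Filesystem \
meta target-role="Started" resource-stickiness="1000" \
operations $id="BAR1-operations" \
...

i.e. a resource-stickiness meta attribute directly on the primitive,
which I understood should take precedence over both rsc_defaults and
the (deprecated) default-resource-stickiness property.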
As for #2, the issue with STONITHing foo2 when failing back to foo1 is
that foo1 and foo2 are an active/active pair of servers: STONITHing
foo2 just to restore foo1's services takes foo2's own services down.
I do still want a node that is believed to be dead to be STONITHed
before its resource(s) are failed over, though.
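If it helps with diagnosing either problem, I can capture what the
policy engine intends at failback time; if I remember the flags right,
something like:

ptest -sL # show the current allocation scores from the live CIB
ptest -L -VVV # show the planned transition, including any stonith ops

(the flags are from memory, so apologies if they're slightly off for
1.0.10).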
Any hints on what I am doing wrong?
Thanx and cheers,
b.