[Pacemaker] resources not migrating when some are not runnable on one node, maybe because of groups or master/slave clones?
Phil Frost
phil at macprofessionals.com
Mon Jun 18 15:39:48 CEST 2012
I'm attempting to configure an NFS cluster, and I've observed that under
some failure conditions, resources that depend on a failed resource
simply stop; no migration to another node is attempted. This happens
even though a manual migration demonstrates that the other node can run
all the resources, and the resources remain on the good node even after
the migration constraint is removed.
I was able to reduce the configuration to this:
node storage01
node storage02
primitive drbd_nfsexports ocf:pacemaker:Stateful
primitive fs_test ocf:pacemaker:Dummy
primitive vg_nfsexports ocf:pacemaker:Dummy
group test fs_test
ms drbd_nfsexports_ms drbd_nfsexports \
meta master-max="1" master-node-max="1" \
clone-max="2" clone-node-max="1" \
notify="true" target-role="Started"
location l fs_test -inf: storage02
colocation colo_drbd_master inf: ( test ) ( vg_nfsexports ) ( drbd_nfsexports_ms:Master )
property $id="cib-bootstrap-options" \
no-quorum-policy="ignore" \
stonith-enabled="false" \
dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
last-lrm-refresh="1339793579"
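To reproduce, the configuration above can be loaded with the crm shell (a sketch; it assumes a two-node test cluster already running corosync/pacemaker, and a file name of my choosing):

```shell
# Save the configuration above to test-config.crm, then load and
# commit it into the live CIB:
crm configure load update test-config.crm

# Verify what was committed:
crm configure show
```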
The location constraint "l" exists only to demonstrate the problem; I
added it to simulate the NFS server being unrunnable on one node.
To see the issue I'm experiencing, put storage01 in standby to force
everything onto storage02. fs_test will not be able to run there. Now
bring storage01 back out of standby: even though it can satisfy all the
constraints, no migration takes place. Put storage02 in standby, and
everything will migrate to storage01 and start successfully. Take
storage02 out of standby, and the services remain on storage01. This
demonstrates that even though there is a clear "best" placement where
all resources can run, Pacemaker isn't finding it.
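For reference, those reproduction steps are just standby toggles in the crm shell:

```shell
# Force everything onto storage02; fs_test cannot run there
# because of the location constraint "l":
crm node standby storage01

# Bring storage01 back; no migration occurs, even though storage01
# could run everything:
crm node online storage01

# Forcing the issue the other way does work:
crm node standby storage02   # everything migrates to storage01 and starts
crm node online storage02    # services stay on storage01
```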
So far, I've noticed any of the following changes will "fix" the problem:
- removing colo_drbd_master
- removing any one resource from colo_drbd_master
- eliminating the group "test" and referencing fs_test directly in
constraints
- using a simple clone instead of a master/slave pair for drbd_nfsexports_ms
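As an example of the last workaround, replacing the ms definition with a plain clone (and dropping the :Master role from the colocation) looks like this (a sketch; the clone name is mine):

```
clone drbd_nfsexports_clone drbd_nfsexports \
        meta clone-max="2" clone-node-max="1" \
        notify="true" target-role="Started"
colocation colo_drbd_master inf: ( test ) ( vg_nfsexports ) ( drbd_nfsexports_clone )
```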
My current understanding is that if there exists a way to run all
resources, Pacemaker should find it and prefer it. Is that not the case?
Maybe I need to restructure my colocation constraint somehow? Obviously
this is a much reduced version of a more complex practical
configuration, so I'm trying to understand the underlying mechanisms
more than just the solution to this particular scenario.
In particular, I'm not really sure how I inspect what Pacemaker is
thinking when it places resources. I've tried running crm_simulate -LRs,
but I'm a little bit unclear on how to interpret the results. In the
output, I do see this:
drbd_nfsexports:1 promotion score on storage02: 10
drbd_nfsexports:0 promotion score on storage01: 5
Those numbers seem to account for the default stickiness of 1 for
master/slave resources, but they don't seem to incorporate the
colocation constraints at all. Is that expected?
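For completeness, here is the command I ran, plus a variant that dumps the full score table and the transition graph (exact option names may differ between Pacemaker versions; this is what my 1.1.7 man page suggests):

```shell
# Show allocation and promotion scores computed from the live CIB:
crm_simulate -Ls

# Additionally write the transition graph as a Graphviz dot file
# for closer inspection:
crm_simulate -Ls -D transition.dot
```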