[Pacemaker] resources not migrating when some are not runnable on one node, maybe because of groups or master/slave clones?
Vladislav Bogdanov
bubble at hoster-ok.com
Mon Jun 18 16:14:03 CEST 2012
18.06.2012 16:39, Phil Frost wrote:
> I'm attempting to configure an NFS cluster, and I've observed that under
> some failure conditions, resources that depend on a failed resource
> simply stop, and no migration to another node is attempted, even though
> a manual migration demonstrates the other node can run all resources,
> and the resources will remain on the good node even after the migration
> constraint is removed.
>
> I was able to reduce the configuration to this:
>
> node storage01
> node storage02
> primitive drbd_nfsexports ocf:pacemaker:Stateful
> primitive fs_test ocf:pacemaker:Dummy
> primitive vg_nfsexports ocf:pacemaker:Dummy
> group test fs_test
> ms drbd_nfsexports_ms drbd_nfsexports \
> meta master-max="1" master-node-max="1" \
> clone-max="2" clone-node-max="1" \
> notify="true" target-role="Started"
> location l fs_test -inf: storage02
> colocation colo_drbd_master inf: ( test ) ( vg_nfsexports ) ( drbd_nfsexports_ms:Master )
Resource sets (constraints with more than two members) are evaluated in
the opposite order from simple two-resource constraints, so your chain of
dependencies runs the wrong way.
Try
colocation colo_drbd_master inf: ( drbd_nfsexports_ms:Master ) ( vg_nfsexports ) ( test )
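
To illustrate why the order matters: as I understand it, the set form
above behaves roughly like the following chain of ordinary two-resource
colocations (resource names are from your config; the constraint ids
here are made up for illustration):

# Hypothetical pairwise equivalent of the set form, for illustration only.
# In a simple colocation "A with B", A is the dependent and B is placed
# first; in a set, the leftmost group is placed first and later members
# follow it -- hence the reversal.
colocation c_vg_with_master inf: vg_nfsexports drbd_nfsexports_ms:Master
colocation c_test_with_vg   inf: test vg_nfsexports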
> property $id="cib-bootstrap-options" \
> no-quorum-policy="ignore" \
> stonith-enabled="false" \
> dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="2" \
> last-lrm-refresh="1339793579"
>
> The location constraint "l" exists only to demonstrate the problem; I
> added it to simulate the NFS server being unrunnable on one node.
>
> To see the issue I'm experiencing, put storage01 in standby to force
> everything on storage02. fs_test will not be able to run. Now bring
> storage01 back online, which can satisfy all the constraints, and see
> that no
> migration takes place. Put storage02 in standby, and everything will
> migrate to storage01 and start successfully. Take storage02 out of
> standby, and the services remain on storage01. This demonstrates that
> even though there is a clear "best" solution where all resources can
> run, Pacemaker isn't finding it.
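>
> For reference, the standby/online sequence above can be driven from the
> crm shell like this (commands sketched from memory; they assume the two
> nodes are named as in the config):
>
> crm node standby storage01   # forces everything onto storage02; fs_test stays stopped
> crm node online storage01    # node returns, but nothing migrates back
> crm node standby storage02   # only now does everything move to storage01 and start
> crm node online storage02    # services remain on storage01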
>
> So far, I've noticed any of the following changes will "fix" the problem:
>
> - removing colo_drbd_master
> - removing any one resource from colo_drbd_master
> - eliminating the group "test" and referencing fs_test directly in
> constraints
> - using a simple clone instead of a master/slave pair for
> drbd_nfsexports_ms
>
> My current understanding is that if there exists a way to run all
> resources, Pacemaker should find it and prefer it. Is that not the case?
> Maybe I need to restructure my colocation constraint somehow? Obviously
> this is a much reduced version of a more complex practical
> configuration, so I'm trying to understand the underlying mechanisms
> more than just the solution to this particular scenario.
>
> In particular, I'm not really sure how I inspect what Pacemaker is
> thinking when it places resources. I've tried running crm_simulate -LRs,
> but I'm a little bit unclear on how to interpret the results. In the
> output, I do see this:
>
> drbd_nfsexports:1 promotion score on storage02: 10
> drbd_nfsexports:0 promotion score on storage01: 5
>
> those numbers seem to account for the default stickiness of 1 for
> master/slave resources, but don't seem to incorporate at all the
> colocation constraints. Is that expected?
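>
> For what it's worth, crm_simulate can also dump the full per-node
> allocation scores, which I'd expect to include colocation contributions
> (flag names as on my 1.1.7 build; check crm_simulate --help on yours):
>
> # -L: use the live cluster state, -s: show allocation scores
> crm_simulate -L -s
> # the "native_color" lines give each resource's score on each node, e.g.
> # native_color: fs_test allocation score on storage01: ...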
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org