[Pacemaker] resources not migrating when some are not runnable on one node, maybe because of groups or master/slave clones?

Jake Smith jsmith at argotec.com
Mon Jun 18 16:05:52 CEST 2012


----- Original Message -----
> From: "Phil Frost" <phil at macprofessionals.com>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Monday, June 18, 2012 9:39:48 AM
> Subject: [Pacemaker] resources not migrating when some are not runnable on one node, maybe because of groups or
> master/slave clones?
> 
> I'm attempting to configure an NFS cluster, and I've observed that
> under some failure conditions, resources that depend on a failed
> resource simply stop, and no migration to another node is attempted,
> even though a manual migration demonstrates the other node can run
> all resources, and the resources will remain on the good node even
> after the migration constraint is removed.
> 
> I was able to reduce the configuration to this:
> 
> node storage01
> node storage02
> primitive drbd_nfsexports ocf:pacemaker:Stateful
> primitive fs_test ocf:pacemaker:Dummy
> primitive vg_nfsexports ocf:pacemaker:Dummy
> group test fs_test

Why don't you have vg_nfsexports in the group?  Not really any point to a group with only one resource...
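If the eventual stack is the filesystem on top of the volume group, putting both in the group buys you the ordering and colocation between them for free. A minimal sketch, assuming vg_nfsexports has to be active before fs_test:

group test vg_nfsexports fs_test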

> ms drbd_nfsexports_ms drbd_nfsexports \
>          meta master-max="1" master-node-max="1" \
>          clone-max="2" clone-node-max="1" \
>          notify="true" target-role="Started"
> location l fs_test -inf: storage02
> colocation colo_drbd_master inf: ( test ) ( vg_nfsexports ) ( drbd_nfsexports_ms:Master )

You need an order constraint here too... Pacemaker needs to know in what order to start/stop/promote things. Something like:

order ord_drbd_master_first inf: drbd_nfsexports_ms:promote vg_nfsexports:start test:start
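If pairwise constraints are easier to read, the same ordering can be split in two (equivalent; order constraints are symmetrical by default, so stops happen in the reverse order):

order ord_promote_before_vg inf: drbd_nfsexports_ms:promote vg_nfsexports:start
order ord_vg_before_test inf: vg_nfsexports:start test:start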

> property $id="cib-bootstrap-options" \
>          no-quorum-policy="ignore" \
>          stonith-enabled="false" \
>          dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
>          cluster-infrastructure="openais" \
>          expected-quorum-votes="2" \
>          last-lrm-refresh="1339793579"
> 
> The location constraint "l" exists only to demonstrate the problem; I
> added it to simulate the NFS server being unrunnable on one node.
> 
> To see the issue I'm experiencing, put storage01 in standby to force
> everything onto storage02. fs_test will not be able to run. Now bring
> storage01 back online; it can satisfy all the constraints, yet no
> migration takes place. Put storage02 in standby, and everything will
> migrate to storage01 and start successfully. Take storage02 out of
> standby, and the services remain on storage01. This demonstrates that
> even though there is a clear "best" solution where all resources can
> run, Pacemaker isn't finding it.
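For anyone wanting to reproduce this, the sequence above as crm shell commands would be roughly (node names as in the config):

crm node standby storage01   # forces everything to storage02; fs_test can't run there
crm node online storage01    # storage01 is eligible again, but nothing moves back
crm node standby storage02   # everything migrates to storage01 and starts
crm node online storage02    # services stay on storage01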
> 
> So far, I've noticed that any of the following changes will "fix"
> the problem:
> 
> - removing colo_drbd_master
> - removing any one resource from colo_drbd_master
> - eliminating the group "test" and referencing fs_test directly in
>   constraints
> - using a simple clone instead of a master/slave pair for
>   drbd_nfsexports_ms
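As a concrete sketch of the third change, with the single-member group removed the colocation would reference the primitive directly (same resource names as above):

colocation colo_drbd_master inf: ( fs_test ) ( vg_nfsexports ) ( drbd_nfsexports_ms:Master )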
> 
> My current understanding is that if there exists a way to run all
> the resources, Pacemaker should find it and prefer it. Is that not
> the case? Maybe I need to restructure my colocation constraint
> somehow? Obviously this is a much-reduced version of a more complex
> practical configuration, so I'm trying to understand the underlying
> mechanisms more than just the solution to this particular scenario.

I'm not positive, but try it with the order statement added; it might clear things up.

HTH

Jake 

> 
> In particular, I'm not really sure how to inspect what Pacemaker is
> thinking when it places resources. I've tried running crm_simulate
> -LRs, but I'm a little unclear on how to interpret the results. In
> the output, I do see this:
> 
> drbd_nfsexports:1 promotion score on storage02: 10
> drbd_nfsexports:0 promotion score on storage01: 5
> 
> Those numbers seem to account for the default stickiness of 1 for
> master/slave resources, but they don't seem to incorporate the
> colocation constraints at all. Is that expected?
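Not sure about the promotion-score question, but colocation weights usually show up in the allocation scores (the "native_color: ... allocation score on ..." lines) rather than the promotion scores. You can dump all of them against the live CIB with:

crm_simulate -sL

(-s shows the scores, -L checks the live cluster state.)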