[Pacemaker] resources not migrating when some are not runnable on one node, maybe because of groups or master/slave clones?

Tue Jun 19 20:31:18 UTC 2012

----- Original Message -----
> From: "Phil Frost" <phil at macprofessionals.com>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Monday, June 18, 2012 8:39:48 AM
> Subject: [Pacemaker] resources not migrating when some are not runnable on one node, maybe because of groups or
> master/slave clones?
> 
> I'm attempting to configure an NFS cluster, and I've observed that under
> some failure conditions, resources that depend on a failed resource
> simply stop, and no migration to another node is attempted, even though
> a manual migration demonstrates the other node can run all resources,
> and the resources will remain on the good node even after the migration
> constraint is removed.
> 
> I was able to reduce the configuration to this:
> 
> node storage01
> node storage02
> primitive drbd_nfsexports ocf:pacemaker:Stateful
> primitive fs_test ocf:pacemaker:Dummy
> primitive vg_nfsexports ocf:pacemaker:Dummy
> group test fs_test
> ms drbd_nfsexports_ms drbd_nfsexports \
>          meta master-max="1" master-node-max="1" \
>          clone-max="2" clone-node-max="1" \
>          notify="true" target-role="Started"
> location l fs_test -inf: storage02
> colocation colo_drbd_master inf: ( test ) ( vg_nfsexports ) ( drbd_nfsexports_ms:Master )
> property $id="cib-bootstrap-options" \
>          no-quorum-policy="ignore" \
>          stonith-enabled="false" \
>          dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
>          cluster-infrastructure="openais" \
>          expected-quorum-votes="2" \
>          last-lrm-refresh="1339793579"
> 
> The location constraint "l" exists only to demonstrate the problem; I
> added it to simulate the NFS server being unrunnable on one node.
> 
> To see the issue I'm experiencing, put storage01 in standby to force
> everything onto storage02. fs_test will not be able to run. Now bring
> storage01, which can satisfy all the constraints, back online, and see
> that no migration takes place. Put storage02 in standby, and everything will
> migrate to storage01 and start successfully. Take storage02 out of
> standby, and the services remain on storage01. This demonstrates that
> even though there is a clear "best" solution where all resources can
> run, Pacemaker isn't finding it.

Can you please attach a crm_report covering what happens when you put the two nodes in standby?  Being able to see the XML and how the policy engine evaluates the transitions is helpful.
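
Something along these lines should do it. This is only a sketch: the time window and the destination name below are made-up placeholders, so adjust them to bracket the period in which you ran the standby test. The script prints the command first and only runs it where crm_report is actually installed.

```shell
#!/bin/sh
# Hypothetical time window and destination; adjust to cover your standby test.
FROM="2012-06-18 08:00:00"
TO="2012-06-18 09:00:00"
DEST="/tmp/standby-test"

# Build the command as a string first so it can be inspected before running.
CMD="crm_report -f \"$FROM\" -t \"$TO\" $DEST"
echo "$CMD"

# Only collect the report on a cluster node where crm_report is available.
if command -v crm_report >/dev/null 2>&1; then
    crm_report -f "$FROM" -t "$TO" "$DEST"
fi
```

The resulting archive can then be attached to a reply on the list.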

-- Vossel