[Pacemaker] Prevent one node from being a target for resource migration

Andrew Beekhof andrew at beekhof.net
Mon Jan 12 21:18:43 EST 2015


> On 13 Jan 2015, at 4:25 am, David Vossel <dvossel at redhat.com> wrote:
> 
> 
> 
> ----- Original Message -----
>> Hello.
>> 
>> I have a 3-node cluster managed by corosync+pacemaker+crm. Node1 and Node2
>> are DRBD master-slave; they also have a number of other services installed
>> (postgresql, nginx, ...). Node3 is just a corosync node (for quorum): no
>> DRBD/postgresql/... is installed on it, only corosync+pacemaker.
>> 
>> But when I add resources to the cluster, some of them are somehow moved to
>> node3 and then fail. Note that I have a "colocation" directive to place
>> these resources on the DRBD master only, and a "location" constraint with
>> -inf for node3, but this does not help - why? How do I make pacemaker not
>> run anything on node3?
>> 
>> All the resources are added in a single transaction: "cat config.txt | crm -w
>> -f- configure", where config.txt contains the directives with a "commit"
>> statement at the end.
>> 
>> Below are "crm status" (error messages) and "crm configure show" outputs.
>> 
>> 
>> root at node3:~# crm status
>> Current DC: node2 (1017525950) - partition with quorum
>> 3 Nodes configured
>> 6 Resources configured
>> Online: [ node1 node2 node3 ]
>>  Master/Slave Set: ms_drbd [drbd]
>>      Masters: [ node1 ]
>>      Slaves: [ node2 ]
>>  Resource Group: server
>>      fs          (ocf::heartbeat:Filesystem):  Started node1
>>      postgresql  (lsb:postgresql):             Started node3 FAILED
>>      bind9       (lsb:bind9):                  Started node3 FAILED
>>      nginx       (lsb:nginx):                  Started node3 (unmanaged) FAILED
>> Failed actions:
>>     drbd_monitor_0 (node=node3, call=744, rc=5, status=complete,
>>         last-rc-change=Mon Jan 12 11:16:43 2015, queued=2ms, exec=0ms): not installed
>>     postgresql_monitor_0 (node=node3, call=753, rc=1, status=complete,
>>         last-rc-change=Mon Jan 12 11:16:43 2015, queued=8ms, exec=0ms): unknown error
>>     bind9_monitor_0 (node=node3, call=757, rc=1, status=complete,
>>         last-rc-change=Mon Jan 12 11:16:43 2015, queued=11ms, exec=0ms): unknown error
>>     nginx_stop_0 (node=node3, call=767, rc=5, status=complete,
>>         last-rc-change=Mon Jan 12 11:16:44 2015, queued=1ms, exec=0ms): not installed
> 
> Here's what is going on. Even when you say "never run this resource on
> node3", pacemaker will still probe for the resource on node3, just to
> verify it isn't already running there.
> 
> The "monitor_0" failures you are seeing indicate that pacemaker could not
> verify whether the resources are running on node3, because the packages
> backing those resources are not installed there. Given pacemaker's default
> behavior, I'd expect this.
> 
> You have two options.
> 
> 1. Install the resource-related packages on node3 even though you never
> want the resources to run there. This allows the resource agents to verify
> that the resources are in fact inactive.

or 1b: delete the agent too. Recent versions of pacemaker should handle this case correctly.

> 
> 2. If you are using the current master branch of pacemaker, there's a new
> location constraint option, 'resource-discovery=always|never|exclusive'.
> If you add 'resource-discovery=never' to the location constraint that
> keeps resources off node3, pacemaker will skip the 'monitor_0' probes on
> node3 as well.
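For reference, the constraint David describes would look something like this in raw CIB XML (a sketch only: the resource-discovery attribute is understood only by builds that include the feature, and the ids are taken from the poster's configuration, with the rule collapsed to its equivalent node/score form):

```
<!-- keep the "server" group off node3 AND skip resource discovery
     (the monitor_0 probes) there entirely -->
<rsc_location id="loc_server" rsc="server" node="node3"
              score="-INFINITY" resource-discovery="never"/>
```

A fragment like this can be loaded with cibadmin (e.g. "cibadmin --replace -o constraints --xml-file constraint.xml"); whether the crm shell accepts resource-discovery directly depends on the crmsh version.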
> 
> -- Vossel
> 
>> 
>> root at node3:~# crm configure show | cat
>> node $id="1017525950" node2
>> node $id="13071578" node3
>> node $id="1760315215" node1
>> primitive drbd ocf:linbit:drbd \
>>     params drbd_resource="vlv" \
>>     op start interval="0" timeout="240" \
>>     op stop interval="0" timeout="120"
>> primitive fs ocf:heartbeat:Filesystem \
>>     params device="/dev/drbd0" directory="/var/lib/vlv.drbd/root" \
>>         options="noatime,nodiratime" fstype="xfs" \
>>     op start interval="0" timeout="300" \
>>     op stop interval="0" timeout="300"
>> primitive postgresql lsb:postgresql \
>>     op monitor interval="10" timeout="60" \
>>     op start interval="0" timeout="60" \
>>     op stop interval="0" timeout="60"
>> primitive bind9 lsb:bind9 \
>>     op monitor interval="10" timeout="60" \
>>     op start interval="0" timeout="60" \
>>     op stop interval="0" timeout="60"
>> primitive nginx lsb:nginx \
>>     op monitor interval="10" timeout="60" \
>>     op start interval="0" timeout="60" \
>>     op stop interval="0" timeout="60"
>> group server fs postgresql bind9 nginx
>> ms ms_drbd drbd meta master-max="1" master-node-max="1" clone-max="2" \
>>     clone-node-max="1" notify="true"
>> location loc_server server rule $id="loc_server-rule" -inf: #uname eq node3
>> colocation col_server inf: server ms_drbd:Master
>> order ord_server inf: ms_drbd:promote server:start
>> property $id="cib-bootstrap-options" \
>>     stonith-enabled="false" \
>>     last-lrm-refresh="1421079189" \
>>     maintenance-mode="false"
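Once the agents are installed on node3 (or the constraint carries resource-discovery=never), placement can be sanity-checked without touching the running resources; crm_simulate ships with pacemaker and replays the live CIB (-s shows allocation scores, -L reads the live cluster):

```
# show allocation scores against the live cluster; every member of the
# "server" group should score -INFINITY on node3
crm_simulate -sL
```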
>> 
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> 
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>> 
> 




