[Pacemaker] When the ex-live server comes back online, it tries to fail back, causing a failure and restart of services

Andrew Beekhof andrew at beekhof.net
Sun Feb 16 19:16:49 EST 2014


On 17 Jan 2014, at 4:33 pm, Michael Monette <mmonette at 2keys.ca> wrote:

> Hi,
> 
> I have 2 servers set up with Postgres, and /dev/drbd1 is mounted at /var/lib/pgsql. I also have Pacemaker set up, and it's configured to fail over back and forth between the 2 nodes. It works really well for the most part.
> 
> I am having this one problem, and it is happening to all 4 of my clusters. If the "web_services" resource group is running on database-2.hehe.org and I do a hard reset on it, it fails over fine and within a few seconds the DB is running on database-1.hehe.org. I turn the system back on and everything is fine. It comes back online with no issue and everything continues to run normally on database-1. crm_mon shows no errors at all; the node simply goes into online status.
> 
> HOWEVER, if I do a hard shutdown on database-1 (or any of my primary nodes: ldap-1, idp-1, acc-1), it fails over to database-2 just fine. But when it comes back online, it seems like Pacemaker tries to move the resources back to database-1, fails, and then the services get restarted on database-2 because they are being moved back.

Check out resource-stickiness.  Set it to 100 (or so) and you should get the behaviour you want.
If not, you might find database-1 is starting pgsql or drbd at boot time.
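For example, with the crm shell (an untested sketch; the service names are taken from your config):

    crm configure rsc_defaults resource-stickiness=100

and, to stop the OS from starting the services outside the cluster's control:

    chkconfig postgresql off
    chkconfig drbd off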

> 
> Why is it that all of my first nodes try to take the resources back when they come back online, but none of the second nodes do this? Is there any way to prevent this? Can Pacemaker not check whether the resources in the cluster are already running and, if so, just become an available node for the next time?
> 
> I tried setting resource-stickiness to infinity. I have also tried starting the corosync/pacemaker services with the node in standby beforehand, and it's always the same thing. Once node-1 is online, all the services on node-2 get interrupted trying to fail back, which fails (probably just because drbd is already in use on the other end).
> 
> Here is my config:
> 
> node database-1.hehe.org \
>       attributes standby="off"
> node database-2.hehe.org \
>       attributes standby="off"
> primitive drbd_data ocf:linbit:drbd \
>       params drbd_resource="res1" \
>       op monitor interval="29s" role="Master" \
>       op monitor interval="31s" role="Slave"
> primitive fs_data ocf:heartbeat:Filesystem \
>       params device="/dev/drbd1" directory="/var/lib/pgsql" fstype="ext4"
> primitive httpd lsb:postgresql
> primitive ip_httpd ocf:heartbeat:IPaddr2 \
>       params ip="10.199.0.11"
> group web_services fs_data ip_httpd httpd
> ms ms_drbd_data drbd_data \
>       meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> colocation web_services_on_drbd inf: httpd ms_drbd_data:Master
> order web_services_after_drbd inf: ms_drbd_data:promote web_services:start
> property $id="cib-bootstrap-options" \
>       dc-version="1.1.10-14.el6_5.1-368c726" \
>       cluster-infrastructure="classic openais (with plugin)" \
>       expected-quorum-votes="2" \
>       stonith-enabled="false" \
>       no-quorum-policy="ignore" \
>       last-lrm-refresh="1389926961"
> 
> Thanks
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
