[Pacemaker] Preventing Automatic Failback
Michael Monette
mmonette at 2keys.ca
Wed Jan 22 04:13:34 UTC 2014
I just wanted to update you on this.
I checked into the scores you were talking about, and I really thought you were right! I set the score on node-1 to 1001 through the linbit DRBD script on node-1 and left the other at 10000. Restarting node-1 made it all happen again. I even realized that this was happening even when node-2 was the currently active node and I restarted node-1.
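(For reference: these scores are stored as transient node attributes named master-<primitive-id>, so assuming the usual naming convention for my drbd_data primitive, the value each node is advertising can be queried like this; the flags are standard crm_attribute ones:)

    # Query the transient (reboot-lifetime) master score the drbd agent sets
    crm_attribute -N node-1.mycompany.com -l reboot -n master-drbd_data -G
    crm_attribute -N node-2.mycompany.com -l reboot -n master-drbd_data -G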
I don't want to be confusing, but I'm almost certain the problem is with my DRBD setup. If both nodes are online and working, I can run a "service drbd restart" on node-2 and it restarts fine. In the same scenario, if I run a "service drbd restart" on node-1, it waits and counts down, waiting for node-2 to appear. Then I must "service drbd restart" on node-2 and bang, drbd starts in both places. (For some reason node-1 waits for node-2 to be restarted, but node-2 does not wait for node-1.)
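(That countdown is DRBD's wait-for-connection behaviour at startup, controlled by the startup section of the DRBD config. A minimal sketch of the relevant knobs, with illustrative values rather than my real ones; if the two nodes carry different values here, that could explain why only one side blocks:)

    common {
        startup {
            # How long "service drbd start" blocks waiting for the peer.
            # 0 means wait forever; a finite value ends the countdown.
            wfc-timeout      60;
            # Shorter wait when the node was already degraded beforehand.
            degr-wfc-timeout 30;
        }
    }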
I also noticed that when node-1 is coming online, I see this for roughly 15-20 seconds:
Resource Group: jira_services
     drbd1_opt-atlassian (ocf::heartbeat:Filesystem): Started [ node-2.mycompany.com node-1.mycompany.com ]
     drbd2_var-atlassian (ocf::heartbeat:Filesystem): Started [ node-2.mycompany.com node-1.mycompany.com ]
Both hostnames are there because when node-1 comes back up, it starts the DRBD service and, because of the problem I am having, sits and waits for 20 seconds until the drbd service on node-2 gets restarted. I am pretty sure I am just lucky that there is a default setting to restart resources on failure... otherwise I doubt I would have even gotten this far, lol.
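(That default is Pacemaker's on-fail=restart handling of monitor failures; how many failures it tolerates before moving a resource off the node is governed by migration-threshold. A hedged sketch of tuning it in crm shell syntax, with illustrative values:)

    # Ban a resource from a node after 3 failures, and forget recorded
    # failures after 10 minutes so the node becomes eligible again.
    crm configure rsc_defaults migration-threshold=3 failure-timeout=600
    # Clear the recorded failcount by hand once the real problem is fixed:
    crm resource cleanup jira_services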
I don't see why drbd on node-2 can be restarted and it resyncs fine, but if I want to restart the drbd service on node-1, I must also do it on node-2. Anyway, it's a new problem now, but I think I got to the bottom of it.
Thanks
Mike
----- Original Message -----
From: "Michael Monette" <mmonette at 2keys.ca>
To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
Sent: Tuesday, January 21, 2014 11:24:35 AM
Subject: Re: [Pacemaker] Preventing Automatic Failback
Also, one final thing I want to add.
Corosync and pacemaker are enabled with chkconfig, so a hard reboot essentially restarts the services too. The moment pacemaker is started at boot, this happens. (Although I've tried disabling them and starting the services manually after I recover the server, and it's the same issue.)
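(For context, toggling the stack at boot on RHEL 6 is just the standard SysV tooling:)

    # Autostart the cluster stack:
    chkconfig corosync on
    chkconfig pacemaker on
    # Or disable autostart and bring it up manually after recovering a node:
    chkconfig corosync off
    chkconfig pacemaker off
    service corosync start && service pacemaker start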
Thanks again
Mike
David Vossel <dvossel at redhat.com> wrote:
----- Original Message -----
From: "Michael Monette" <mmonette at 2keys.ca>
To: pacemaker at oss.clusterlabs.org
Sent: Monday, January 20, 2014 8:22:25 AM
Subject: [Pacemaker] Preventing Automatic Failback
Hi,
I posted this question before but my question was a bit unclear.
I have 2 nodes with DRBD with Postgresql.
When node-1 fails, everything fails over to node-2. But when node-1 is recovered,
things try to fail back to node-1, and all the services running on node-2 get
disrupted (things don't ACTUALLY fail back to node-1... they try, fail, and
then all services on node-2 are simply restarted... very annoying). This does
not happen if I perform the same tests on node-2! I can reboot node-2,
things fail over to node-1, and node-2 comes online and waits until it is
needed (this is what I want!). It seems to only affect node-1.
I have tried to set resource stickiness, and I have tried everything I can
think of, but whenever the primary has recovered, it always disrupts the
services running on node-2.
I also tried removing things from this config to try to isolate it. At one
point I removed the atlassian_jira and drbd2_var primitives and only had a
failover-ip and drbd1_opt, but I still had the same problem. Hopefully someone
can pinpoint this for me. If I can't avoid it entirely, I would at least
like to make this "bug" or whatever happen on node-2 instead of the active node.
I bet this is due to the drbd resource's master score value on node1 being higher than node2's. When you recover node1, are you actually rebooting that node? If node1 doesn't lose membership from the cluster (reboot), the transient attributes that the drbd agent uses to specify which node will
be the master instance will stick around. Otherwise, if you are just putting node1 in standby and then bringing the node back online, then I believe the resources will come back if the drbd master was originally on node1.
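(To illustrate: transient attributes live in the status section of the CIB and are discarded only when the node leaves the corosync membership, so a standby/online cycle leaves them in place. Assuming the usual master-<primitive-id> naming for the drbd_data primitive in the config below, a stale preference could be removed by hand:)

    # Transient attributes persist across standby/online but not across reboot.
    # Drop the stale master preference so the next promotion starts fresh:
    crm_attribute -N node-1 -l reboot -n master-drbd_data -D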
If you provide a policy engine file that shows the unwanted transition from node2 back to node1, we'll be able to tell you exactly why it is occurring.
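(On 1.1.10 those files land in /var/lib/pacemaker/pengine/ on the DC; the offending transition can be replayed offline. The file name below is a placeholder for whichever pe-input was written at the moment of the failback:)

    # Replay a saved policy engine input and show the resulting transition:
    crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-123.bz2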
-- Vossel
Here is my config:
node node-1.comp.com \
        attributes standby="off"
node node-2.comp.com \
        attributes standby="off"
primitive atlassian_jira lsb:jira \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="240"
primitive drbd1_opt ocf:heartbeat:Filesystem \
        params device="/dev/drbd1" directory="/opt/atlassian" fstype="ext4"
primitive drbd2_var ocf:heartbeat:Filesystem \
        params device="/dev/drbd2" directory="/var/atlassian" fstype="ext4"
primitive drbd_data ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="29s" role="Master" \
        op monitor interval="31s" role="Slave"
primitive failover-ip ocf:heartbeat:IPaddr2 \
        params ip="10.199.0.13"
group jira_services drbd1_opt drbd2_var failover-ip atlassian_jira
ms ms_drbd_data drbd_data \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation jira_services_on_drbd inf: atlassian_jira ms_drbd_data:Master
order jira_services_after_drbd inf: ms_drbd_data:promote jira_services:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.10-14.el6_5.1-368c726" \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1390183165" \
        default-resource-stickiness="INFINITY"
rsc_defaults $id="rsc-options" \
        resource-stickiness="INFINITY"
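(One way to check whether the INFINITY stickiness above actually outweighs the drbd master score at failback time is to dump the live allocation scores:)

    # Show allocation scores for every resource/node pair on the live cluster:
    crm_simulate -sL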
Thanks
Mike
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
_______________________________________________
Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org