[Pacemaker] Recovery after simple master-master failover
David Gubler
dg at doodle.com
Tue Feb 21 13:04:34 UTC 2012
Hi list,
We have two entry servers (running Apache on Debian Squeeze/Pacemaker
1.0.9 with Heartbeat), both of which are active at the same time. Users
may use either of the two servers at any time.
Now, if one of them fails, users should all be redirected to the other
server, as transparently as possible, using two virtual IP addresses.
I absolutely don't want Pacemaker interfering with Apache; all I want
it to do is monitor Apache and move the IP address if it goes down.
Thus, I set up this configuration (simplified, IPv6 removed):
node $id="101b0c74-2fd5-46a5-bb65-702cb3188c11" entry1
node $id="6ec6b85c-c44c-406d-97aa-1a8da56dc041" entry2
primitive apache ocf:heartbeat:apache \
params statusurl="http://localhost/server-status" \
op monitor interval="30s" \
meta is-managed="false"
primitive siteIp4A ocf:heartbeat:IPaddr \
params ip="188.92.145.78" cidr_netmask="255.255.255.192" nic="eth0" \
op monitor interval="15s"
primitive siteIp4B ocf:heartbeat:IPaddr \
params ip="188.92.145.79" cidr_netmask="255.255.255.192" nic="eth0" \
op monitor interval="15s"
clone apacheClone apache
colocation coloDistribute -100: siteIp4A siteIp4B
colocation coloSiteA inf: siteIp4A apacheClone
colocation coloSiteB inf: siteIp4B apacheClone
property $id="cib-bootstrap-options" \
dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
cluster-infrastructure="Heartbeat" \
stonith-enabled="false" \
last-lrm-refresh="1329758239"
Yes, I know the usual disclaimer about stonith, but we don't care,
because the worst thing that could happen is that both nodes take both
IP addresses, which is a risk we can totally live with. Even if that
situation happens, Pacemaker recovers from it as soon as the two nodes
see each other again.
So far, so good, failover appears to work (e.g. if I simulate a monitor
failure by using iptables to cut off the monitor), but:
1. After the failed Apache comes back up, pacemaker doesn't notice this,
unless I do a manual resource cleanup. I think this is because the
monitor is stopped on failure. I have played with
monitor on-fail="ignore" and "restart"
and
failure-timeout=60s
on the "apache" primitive, but no luck; the cluster doesn't notice that
Apache is back up.
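For concreteness, the variant I tried looked roughly like this (sketched from memory). The cluster-recheck-interval line is an untested assumption on my part, added because, as far as I understand, failure-timeout is only evaluated when the policy engine runs:

```shell
# Sketch from memory, not my exact config. failure-timeout apparently
# only expires when the policy engine re-runs, which by default happens
# on cluster events, not on a timer.
crm configure primitive apache ocf:heartbeat:apache \
    params statusurl="http://localhost/server-status" \
    op monitor interval="30s" \
    meta is-managed="false" failure-timeout="60s"

# Untested assumption: force a policy engine run every 2 minutes so an
# expired failure-timeout actually gets acted upon.
crm configure property cluster-recheck-interval="2min"
```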
I need this to happen automatically, because monitor failures can happen
from time to time, and I do not want to use migration-threshold because
I really want a quick failover.
Yes, I know I could run a cron job that does a cleanup every minute, but
that cannot be the way to go, right? Especially since it might have
other side effects (IPs being stopped during cleanup, or the like?)
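(To be concrete, the workaround I'd rather avoid would be something like this system crontab entry; apacheClone is the clone from my config above:)

```shell
# Hypothetical /etc/cron.d entry, shown only to illustrate the
# workaround: clear the failure record every minute so the monitor
# operation is scheduled again.
* * * * *  root  crm resource cleanup apacheClone >/dev/null 2>&1
```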
2. When I reconfigure things or restart Heartbeat (and Pacemaker with
it), the apache primitive can get into the "orphaned" state, which means
that Pacemaker will stop it. While this may be reasonable for the IP
primitives, it looks like a bug for a resource with is-managed="false"
(I mean, which part of "do not start or stop this resource" does
Pacemaker not understand?). Unfortunately, I couldn't find any way to
disable this behaviour except for the global "stop-orphan-actions"
option, which is probably not what I want. Am I missing something here?
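For completeness, the only knobs I found are global. As far as I can tell from the documentation (and I may be misreading it), stop-orphan-resources is what governs whether orphaned resources get stopped, while stop-orphan-actions covers recurring operations, and there seems to be no per-resource equivalent:

```shell
# Global, cluster-wide switch (semantics as I understand the docs; I
# have not tested whether this is safe for the IP primitives, which I
# *do* want Pacemaker to clean up).
crm configure property stop-orphan-resources="false"
```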
I have spent hours trying to figure out how this is supposed to work,
but no dice :(
Any help would be greatly appreciated. Thanks!
Best regards,
David
--
David Gubler
Senior Software & Operations Engineer
MeetMe: http://doodle.com/david
E-Mail: dg at doodle.com