[Pacemaker] Recovery after simple master-master failover
David Gubler
dg at doodle.com
Tue Feb 21 13:04:34 UTC 2012
Hi list,
We have two entry servers (running Apache on Debian Squeeze/Pacemaker
1.0.9 with Heartbeat), both of which are active at the same time. Users
may use either of the two servers at any time.
Now, if one of them fails, users should all be redirected to the other
server, as transparently as possible, using two virtual IP addresses.
I absolutely don't want Pacemaker interfering with Apache; all I want
it to do is monitor Apache and move the IP address if it goes down.
Thus, I set up this configuration (simplified, IPv6 removed):
node $id="101b0c74-2fd5-46a5-bb65-702cb3188c11" entry1
node $id="6ec6b85c-c44c-406d-97aa-1a8da56dc041" entry2
primitive apache ocf:heartbeat:apache \
params statusurl="http://localhost/server-status" \
op monitor interval="30s" \
meta is-managed="false"
primitive siteIp4A ocf:heartbeat:IPaddr \
params ip="188.92.145.78" cidr_netmask="255.255.255.192" nic="eth0" \
op monitor interval="15s"
primitive siteIp4B ocf:heartbeat:IPaddr \
params ip="188.92.145.79" cidr_netmask="255.255.255.192" nic="eth0" \
op monitor interval="15s"
clone apacheClone apache
colocation coloDistribute -100: siteIp4A siteIp4B
colocation coloSiteA inf: siteIp4A apacheClone
colocation coloSiteB inf: siteIp4B apacheClone
property $id="cib-bootstrap-options" \
dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
cluster-infrastructure="Heartbeat" \
stonith-enabled="false" \
last-lrm-refresh="1329758239"
Yes, I know the usual disclaimer about stonith, but we don't care,
because the worst thing that could happen is that both nodes take both
IP addresses, which is a risk we can totally live with. Even if that
situation happens, Pacemaker recovers from it as soon as the two nodes
see each other again.
So far, so good, failover appears to work (e.g. if I simulate a monitor
failure by using iptables to cut off the monitor), but:
1. After the failed Apache comes back up, pacemaker doesn't notice this,
unless I do a manual resource cleanup. I think this is because the
monitor is stopped on failure. I have played with
monitor on-fail="ignore" and "restart"
and
failure-timeout=60s
on the "apache" primitive, but no luck; the cluster doesn't notice that
Apache is back up.
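For concreteness, the variant I tried looked roughly like this (sketched from memory). The cluster-recheck-interval line is an untested assumption on my part, added because, as far as I understand, failure-timeout is only evaluated when the policy engine runs:

```shell
# Sketch from memory, not my exact config. failure-timeout apparently
# only expires when the policy engine re-runs, which by default happens
# on cluster events, not on a timer.
crm configure primitive apache ocf:heartbeat:apache \
    params statusurl="http://localhost/server-status" \
    op monitor interval="30s" \
    meta is-managed="false" failure-timeout="60s"

# Untested assumption: force a policy engine run every 2 minutes so an
# expired failure-timeout actually gets acted upon.
crm configure property cluster-recheck-interval="2min"
```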
I need this to happen automatically, because monitor failures can happen
from time to time, and I do not want to use migration-threshold because
I really want a quick failover.
Yes, I know I could run a cron job that does a cleanup every minute, but
that cannot be the way to go, right? Especially since it might have
other side effects (IPs being stopped during cleanup, or the like?)
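(To be concrete, the workaround I'd rather avoid would be something like this system crontab entry; apacheClone is the clone from my config above:)

```shell
# Hypothetical /etc/cron.d entry, shown only to illustrate the
# workaround: clear the failure record every minute so the monitor
# operation is scheduled again.
* * * * *  root  crm resource cleanup apacheClone >/dev/null 2>&1
```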
2. When I reconfigure things or restart Heartbeat (and Pacemaker with
it), the apache primitive can get into the "orphaned" state, which means
that Pacemaker will stop it. While this may be reasonable for the IP
primitives, it looks like a bug for a resource with is-managed="false"
(I mean, which part of "do not start or stop this resource" does
Pacemaker not understand?). Unfortunately, I couldn't find any way to
disable this behaviour except for the global "stop-orphan-actions"
option, which is probably not what I want. Am I missing something here?
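For completeness, the only knobs I found are global. As far as I can tell from the documentation (and I may be misreading it), stop-orphan-resources is what governs whether orphaned resources get stopped, while stop-orphan-actions covers recurring operations, and there seems to be no per-resource equivalent:

```shell
# Global, cluster-wide switch (semantics as I understand the docs; I
# have not tested whether this is safe for the IP primitives, which I
# *do* want Pacemaker to clean up).
crm configure property stop-orphan-resources="false"
```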
I have spent hours trying to figure out how this is supposed to work,
but no dice :(
Any help would be greatly appreciated. Thanks!
Best regards,
David
--
David Gubler
Senior Software & Operations Engineer
MeetMe: http://doodle.com/david
E-Mail: dg at doodle.com