[Pacemaker] Wrong system send arp reply when using IPaddr

Sun Mar 17 12:17:54 EDT 2013

First off, I'm going to preface this by admitting that what I am about 
to explain makes no sense and doesn't follow normal logic, and by 
promising that I'm not a complete idiot. I've beaten my head against a 
wall with this issue for two days and made no progress, and we've had a 
couple of production system outages because of it.

Environment is a pair of IBM x-series systems in a DMZ connected to an 
ASA5500. Each IBM box has two interfaces in a mode=4 bond connected to 
two switches, which connect to the primary/secondary firewalls and are 
interconnected - Poor man's redundancy, I suppose. Both boxes run 
RHEL6.3 and Pacemaker 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14. 
The ASA has an ARP table timeout of 4 hours.

There are about a dozen IPaddr resources in a group which are configured 
with meta ordered="false" collocated="false" - Each is independent from 
a service perspective, but the group makes it easy to manage them. Each 
box runs LVS with mangle rules, which assign fwmark values for routing 
within LVS - For whatever reason, this still requires the IP to be on 
the box receiving the packet through LVS, even if the mangle rule is 
triggered.

We've had a couple of instances for two IPs in this configuration where 
Pacemaker (and syslog) indicate the IP is assigned to box 01, yet the 
firewall receives an ARP reply from box 02. Didn't believe it at first 
until we grabbed packets from a SPAN on the switches. Correct IP address 
in reply, MAC of one of the bonded interfaces on box 02, yet the IP 
isn't on it.
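
For anyone wanting to run the same check against their own capture, 
here is a minimal sketch of the comparison described above: decode an 
ARP reply and see whether the sender MAC belongs to the box that is 
supposed to hold the IP. It assumes plain Ethernet/IPv4 ARP with the 
Ethernet header already stripped. The function names, the host labels 
and the two lookup tables (ip_owner, mac_to_host) are hypothetical and 
would have to be filled in by hand from the cluster status and the bond 
slave MACs.

```python
import struct

ARP_REPLY = 2  # ARP opcode: 1 = request, 2 = reply


def parse_arp(payload):
    """Decode the 28-byte Ethernet/IPv4 ARP body (no Ethernet header)."""
    # Fields: htype, ptype, hlen, plen, oper, sender MAC, sender IP,
    # target MAC, target IP, all in network byte order.
    _, _, _, _, oper, sha, spa, _, _ = struct.unpack(
        "!HHBBH6s4s6s4s", payload[:28])
    return {
        "oper": oper,
        "sender_mac": ":".join("%02x" % b for b in sha),
        "sender_ip": ".".join(str(b) for b in spa),
    }


def wrong_responder(payload, ip_owner, mac_to_host):
    """Return (ip, expected_host, actual_host) for a reply sent by the
    wrong box, or None if the packet is not a reply or looks fine.

    ip_owner:    hypothetical map of IP -> host the cluster says holds it
    mac_to_host: hypothetical map of interface MAC -> host
    """
    arp = parse_arp(payload)
    if arp["oper"] != ARP_REPLY:
        return None
    expected = ip_owner.get(arp["sender_ip"])
    actual = mac_to_host.get(arp["sender_mac"], "unknown")
    if expected is not None and actual != expected:
        return (arp["sender_ip"], expected, actual)
    return None
```

Feeding it the ARP bodies extracted from the SPAN capture would flag 
exactly the case seen here: a reply carrying the right IP but a MAC from 
the box that does not hold it.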

We've seen it in both directions: 01 answering ARP for an IP on 02, and 
02 answering ARP for an IP on 01. Last night when we had the issue, an 
IP was on 02 and 01 answered the ARP for it. I ran tcpdump on 01 and saw 
SYN packets coming in for that IP - Makes sense given the ARP reply, but 
doesn't explain why the box answered the ARP in the first place.
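
To double-check the "IP isn't on it" half of this on the box that 
answered, a rough sketch like the following can list which IPv4 
addresses the kernel actually has configured. It parses text in the 
one-line-per-address layout printed by "ip -o -4 addr show". The 
function name is hypothetical, and the parsing assumes the interface 
name is the second field and the address follows the "inet" keyword.

```python
def local_ipv4_addresses(ip_addr_output):
    """Map each configured IPv4 address to the interfaces carrying it.

    ip_addr_output is assumed to be text from `ip -o -4 addr show`,
    one address per line, e.g. "2: bond0    inet 192.0.2.5/24 ...".
    """
    addrs = {}
    for line in ip_addr_output.splitlines():
        fields = line.split()
        if "inet" not in fields:
            continue
        i = fields.index("inet")
        # Need an interface name before "inet" and an address after it.
        if i < 2 or i + 1 >= len(fields):
            continue
        iface = fields[1]
        addr = fields[i + 1].split("/")[0]  # drop the prefix length
        addrs.setdefault(addr, []).append(iface)
    return addrs

# Possible use on a live box (not run here):
#   import subprocess
#   out = subprocess.check_output(["ip", "-o", "-4", "addr", "show"], text=True)
#   print("192.0.2.10" in local_ipv4_addresses(out))
```

If the address in the bad ARP reply is missing from this map on the box 
whose MAC showed up in the capture, that confirms the mismatch from the 
host side as well as from the wire.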

I realize this likely isn't a Pacemaker issue, but I was hoping someone 
else might have experienced a similar issue, or can at least point me in 
the right direction. We have a far more complex Pacemaker/LVS 
environment on our inside network (which isn't link-local to the ASA - 
it goes through an inside router) which works flawlessly, so I'm open to 
the possibility that something is totally screwed up in our DMZ.

Sorry that was long. :)