[Pacemaker] Wrong system send arp reply when using IPaddr
David Coulson
david at davidcoulson.net
Sun Mar 17 16:17:54 UTC 2013
First off, I'm going to preface this with the realization that what I am
explaining makes no sense and doesn't follow normal logic, and that I'm
not a complete idiot. I've beaten my head against a wall with this issue
for two days and have made no progress, yet we've had a couple of
production system outages because of it.
Environment is a pair of IBM x-series systems in a DMZ connected to an
ASA5500. Each IBM box has two interfaces in a mode=4 bond connected to
two switches, which connect to the pri/sec firewall and are
interconnected - Poor man's redundancy I suppose. Both boxes run RHEL6.3
and Pacemaker 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14. ASA
has an ARP table timeout of 4 hours.
There are about a dozen IPaddr resources in a group which are configured
with meta ordered="false" collocated="false" - Each is independent from
a service perspective, but the group makes it easy to manage them. Each
box runs LVS with mangle rules, then assigns fwm values for routing
within LVS - For whatever reason, this still requires the IP to be on
the box receiving the packet through LVS, even if the mangle rule is
triggered.
We've had a couple of instances for two IPs in this configuration where
Pacemaker (and syslog) indicate the IP is assigned to box 01, yet the
firewall receives an ARP reply from box 02. Didn't believe it at first
until we grabbed packets from a SPAN on the switches. Correct IP address
in reply, MAC of one of the bonded interfaces on box 02, yet the IP
isn't on it.
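
The comparison described above (IP in the ARP reply versus the MAC that
sent it) can be sketched in Python. This is only an illustration under
assumptions: frames are raw Ethernet bytes already pulled out of a
capture, and the function names, MACs and addresses below are
hypothetical placeholders, not values from this environment.

```python
import struct


def parse_arp_reply(frame: bytes):
    """Return (sender_mac, sender_ip) for an Ethernet/IPv4 ARP reply, else None."""
    if len(frame) < 42:  # 14-byte Ethernet header + 28-byte ARP payload
        return None
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != 0x0806:  # not ARP
        return None
    htype, ptype, hlen, plen, oper = struct.unpack("!HHBBH", frame[14:22])
    # Only Ethernet (1) / IPv4 (0x0800) replies (opcode 2) are of interest here.
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4) or oper != 2:
        return None
    sender_mac = ":".join(f"{b:02x}" for b in frame[22:28])
    sender_ip = ".".join(str(b) for b in frame[28:32])
    return sender_mac, sender_ip


def answered_by_wrong_node(frame: bytes, expected_owners: dict) -> bool:
    """True if an ARP reply for a known IP came from a MAC outside the expected set.

    expected_owners is a hypothetical mapping of IP -> set of MACs belonging
    to the node that the cluster says currently holds that IP.
    """
    parsed = parse_arp_reply(frame)
    if parsed is None:
        return False
    mac, ip = parsed
    owners = expected_owners.get(ip)
    if owners is None:  # IP we are not tracking
        return False
    return mac not in owners
```

Feeding each captured frame through answered_by_wrong_node, with the map
built from whatever the cluster reports as the current owner, would flag
replies like the ones we saw from box 02.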
We've experienced both 01 arping for an IP on 02, and 02 arping for an
IP on 01. Last night when we had the issue, an IP was on 02, 01 arped
for it and I tcpdumped on 01 and saw SYN packets coming in for the IP on
01 - Makes sense, but doesn't explain why the box answered the arp in
the first place.
I realize this likely isn't a Pacemaker issue, but I was hoping someone
else might have experienced a similar issue, or can at least point me in
the right direction. We have a far more complex Pacemaker/LVS
environment on our inside network (which isn't link-local to the ASA -
goes through an inside router) which works flawlessly, so I'm open to
the fact that something is totally screwed up in our DMZ.
Sorry that was long. :)