[Pacemaker] NFS failover time too long when routed (not switched)

Tue Jun 26 13:39:09 CEST 2012

Gentlemen,
I have trouble with NFS failover times when the NFS clients are not
located in the same network as the HA NFS server (the connection
between the NFS client and the NFS server is routed, not switched).

The HA NFS server is a Pacemaker+Corosync cluster which consists of 2
NFS nodes (active/passive using ocf:heartbeat:exportfs) plus one
quorum node. All nodes are in the same network / VLAN.

Here is my Pacemaker configuration: http://pastebin.com/WiAmQQwX

My NFS clients are in several different networks / VLANs. All these
VLANs are managed by / connected to one single L3 switch - Cisco
Catalyst 6509.

I'm testing failover with two NFS clients:
- client-A, with an address in the same VLAN as the HA NFS server address.
- client-B, with an address in a different VLAN than the HA NFS server address.

When I "crm resource migrate" my nfs group from the active cluster
node (irvine) to the second node (mitchell)
(the group consists of ocf:heartbeat:LVM, ocf:heartbeat:Filesystem,
ocf:heartbeat:exportfs and ocf:heartbeat:IPaddr2), I see:

- the I/O operation on client-A hangs for 10 - 15 seconds, then keeps
working. Client-A is in the same VLAN as the NFS server.
- the I/O operation on client-B hangs for about 5 minutes (sic!), then
keeps working. Client-B is in a different VLAN, so it gets routed
(within the same switch) to reach the NFS server.

The times above are for clients using an NFSv3 mount.

If I mount NFSv4, client B hangs for 60 seconds (as opposed to 5
minutes when using NFSv3).
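
For anyone who wants to reproduce these numbers, the hang can be timed rather than eyeballed. Below is a minimal sketch, assuming a hard-mounted NFS export at a hypothetical path such as /mnt/nfs on the client. The names probe and longest_gap are illustrative and not part of any tool mentioned here. It appends to a file in a loop and reports the longest interval between two completed writes, which approximates the failover stall.

```python
import os
import time


def longest_gap(timestamps):
    """Return the largest interval between consecutive timestamps (0.0 if fewer than two)."""
    return max((b - a for a, b in zip(timestamps, timestamps[1:])), default=0.0)


def probe(path, duration=600.0, interval=0.5):
    """Append one byte to `path` every `interval` seconds for `duration` seconds.

    Each write is fsync'ed so that it has to reach the server; on a hard
    mount the call blocks while the server is unreachable. Returns the
    monotonic time at which each write completed.
    """
    stamps = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        with open(path, "a") as fh:
            fh.write("x")
            fh.flush()
            os.fsync(fh.fileno())
        stamps.append(time.monotonic())
        time.sleep(interval)
    return stamps


# Example use (hypothetical mount point; start it, then migrate the group):
#   stamps = probe("/mnt/nfs/failover-probe.tmp")
#   print(f"longest stall: {longest_gap(stamps):.1f} s")
```

The reported gap includes the sleep interval, so a healthy mount shows roughly 0.5 s with the defaults.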

Yes, you would say this must be the ARP cache.

The MYSTERY here is:

within these 5 minutes when client-B is hanging, starting at the
moment I moved the nfs group from node irvine to node mitchell:
- when I jump on client-B (10.0.12.132) and run "ssh 10.0.10.103" (the
floating NFS server address managed by the cluster), I reach the right
node (mitchell)
- when I run "tcpdump -vv client-B" on irvine, which does not have the
"10.0.10.103" floating address anymore (it got moved to mitchell), I
can see:

 [root@irvine ~]#  tcpdump -nvv host 10.0.10.103
 16:52:09.487087 IP (tos 0x0, ttl 63, id 34159, offset 0, flags [DF], proto TCP (6), length 176)
     10.0.12.132.2324775663 > 10.0.10.103.2049: 120 getattr fh Unknown/01000101650000006D0F0A000C38CF534F1F2A5F000000004FC6F267337BD97D

That tells me the L3 switch (Cisco 6509) somehow remembers that there
is an established connection between its two ports (to explain the
tcpdump),
but when a new connection is made (my ssh 10.0.10.103), it does a proper

When I jump on the switch and run:
#show ip arp 10.0.10.103
 Protocol  Address          Age (min)  Hardware Addr   Type   Interface
 Internet  10.0.10.103             1   b499.bab7.01ca  ARPA   Vlan10

I get the MAC address of mitchell (good), but I can still see some
packets for "10.0.10.103" in tcpdump on irvine (above).
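
This ARP check can be scripted when the failover test is repeated many times. A minimal sketch, assuming the switch output has already been captured as text by some other means (how it is collected is left out): it parses "show ip arp" rows of the form shown above and normalizes MAC notation so that the Cisco dotted form can be compared with the colon form Linux prints. The function names are hypothetical.

```python
def parse_cisco_arp(output):
    """Map IP address -> MAC from `show ip arp` text; header and prompt lines are skipped."""
    table = {}
    for line in output.splitlines():
        fields = line.split()
        # Data rows look like: Internet <ip> <age> <mac> ARPA <interface>
        if len(fields) >= 6 and fields[0] == "Internet":
            table[fields[1]] = fields[3].lower()
    return table


def normalize_mac(mac):
    """Strip separators so b499.bab7.01ca and b4:99:ba:b7:01:ca compare equal."""
    return mac.lower().replace(".", "").replace(":", "").replace("-", "")


# Example with the row from this post:
#   table = parse_cisco_arp(captured_text)
#   normalize_mac(table["10.0.10.103"]) == normalize_mac("b4:99:ba:b7:01:ca")
```

This only confirms what the switch's ARP table says. It does not show where the switch actually forwards packets of an already established flow, which is the part that remains unexplained.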

Can anybody point me in the right direction, please, on how to solve the
5-minute NFSv3 hang window mystery?
I suspect the reason NFSv4 clients hang for only 60 seconds is that NFSv4
uses only one port and TCP, while NFSv3 uses both UDP and TCP and many
ports?
Still, 60 seconds is too long, and I need NFSv3 too.

Thank you very much.
Marji
--
Marji Cermak, Drupal Systems Engineer, cermakm [at] gmail, www.marginaldrop.com