[Pacemaker] nfs4 cluster fail-over stops working once I introduce ipaddr2 resource

David Vossel dvossel at redhat.com
Fri Feb 14 16:33:21 EST 2014

----- Original Message -----
> From: "Dennis Jacobfeuerborn" <dennisml at conversis.de>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Thursday, February 13, 2014 11:18:04 PM
> Subject: Re: [Pacemaker] nfs4 cluster fail-over stops working once I introduce ipaddr2 resource
> 
> On 14.02.2014 02:50, Dennis Jacobfeuerborn wrote:
> > Hi,
> > I'm still working on my NFSv4 cluster and things are working as
> > expected...as long as I don't add an IPAddr2 resource.
> >
> > The DRBD, filesystem and exportfs resources work fine and when I put the
> > active node into standby everything fails over as expected.
> >
> > Once I add a VIP as an IPAddr2 resource, however, I seem to get monitor
> > problems with the p_exportfs_root resource.
> >
> > I've attached the configuration, status and a log file.
> >
> > The transition status is from a moment after I take nfs1
> > (192.168.100.41) offline. It looks like the stopping of p_ip_nfs does
> > something to the p_exportfs_root resource, although I have no idea what
> > that could be.
> >
> > The final status is from after the cluster has settled. The
> > fail-over finished, but the failed action is still present and cannot be
> > cleared with a "crm resource cleanup p_exportfs_root".
> >
> > The log is the result of a "tail -f" on the corosync.log from the moment
> > before I issued the "crm node standby nfs1" to when the cluster has
> > settled.
> >
> > Does anybody know what the issue could be here? At first I thought that
> > using a VIP from the same network as the cluster nodes could be an issue,
> > but when I change this to use an IP in a different network
> > (192.168.101.43/24), the same thing happens.
> >
> > The moment I remove p_ip_nfs from the configuration again, fail-over back
> > and forth works without a hitch.
> 
> So after a lot of digging I think I've pinpointed the issue: a race between
> the monitor and stop actions of the exportfs resource script.
> 
> When "wait_for_leasetime_on_stop" is set the following happens for the
> stop action and in this specific order:
> 
> 1. The directory is unexported.
> 2. Sleep for the NFS lease time + 2 seconds.
> 
> The problem seems to be that during the sleep phase the monitor
> action is still invoked, and since the directory has already been
> unexported it reports a failure.
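As I read it, the interaction you're describing would look roughly like
this (a condensed sketch, not the actual ocf:heartbeat:exportfs code; the
OCF_RESKEY_* parameters and OCF_* return codes come from the OCF
environment, the function bodies are boiled down):

    # stop: unexport first, then wait out the NFSv4 lease time
    exportfs_stop() {
        exportfs -u "${OCF_RESKEY_clientspec}:${OCF_RESKEY_directory}"
        if [ "${OCF_RESKEY_wait_for_leasetime_on_stop}" = "true" ]; then
            sleep $(( $(cat /proc/fs/nfsd/nfsv4leasetime) + 2 ))
        fi
        return $OCF_SUCCESS
    }

    # monitor: checks the export list, so while stop is still sleeping the
    # directory is already gone and this would report the resource as failed
    exportfs_monitor() {
        exportfs | grep -q "${OCF_RESKEY_directory}" || return $OCF_NOT_RUNNING
        return $OCF_SUCCESS
    }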
> 
> Once I add enabled="false" to the monitor operation of the exportfs
> resource, the problem disappears.
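In crm shell syntax that presumably looks something like the following
(the interval and the exportfs parameter values here are placeholders,
since the real ones are only in your attached config):

    primitive p_exportfs_root ocf:heartbeat:exportfs \
        params fsid="0" directory="/srv/nfs" clientspec="192.168.100.0/24" \
               wait_for_leasetime_on_stop="true" \
        op monitor interval="30s" enabled="false"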
> 
> The question is how to ensure that the monitor action is not called
> while the stop action is still sleeping.
> 
> Would it be a solution to create a lock file for the duration of the
> sleep and check for that lock file in the monitor action?
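A rough sketch of that idea (the lock file path and the guard are made up
for illustration; nothing like this exists in the shipped agent):

    LOCKFILE="${HA_RSCTMP}/${OCF_RESOURCE_INSTANCE}.stopping"

    exportfs_stop() {
        touch "$LOCKFILE"
        # ... unexport and sleep for the lease time as before ...
        rm -f "$LOCKFILE"
        return $OCF_SUCCESS
    }

    exportfs_monitor() {
        # a stop is in flight: report "still running" instead of a failure
        [ -f "$LOCKFILE" ] && return $OCF_SUCCESS
        # ... normal export check as before ...
    }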
> 
> I'm not 100% sure if this analysis is correct because if monitoring

Right, I doubt that is happening; the recurring monitor should already be
cancelled before the stop action is run.

What happens if you put the IP before the NFS resources in the group?

group g_nfs p_ip_nfs p_fs_data p_exportfs_root p_exportfs_data
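Group members start in the listed order and stop in reverse, so with
p_ip_nfs first the VIP is the last thing stopped on standby. You can
watch that with the same commands from your test, e.g.:

    crm node standby nfs1
    crm_mon -1        # exports and filesystem stop first, the VIP last
    crm node online nfs1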

Without DRBD, I have an active/passive NFS server scenario that I test here and that works for me: https://github.com/davidvossel/phd/blob/master/scenarios/nfs-basic.scenario. I'm using the actual nfsserver OCF script from the latest resource-agents GitHub branch.

-- Vossel


> calls are still made while the stop action is running, this sounds
> inherently racy and would probably be an issue for almost all resource
> scripts, not just exportfs.
> 
> Regards,
>    Dennis
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 



