[Pacemaker] Occasional nonsensical resource agent errors, redux

Dejan Muhamedagic dejanmm at fastmail.fm
Mon Nov 3 15:26:34 CET 2014


Hi,

On Mon, Nov 03, 2014 at 08:46:00AM +0300, Andrei Borzenkov wrote:
> On Mon, 3 Nov 2014 13:32:45 +1100,
> Andrew Beekhof <andrew at beekhof.net> wrote:
> 
> > 
> > > On 1 Nov 2014, at 11:03 pm, Patrick Kane <pmk at wawd.com> wrote:
> > > 
> > > Hi all:
> > > 
> > > In July, list member Ken Gaillot reported occasional nonsensical resource agent errors using Pacemaker (http://oss.clusterlabs.org/pipermail/pacemaker/2014-July/022231.html).
> > > 
> > > We're seeing similar issues with our install.  We have a 2 node corosync/pacemaker failover configuration that is using the ocf:heartbeat:IPaddr2 resource agent extensively.  About once a week, we'll get an error like this, out of the blue:
> > > 
> > >   Nov  1 05:23:57 lb02 IPaddr2(anon_ip)[32312]: ERROR: Setup problem: couldn't find command: ip
> > > 
> > > It goes without saying that the ip command hasn't gone anywhere and all the paths are configured correctly.
> > > 
> > > We're currently running 1.1.10-14.el6_5.3-368c726 under CentOS 6 x86_64 inside of a xen container.
> > > 
> > > Any thoughts from folks on what might be happening or how we can get additional debug information to help figure out what's triggering this?
> > 
> > It's pretty much in the hands of the agent.
> 
> Actually, the message seems to be output by the check_binary() function,
> which is part of the framework.

Someone complained on IRC about this issue (with another resource
agent, though; I think Xen), and they said that which(1) was not
able to find the program. I'd suggest running strace (or ltrace)
on which(1) at that point (the call is in ocf-shellfuncs).

The which(1) utility is a simple tool: it splits the PATH
environment variable on colons and stats the program name appended
to each of the resulting directories. Is PATH somehow corrupted, or
is the filesystem misbehaving? My guess is that it's the former.
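
To tell the two cases apart without strace, the lookup can be
approximated by hand. The following is only a minimal sketch of what
which(1) does, not its actual implementation; trace_path_lookup is a
made-up name and is not part of ocf-shellfuncs. It prints the PATH it
sees and reports, for each directory, whether an executable file of
that name is there:

```shell
# Hypothetical helper (not part of ocf-shellfuncs): walk PATH by hand,
# roughly the way which(1) does, and report every directory tried.
trace_path_lookup() {
    prog=$1
    found=1                       # exit status: 1 until a match is seen
    printf 'PATH=%s\n' "$PATH"

    # Append a colon so the loop also sees the last PATH element.
    rest=$PATH:
    while [ -n "$rest" ]; do
        dir=${rest%%:*}           # text before the first colon
        rest=${rest#*:}           # text after the first colon
        [ -n "$dir" ] || dir=.    # an empty element means the current dir

        if [ -f "$dir/$prog" ] && [ -x "$dir/$prog" ]; then
            printf 'hit: %s/%s\n' "$dir" "$prog"
            found=0
        else
            printf 'miss: %s/%s\n' "$dir" "$prog"
        fi
    done
    return $found
}
```

Called as "trace_path_lookup ip" just before the failing check, the
output should show whether PATH itself is wrong or whether a directory
that ought to contain ip reports a miss.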

BTW, was there an upgrade of some kind before this started
happening?

Thanks,

Dejan

> > You could perhaps find the call that looks for ip and wrap it in a set -x/set +x block.
> > That way you'd know exactly why it thinks the binary is missing.
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > 
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> 
> 


