[Pacemaker] Occasional nonsensical resource agent errors, redux

Andrei Borzenkov arvidjaar at gmail.com
Mon Nov 3 17:06:58 CET 2014


On Mon, 3 Nov 2014 15:26:34 +0100
Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:

> Hi,
> 
> On Mon, Nov 03, 2014 at 08:46:00AM +0300, Andrei Borzenkov wrote:
> > On Mon, 3 Nov 2014 13:32:45 +1100
> > Andrew Beekhof <andrew at beekhof.net> wrote:
> > 
> > > 
> > > > On 1 Nov 2014, at 11:03 pm, Patrick Kane <pmk at wawd.com> wrote:
> > > > 
> > > > Hi all:
> > > > 
> > > > In July, list member Ken Gaillot reported occasional nonsensical resource agent errors using Pacemaker (http://oss.clusterlabs.org/pipermail/pacemaker/2014-July/022231.html).
> > > > 
> > > > We're seeing similar issues with our install.  We have a two-node corosync/pacemaker failover configuration that uses the ocf:heartbeat:IPaddr2 resource agent extensively.  About once a week, we get an error like this, out of the blue:
> > > > 
> > > >   Nov  1 05:23:57 lb02 IPaddr2(anon_ip)[32312]: ERROR: Setup problem: couldn't find command: ip
> > > > 
> > > > It goes without saying that the ip command hasn't gone anywhere and all the paths are configured correctly.
> > > > 
> > > > We're currently running 1.1.10-14.el6_5.3-368c726 under CentOS 6 x86_64 inside of a xen container.
> > > > 
> > > > Any thoughts from folks on what might be happening or how we can get additional debug information to help figure out what's triggering this?
> > > 
> > > It's pretty much in the hands of the agent.
> > 
> > Actually, the message seems to be output by the check_binary()
> > function, which is part of the framework.
> 
> Someone complained on IRC about this issue (another resource
> agent though, I think Xen) and they said that which(1) was not
> able to find the program. I'd suggest running strace (or ltrace)
> on which(1) at that point (the call is in ocf-shellfuncs).
> 
> The which(1) utility is a simple tool: it splits the PATH
> environment variable and stats the program name appended to each
> of the paths. So either PATH is somehow corrupted or the
> filesystem is misbehaving. My guess is that it's the former.
> 
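
The lookup described above can be sketched in plain shell, which makes
it possible to see which PATH entry is at fault. This is only a rough
approximation of what which(1) does, not its actual implementation; the
function name trace_path_lookup is made up for illustration.

```shell
# Hypothetical helper (not part of ocf-shellfuncs): walk PATH the way
# which(1) roughly does and report what is found in each directory.
# The body runs in a subshell so the IFS and globbing changes do not leak.
trace_path_lookup() (
    bin=$1
    found=1
    IFS=:
    set -f                        # do not glob-expand PATH entries
    for dir in $PATH; do
        [ -n "$dir" ] || dir=.    # an empty entry means the current directory
        candidate=$dir/$bin
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            echo "found: $candidate"
            found=0
            break
        elif [ -e "$candidate" ]; then
            echo "not executable: $candidate"
        elif [ -d "$dir" ]; then
            echo "missing: $candidate"
        else
            echo "bad PATH entry: $dir"
        fi
    done
    exit "$found"
)
```

Calling "trace_path_lookup ip" at the moment of failure would show
whether PATH itself is damaged ("bad PATH entry") or whether the
directories are fine and the file lookup is what fails.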

Since it is called quite often, I'd instrument have_binary to dump the
whole environment and all shell variables when "which" fails for a
binary that is known to exist, and to rerun "which" under strace at
that point. Running it under strace every time would probably produce
far too much output.
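
Something along these lines, as a minimal sketch. It does not reproduce
the real have_binary from ocf-shellfuncs; the wrapper name
have_binary_debug, the HAVE_BINARY_DEBUG_LOG variable and the default
log path are all assumptions chosen for illustration.

```shell
# Hypothetical debugging wrapper; the name and log location are assumptions.
HAVE_BINARY_DEBUG_LOG="${HAVE_BINARY_DEBUG_LOG:-/tmp/have_binary_debug.log}"

have_binary_debug() {
    hbd_bin=$1
    # Normal case: the lookup succeeds and nothing is logged.
    if hbd_path=$(which "$hbd_bin" 2>/dev/null) && [ -x "$hbd_path" ]; then
        return 0
    fi
    # Failure case: record as much state as possible, once.
    {
        echo "=== $(date) pid $$: which failed for '$hbd_bin' ==="
        echo "PATH=$PATH"
        echo "--- environment ---"
        env
        echo "--- shell variables ---"
        set
        # strace is located via PATH too, so this step may be skipped
        # if PATH itself is what got corrupted.
        if command -v strace >/dev/null 2>&1; then
            echo "--- strace of which ---"
            strace -f which "$hbd_bin"
        fi
    } >>"$HAVE_BINARY_DEBUG_LOG" 2>&1
    return 1
}
```

Since the dump only happens on failure, the log stays empty until the
problem actually shows up, even if the check runs on every monitor.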

> BTW, was there an upgrade of some kind before this started
> happening?
> 
> Thanks,
> 
> Dejan
> 
> > > You could perhaps find the call that looks for ip and wrap it in a set -x/set +x block.
> > > That way you'd know exactly why it thinks the binary is missing.
> > > _______________________________________________
> > > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > > 
> > > Project Home: http://www.clusterlabs.org
> > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > Bugs: http://bugs.clusterlabs.org



