[Pacemaker] strange Failover due to sudden "Argument list too long" error on all resource agents

Fri May 7 13:08:37 EDT 2010

On Fri, May 07, 2010 at 03:03:39PM +0200, Dejan Muhamedagic wrote:
> Hi,
> 
> On Fri, May 07, 2010 at 12:35:59PM +0200, Fabian Ruff wrote:
> > Hi,
> > 
> > I'm currently testing a 2-node HA firewall with pacemaker+corosync
> > on Debian Lenny.
> > I used the latest package from the madkiss repo for the setup
> > (corosync 1.2.0, pacemaker 1.0.8).
> > 
> > I will spare you all the verbose config for now and just give you an
> > overview of the resource configuration:
> > 
> > >gwa:~# crm_mon -1
> > >============
> > >Last updated: Fri May  7 12:10:19 2010
> > >Stack: openais
> > >Current DC: gwb - partition with quorum
> > >Version: 1.0.8-2c98138c2f070fcb6ddeab1084154cffbf44ba75
> > >2 Nodes configured, 2 expected votes
> > >6 Resources configured.
> > >============
> > >
> > >Online: [ gwa gwb ]
> > >
> > > Master/Slave Set: drbd_disk
> > >     Masters: [ gwa ]
> > >     Slaves: [ gwb ]
> > > Clone Set: connectivity
> > >     Started: [ gwb gwa ]
> > > fencing_gwa	(stonith:external/ipmi):	Started gwb
> > > fencing_gwb	(stonith:external/ipmi):	Started gwa
> > > Resource Group: ips
> > >     ip_outside	(ocf::heartbeat:IPaddr2):	Started gwa
> > >     ip_backup	(ocf::heartbeat:IPaddr2):	Started gwa
> > >     ip_secure	(ocf::heartbeat:IPaddr2):	Started gwa
> > >     ip_inside	(ocf::heartbeat:IPaddr2):	Started gwa
> > >     ip_staging	(ocf::heartbeat:IPaddr2):	Started gwa
> > >     firewall	(lsb:firewall):	Started gwa
> > > Resource Group: services
> > >     filesystem	(ocf::heartbeat:Filesystem):	Started gwa
> > >     openvpn	(lsb:openvpn-cluster):	Started gwa
> > >     dnsmasq	(lsb:dnsmasq):	Started gwa
> > 
> > 
> > The cluster was running fairly stable for the past couple of weeks.
> > 
> > But then yesterday, without any user interaction and while idle, the
> > active node (gwa) failed and was subsequently stonithed by the
> > passive one (gwb) due to a strange error (at least to me) on almost
> > all resource agents:
> > 
> > >gwa:~# grep -i error /var/log/syslog-20100507
> > >May  6 14:13:23 gwa lrmd: [27931]: ERROR: (raexecocf.c:execra:178) execl failed for /usr/lib/ocf/resource.d//heartbeat/IPaddr2: Argument list too long

man execve:
       E2BIG  The total number of bytes in the environment (envp) and argument list (argv) is too large.

line (raexecocf.c:execra:178) is
execl(ra_pathname, ra_pathname, op_type, (const char *)NULL);

so it is NOT the argument list, even though perror seems to
think that's the more likely cause for this error,
unless "op_type" happens to be an unterminated multi-kB string somehow
(we know what ra_pathname is from the perror message).
That leaves the environment (envp) as the suspect.

Does lrmd accumulate setenv() calls somehow?
Or did crmd send too many parameters?
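
To illustrate, here is a minimal sketch (not lrmd code) of how execve()
can fail with E2BIG although argv is tiny, purely because the environment
is oversized. The helper name exec_errno_with_env, the use of /bin/sh as
the exec target, and the sizes are my own assumptions for the demo:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of lrmd): fork a child that exec()s
 * /bin/sh with a short, fixed argv and ONE environment variable that is
 * env_bytes long. Returns 0 if the exec succeeded, the child's errno if
 * execve() failed, or -1 on a setup error.
 */
int exec_errno_with_env(size_t env_bytes)
{
    int pipefd[2];

    if (env_bytes < 4 || pipe(pipefd) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: the write end closes by itself if the exec succeeds. */
        close(pipefd[0]);
        fcntl(pipefd[1], F_SETFD, FD_CLOEXEC);

        int err = ENOMEM;
        char *big = malloc(env_bytes + 1);
        if (big != NULL) {
            memset(big, 'x', env_bytes);
            memcpy(big, "BIG=", 4);          /* "BIG=xxxx..." */
            big[env_bytes] = '\0';

            char *argv[] = { "/bin/sh", "-c", ":", NULL };  /* tiny argv */
            char *envp[] = { big, NULL };                   /* huge envp */
            execve("/bin/sh", argv, envp);
            err = errno;                     /* only reached on failure */
        }
        if (write(pipefd[1], &err, sizeof err) < 0)
            _exit(126);
        _exit(127);
    }

    /* Parent: EOF without data means the exec went through. */
    close(pipefd[1]);
    int err = 0;
    ssize_t n = read(pipefd[0], &err, sizeof err);
    close(pipefd[0]);
    waitpid(pid, NULL, 0);
    return n == (ssize_t)sizeof err ? err : 0;
}
```

On Linux I would expect exec_errno_with_env(64) to return 0 and
exec_errno_with_env((size_t)4 << 20) to return E2BIG, because the argv is
the same in both calls and only the environment grows. The exact limits
depend on the kernel and on the stack rlimit, so treat the 4 MiB figure
as a value that is merely "large enough" on common setups.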

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.