[Pacemaker] Strange failover due to sudden "Argument list too long" error on all resource agents

Mon May 10 09:05:15 UTC 2010

Hi,

On Fri, May 07, 2010 at 07:08:37PM +0200, Lars Ellenberg wrote:
> On Fri, May 07, 2010 at 03:03:39PM +0200, Dejan Muhamedagic wrote:
> > Hi,
> > 
> > On Fri, May 07, 2010 at 12:35:59PM +0200, Fabian Ruff wrote:
> > > Hi,
> > > 
> > > I'm currently testing a 2-node HA firewall with pacemaker+corosync
> > > on Debian Lenny.
> > > I used the latest package from the madkiss repo for the setup
> > > (corosync 1.2.0, pacemaker 1.0.8).
> > > 
> > > I will spare you all the verbose config for now and just give you an
> > > overview of the resource configuration:
> > > 
> > > >gwa:~# crm_mon -1
> > > >============
> > > >Last updated: Fri May  7 12:10:19 2010
> > > >Stack: openais
> > > >Current DC: gwb - partition with quorum
> > > >Version: 1.0.8-2c98138c2f070fcb6ddeab1084154cffbf44ba75
> > > >2 Nodes configured, 2 expected votes
> > > >6 Resources configured.
> > > >============
> > > >
> > > >Online: [ gwa gwb ]
> > > >
> > > > Master/Slave Set: drbd_disk
> > > >     Masters: [ gwa ]
> > > >     Slaves: [ gwb ]
> > > > Clone Set: connectivity
> > > >     Started: [ gwb gwa ]
> > > > fencing_gwa	(stonith:external/ipmi):	Started gwb
> > > > fencing_gwb	(stonith:external/ipmi):	Started gwa
> > > > Resource Group: ips
> > > >     ip_outside	(ocf::heartbeat:IPaddr2):	Started gwa
> > > >     ip_backup	(ocf::heartbeat:IPaddr2):	Started gwa
> > > >     ip_secure	(ocf::heartbeat:IPaddr2):	Started gwa
> > > >     ip_inside	(ocf::heartbeat:IPaddr2):	Started gwa
> > > >     ip_staging	(ocf::heartbeat:IPaddr2):	Started gwa
> > > >     firewall	(lsb:firewall):	Started gwa
> > > > Resource Group: services
> > > >     filesystem	(ocf::heartbeat:Filesystem):	Started gwa
> > > >     openvpn	(lsb:openvpn-cluster):	Started gwa
> > > >     dnsmasq	(lsb:dnsmasq):	Started gwa
> > > 
> > > 
> > > The cluster was running fairly stable for the past couple of weeks.
> > > 
> > > But then yesterday, without any user interaction and while idle, the
> > > active node (gwa) failed and was subsequently stonithed by the
> > > passive one (gwb) due to a strange error (at least to me) on almost
> > > all resource agents:
> > > 
> > > >gwa:~# grep -i error /var/log/syslog-20100507
> > > >May  6 14:13:23 gwa lrmd: [27931]: ERROR: (raexecocf.c:execra:178) execl failed for /usr/lib/ocf/resource.d//heartbeat/IPaddr2: Argument list too long
> 
> man execve:
>        E2BIG  The total number of bytes in the environment (envp) and argument list (argv) is too large.
> 
> line (raexecocf.c:execra:178) is
> execl(ra_pathname, ra_pathname, op_type, (const char *)NULL);
> 
> so it is NOT the argument list, even though the perror message
> suggests that as the more likely cause of this error,
> unless "op_type" happens to be an unterminated multi-kB string somehow
> (we know what ra_pathname is from the perror message).
> 
> Does lrmd accumulate setenv() somehow?

No, I don't think so. The set of environment variables is limited
to what is provided by the client in the message.
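
For illustration only, here is a minimal sketch of how execve() can fail
with E2BIG even though argv is tiny, simply because the environment is
too large. The helper name exec_errno_with_env() and its parameters are
hypothetical and are not part of lrmd; the sketch assumes a POSIX system
and reports the child's errno back through a close-on-exec pipe.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not lrmd code): exec argv[0] in a child process
 * with an environment of `nvars` strings of `var_len` bytes each.
 * Returns 0 if the exec succeeded, the child's errno if it failed,
 * or -1 if the experiment itself could not be set up.
 */
int exec_errno_with_env(char *const argv[], size_t nvars, size_t var_len)
{
    int fds[2];
    int err = 0;
    int result = -1;
    char **envp = NULL;
    size_t i;
    ssize_t n;
    pid_t pid;

    if (var_len < 2 || pipe(fds) != 0)
        return -1;

    /* The write end closes by itself if the exec succeeds. */
    if (fcntl(fds[1], F_SETFD, FD_CLOEXEC) != 0)
        goto out;

    /* Build a NULL-terminated envp of "V=xxxx..." strings. */
    envp = calloc(nvars + 1, sizeof *envp);
    if (envp == NULL)
        goto out;
    for (i = 0; i < nvars; i++) {
        envp[i] = malloc(var_len + 1);
        if (envp[i] == NULL)
            goto out;
        memset(envp[i], 'x', var_len);
        envp[i][0] = 'V';
        envp[i][1] = '=';
        envp[i][var_len] = '\0';
    }

    pid = fork();
    if (pid < 0)
        goto out;
    if (pid == 0) {
        close(fds[0]);
        execve(argv[0], argv, envp);
        err = errno;                 /* only reached if execve failed */
        if (write(fds[1], &err, sizeof err) < 0)
            _exit(126);
        _exit(127);
    }

    close(fds[1]);
    fds[1] = -1;
    n = read(fds[0], &err, sizeof err);
    if (n == 0)
        result = 0;                  /* EOF: the exec went through */
    else if (n == (ssize_t)sizeof err)
        result = err;                /* errno reported by the child */
    waitpid(pid, NULL, 0);

out:
    if (envp != NULL) {
        for (i = 0; i < nvars; i++)
            free(envp[i]);
        free(envp);
    }
    close(fds[0]);
    if (fds[1] >= 0)
        close(fds[1]);
    return result;
}
```

Called with an argv such as { "/bin/sh", "-c", "exit 0", NULL }, a
handful of short variables should give 0, while something like 256
variables of 64 kB each is expected to give E2BIG on a typical Linux
system. The exact limit depends on the kernel and the stack rlimit.
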

> Or crmd sent too many parameters?

If that's the case, then there is perhaps memory corruption in
crmd. Or the messaging layer. One unusual thing about the
configuration is that, obviously by accident, the start operation
on two fencing resources had a non-zero interval.
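
As a rough way to test the "too many parameters" theory, one could add
up the bytes that execve() has to copy and compare them with the
system limit. Resource parameters reach an OCF agent as OCF_RESKEY_*
environment variables, so an oversized parameter set would show up in
envp. The sketch below is only an assumption about how such a check
might look; vector_bytes() and exceeds_arg_max() are hypothetical names
and nothing like this exists in lrmd as far as this thread shows.

```c
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/*
 * Bytes taken by the strings of one NULL-terminated vector (argv or
 * envp), counting each string's terminating NUL. A NULL vector counts
 * as empty.
 */
size_t vector_bytes(char *const vec[])
{
    size_t total = 0;
    size_t i;

    for (i = 0; vec != NULL && vec[i] != NULL; i++)
        total += strlen(vec[i]) + 1;
    return total;
}

/*
 * Returns 1 if argv plus envp are larger than sysconf(_SC_ARG_MAX),
 * 0 if they fit, and -1 if the limit cannot be determined.
 * The kernel may apply further limits (for example per string), so a
 * result of 0 is not a guarantee that the exec will succeed.
 */
int exceeds_arg_max(char *const argv[], char *const envp[])
{
    long limit = sysconf(_SC_ARG_MAX);

    if (limit < 0)
        return -1;
    return vector_bytes(argv) + vector_bytes(envp) > (size_t)limit;
}
```

Logging vector_bytes(envp) just before the exec would show whether the
environment handed to the resource agent had grown unexpectedly.
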

Thanks,

Dejan

> -- 
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
> 
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.