[Pacemaker] strange Failover due to sudden "Argument list to long" error on all resource agents
Lars Ellenberg
lars.ellenberg at linbit.com
Fri May 7 17:08:37 UTC 2010
On Fri, May 07, 2010 at 03:03:39PM +0200, Dejan Muhamedagic wrote:
> Hi,
>
> On Fri, May 07, 2010 at 12:35:59PM +0200, Fabian Ruff wrote:
> > Hi,
> >
> > I'm currently testing a 2-node HA-Firewall with pacemaker+corosync
> > on Debian Lenny.
> > I used the latest package from the madkiss repo for the setup
> > (corosync 1.2.0, pacemaker 1.0.8).
> >
> > I will spare you all the verbose config for now and just give you an
> > overview of the resource configuration:
> >
> > >gwa:~# crm_mon -1
> > >============
> > >Last updated: Fri May 7 12:10:19 2010
> > >Stack: openais
> > >Current DC: gwb - partition with quorum
> > >Version: 1.0.8-2c98138c2f070fcb6ddeab1084154cffbf44ba75
> > >2 Nodes configured, 2 expected votes
> > >6 Resources configured.
> > >============
> > >
> > >Online: [ gwa gwb ]
> > >
> > > Master/Slave Set: drbd_disk
> > > Masters: [ gwa ]
> > > Slaves: [ gwb ]
> > > Clone Set: connectivity
> > > Started: [ gwb gwa ]
> > > fencing_gwa (stonith:external/ipmi): Started gwb
> > > fencing_gwb (stonith:external/ipmi): Started gwa
> > > Resource Group: ips
> > > ip_outside (ocf::heartbeat:IPaddr2): Started gwa
> > > ip_backup (ocf::heartbeat:IPaddr2): Started gwa
> > > ip_secure (ocf::heartbeat:IPaddr2): Started gwa
> > > ip_inside (ocf::heartbeat:IPaddr2): Started gwa
> > > ip_staging (ocf::heartbeat:IPaddr2): Started gwa
> > > firewall (lsb:firewall): Started gwa
> > > Resource Group: services
> > > filesystem (ocf::heartbeat:Filesystem): Started gwa
> > > openvpn (lsb:openvpn-cluster): Started gwa
> > > dnsmasq (lsb:dnsmasq): Started gwa
> >
> >
> > The cluster was running fairly stable for the past couple of weeks.
> >
> > But then yesterday, without any user interaction and while idle, the
> > active node (gwa) failed and was subsequently stonithed by the
> > passive one (gwb) due to a strange error (at least to me) on almost
> > all resource agents:
> >
> > >gwa:~# grep -i error /var/log/syslog-20100507
> > >May 6 14:13:23 gwa lrmd: [27931]: ERROR: (raexecocf.c:execra:178) execl failed for /usr/lib/ocf/resource.d//heartbeat/IPaddr2: Argument list too long
man execve:
E2BIG The total number of bytes in the environment (envp) and argument list (argv) is too large.
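To make "total number of bytes" concrete, here is a rough sketch of how one might add up what a given argv/envp pair contributes. The function names vector_bytes and exec_payload_bytes are made up for illustration; they are not part of lrmd or any library. The kernel's real accounting may also include pointer and alignment overhead, so treat the result as a lower bound.

```c
#include <stddef.h>
#include <string.h>

/* Bytes one NULL-terminated string vector contributes:
 * every string plus its terminating NUL. */
size_t vector_bytes(char *const vec[])
{
    size_t total = 0;

    for (size_t i = 0; vec != NULL && vec[i] != NULL; i++)
        total += strlen(vec[i]) + 1;
    return total;
}

/* Combined argv + envp size, i.e. the quantity the E2BIG text talks about.
 * Hypothetical helper; real kernels may count additional overhead. */
size_t exec_payload_bytes(char *const argv[], char *const envp[])
{
    return vector_bytes(argv) + vector_bytes(envp);
}
```

Calling exec_payload_bytes(argv, environ) just before an exec and comparing the result against sysconf(_SC_ARG_MAX) would show how close a process is to the limit.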
line (raexecocf.c:execra:178) is
execl(ra_pathname, ra_pathname, op_type, (const char *)NULL);
so it is NOT the argument list, even though the perror message
suggests that is the more likely cause of this error,
unless "op_type" somehow happens to be an unterminated multi-kB string.
(We know what ra_pathname is from the perror message.)
That leaves the environment.
Does lrmd accumulate setenv() calls somehow?
Or did crmd send too many parameters?
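
For anyone who wants to see the effect in isolation, the following is a minimal sketch, not lrmd code. It forks a child, pads the child's environment with setenv(), and then calls execl() with a short, fixed argument list. The helper name try_exec_with_env, the PAD_ variable names and the use of /bin/sh as the exec target are all assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper. Adds `nvars` variables of `var_len` bytes each to a
 * child's environment, then execl()s with a short argument list.
 * Returns 0 if the exec succeeded, the errno of the failed setenv()/execl()
 * otherwise, or -1 if the result could not be determined.
 */
int try_exec_with_env(size_t nvars, size_t var_len)
{
    int pfd[2];
    if (pipe(pfd) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    if (pid == 0) {
        int err = 0;
        close(pfd[0]);
        /* Close the write end on a successful exec so the parent sees EOF. */
        fcntl(pfd[1], F_SETFD, FD_CLOEXEC);

        char *value = malloc(var_len + 1);
        if (value == NULL) {
            err = ENOMEM;
        } else {
            memset(value, 'x', var_len);
            value[var_len] = '\0';
            for (size_t i = 0; i < nvars && err == 0; i++) {
                char name[32];
                snprintf(name, sizeof name, "PAD_%zu", i);
                if (setenv(name, value, 1) != 0)
                    err = errno;
            }
        }
        if (err == 0) {
            /* Short argv, as in the lrmd call; only the environment is large. */
            execl("/bin/sh", "/bin/sh", "-c", "exit 0", (char *)NULL);
            err = errno; /* reached only if execl() failed */
        }
        if (write(pfd[1], &err, sizeof err) != (ssize_t)sizeof err)
            _exit(126);
        _exit(127);
    }

    close(pfd[1]);
    int err = 0;
    int status = 0;
    ssize_t n = read(pfd[0], &err, sizeof err);
    close(pfd[0]);
    waitpid(pid, &status, 0);
    if (n == (ssize_t)sizeof err)
        return err; /* child reported a failure */
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

With a small environment, e.g. try_exec_with_env(1, 16), the exec should go through. With something like try_exec_with_env(4096, 4096), roughly 16 MiB of environment, I would expect E2BIG on Linux with default limits, although the exact threshold depends on the kernel and the stack rlimit.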
--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.