[Pacemaker] strange Failover due to sudden "Argument list too long" error on all resource agents

Mon May 10 05:09:04 EDT 2010

On Mon, May 10, 2010 at 11:05 AM, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:
> Hi,
>
> On Fri, May 07, 2010 at 07:08:37PM +0200, Lars Ellenberg wrote:
>> On Fri, May 07, 2010 at 03:03:39PM +0200, Dejan Muhamedagic wrote:
>> > Hi,
>> >
>> > On Fri, May 07, 2010 at 12:35:59PM +0200, Fabian Ruff wrote:
>> > > Hi,
>> > >
>> > > I'm currently testing a 2-node HA-Firewall with pacemaker+corosync
>> > > on Debian Lenny.
>> > > I used the latest package from the madkiss repo for the setup
>> > > (corosync 1.2.0, pacemaker 1.0.8).
>> > >
>> > > I will spare you all the verbose config for now and just give you an
>> > > overview of the resource configuration:
>> > >
>> > > >gwa:~# crm_mon -1
>> > > >============
>> > > >Last updated: Fri May  7 12:10:19 2010
>> > > >Stack: openais
>> > > >Current DC: gwb - partition with quorum
>> > > >Version: 1.0.8-2c98138c2f070fcb6ddeab1084154cffbf44ba75
>> > > >2 Nodes configured, 2 expected votes
>> > > >6 Resources configured.
>> > > >============
>> > > >
>> > > >Online: [ gwa gwb ]
>> > > >
>> > > > Master/Slave Set: drbd_disk
>> > > >     Masters: [ gwa ]
>> > > >     Slaves: [ gwb ]
>> > > > Clone Set: connectivity
>> > > >     Started: [ gwb gwa ]
>> > > > fencing_gwa     (stonith:external/ipmi):        Started gwb
>> > > > fencing_gwb     (stonith:external/ipmi):        Started gwa
>> > > > Resource Group: ips
>> > > >     ip_outside  (ocf::heartbeat:IPaddr2):       Started gwa
>> > > >     ip_backup   (ocf::heartbeat:IPaddr2):       Started gwa
>> > > >     ip_secure   (ocf::heartbeat:IPaddr2):       Started gwa
>> > > >     ip_inside   (ocf::heartbeat:IPaddr2):       Started gwa
>> > > >     ip_staging  (ocf::heartbeat:IPaddr2):       Started gwa
>> > > >     firewall    (lsb:firewall): Started gwa
>> > > > Resource Group: services
>> > > >     filesystem  (ocf::heartbeat:Filesystem):    Started gwa
>> > > >     openvpn     (lsb:openvpn-cluster):  Started gwa
>> > > >     dnsmasq     (lsb:dnsmasq):  Started gwa
>> > >
>> > >
>> > > The cluster was running fairly stable for the past couple of weeks.
>> > >
>> > > But then yesterday, without any user interaction and while idle, the
>> > > active node (gwa) failed and was subsequently stonithed by the
>> > > passive one (gwb) due to a strange error (at least to me) on almost
>> > > all resource agents:
>> > >
>> > > >gwa:~# grep -i error /var/log/syslog-20100507
>> > > >May  6 14:13:23 gwa lrmd: [27931]: ERROR: (raexecocf.c:execra:178) execl failed for /usr/lib/ocf/resource.d//heartbeat/IPaddr2: Argument list too long
>>
>> man execve:
>>        E2BIG  The total number of bytes in the environment (envp) and argument list (argv) is too large.
>>
>> line (raexecocf.c:execra:178) is
>> execl(ra_pathname, ra_pathname, op_type, (const char *)NULL);
>>
>> so it is NOT the argument list, even though perror seems to
>> think that's the more likely cause for this error.
>> unless "op_type" happens to be an unterminated multi kB string somehow.
>> (we know what ra_pathname is from the perror message).
>>
>> Does lrmd accumulate setenv() somehow?
>
> No, I don't think so. The set of environment variables is limited
> to what is provided by the client in the message.
>
>> Or crmd sent too many parameters?
>
> If that's the case, then there is perhaps memory corruption in
> crmd. Or the messaging layer. One unusual thing about the
> configuration is that, obviously by accident, the start operation
> on two fencing resources had a non-zero interval.

Just in the config or did it make it into the lrmd like that?

Because crmd/lrm.c has:

	if(op->interval != 0) {
		if(safe_str_eq(operation, CRMD_ACTION_START)
		   || safe_str_eq(operation, CRMD_ACTION_STOP)) {
			crm_err("Start and Stop actions cannot have an interval: %d", op->interval);
			op->interval = 0;
		}
	}
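
As an aside on the E2BIG point quoted above: the error can be reproduced with a tiny argv and an oversized environment, which is consistent with Lars's reading that the argument list itself need not be the culprit. Below is a minimal sketch, assuming a Linux/POSIX system with /bin/sh. The helper name exec_with_env, the FAKE_PARAM_ variable names and the sizes are all made up for illustration; none of this is lrmd code, and it does not show what actually happened on gwa.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of lrmd): exec a trivial command with an
 * environment of `nvars` variables, each carrying `var_len` bytes of value.
 * Returns 0 if the exec succeeded, E2BIG if the kernel rejected it for
 * size, and -1 for any other failure. */
int exec_with_env(size_t nvars, size_t var_len)
{
    char **envp = calloc(nvars + 1, sizeof(*envp)); /* NULL-terminated */
    int rc = -1;

    if (envp == NULL)
        return -1;

    for (size_t i = 0; i < nvars; i++) {
        envp[i] = malloc(var_len + 48);
        if (envp[i] == NULL)
            goto out;
        int n = snprintf(envp[i], 48, "FAKE_PARAM_%zu=", i);
        assert(n > 0 && n < 48);
        memset(envp[i] + n, 'x', var_len);
        envp[i][n + var_len] = '\0';
    }

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: argv stays tiny, only envp grows. */
        char *const argv[] = { "sh", "-c", "exit 0", NULL };
        execve("/bin/sh", argv, envp);
        /* Only reached if execve failed; report why via the exit code. */
        _exit(errno == E2BIG ? 100 : 101);
    }

    int status = 0;
    if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status)) {
        int code = WEXITSTATUS(status);
        if (code == 0)
            rc = 0;
        else if (code == 100)
            rc = E2BIG;
    }

out:
    for (size_t i = 0; i < nvars; i++)
        free(envp[i]);
    free(envp);
    return rc;
}
```

Calling exec_with_env(4, 64) should return 0, while something like exec_with_env(16384, 1024) (about 16 MB of environment) should come back as E2BIG on a typical Linux box, where perror would print the same "Argument list too long" text seen in the lrmd log. The exact threshold is system dependent; getconf ARG_MAX gives a rough idea.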