[Pacemaker] strange Failover due to sudden "Argument list too long" error on all resource agents

Fri May 7 06:35:59 EDT 2010

Hi,

I'm currently testing a 2-node HA firewall with pacemaker+corosync on 
Debian Lenny.
I used the latest packages from the madkiss repo for the setup (corosync 
1.2.0, pacemaker 1.0.8).

I will spare you the verbose config for now and just give you an 
overview of the resource configuration:

> gwa:~# crm_mon -1
> ============
> Last updated: Fri May  7 12:10:19 2010
> Stack: openais
> Current DC: gwb - partition with quorum
> Version: 1.0.8-2c98138c2f070fcb6ddeab1084154cffbf44ba75
> 2 Nodes configured, 2 expected votes
> 6 Resources configured.
> ============
> 
> Online: [ gwa gwb ]
> 
>  Master/Slave Set: drbd_disk
>      Masters: [ gwa ]
>      Slaves: [ gwb ]
>  Clone Set: connectivity
>      Started: [ gwb gwa ]
>  fencing_gwa	(stonith:external/ipmi):	Started gwb
>  fencing_gwb	(stonith:external/ipmi):	Started gwa
>  Resource Group: ips
>      ip_outside	(ocf::heartbeat:IPaddr2):	Started gwa
>      ip_backup	(ocf::heartbeat:IPaddr2):	Started gwa
>      ip_secure	(ocf::heartbeat:IPaddr2):	Started gwa
>      ip_inside	(ocf::heartbeat:IPaddr2):	Started gwa
>      ip_staging	(ocf::heartbeat:IPaddr2):	Started gwa
>      firewall	(lsb:firewall):	Started gwa
>  Resource Group: services
>      filesystem	(ocf::heartbeat:Filesystem):	Started gwa
>      openvpn	(lsb:openvpn-cluster):	Started gwa
>      dnsmasq	(lsb:dnsmasq):	Started gwa

The cluster had been running stably for the past couple of weeks.

But yesterday, without any user interaction and while the cluster was 
idle, the active node (gwa) failed and was subsequently stonithed by the 
passive one (gwb). The cause was an error that looks strange (at least 
to me) and hit almost all resource agents:

> gwa:~# grep -i error /var/log/syslog-20100507
> May  6 14:13:23 gwa lrmd: [27931]: ERROR: (raexecocf.c:execra:178) execl failed for /usr/lib/ocf/resource.d//heartbeat/IPaddr2: Argument list too long
> May  6 14:13:23 gwa crmd: [24899]: info: process_lrm_event: LRM operation ip_inside_monitor_20000 (call=1628, rc=1, cib-update=7524, confirmed=false) unknown error
> May  6 14:13:24 gwa lrmd: [27945]: ERROR: (raexecocf.c:execra:178) execl failed for /usr/lib/ocf/resource.d//heartbeat/IPaddr2: Argument list too long
> May  6 14:13:24 gwa crmd: [24899]: info: process_lrm_event: LRM operation ip_staging_monitor_20000 (call=1634, rc=1, cib-update=7526, confirmed=false) unknown error
> May  6 14:13:25 gwa lrmd: [27948]: ERROR: (raexeclsb.c:execra:267) execv failed for /etc/init.d/firewall: Argument list too long
> May  6 14:13:25 gwa lrmd: [27965]: ERROR: (raexecocf.c:execra:178) execl failed for /usr/lib/ocf/resource.d//pacemaker/ping: Argument list too long
> May  6 14:13:25 gwa crmd: [24899]: info: process_lrm_event: LRM operation ping:0_monitor_5000 (call=15, rc=1, cib-update=7530, confirmed=false) unknown error
> May  6 14:13:26 gwa lrmd: [27966]: ERROR: (raexecocf.c:execra:178) execl failed for /usr/lib/ocf/resource.d//heartbeat/IPaddr2: Argument list too long
> May  6 14:13:26 gwa lrmd: [27967]: ERROR: (raexeclsb.c:execra:267) execv failed for /etc/init.d/openvpn-cluster: Argument list too long
> May  6 14:13:26 gwa crmd: [24899]: info: process_lrm_event: LRM operation ip_secure_monitor_20000 (call=1623, rc=1, cib-update=7531, confirmed=false) unknown error
> May  6 14:13:27 gwa lrmd: [27971]: ERROR: (raexecocf.c:execra:178) execl failed for /usr/lib/ocf/resource.d//linbit/drbd: Argument list too long
> May  6 14:13:27 gwa lrmd: [27972]: ERROR: (raexecocf.c:execra:178) execl failed for /usr/lib/ocf/resource.d//heartbeat/IPaddr2: Argument list too long
> May  6 14:13:27 gwa crmd: [24899]: info: process_lrm_event: LRM operation ip_outside_monitor_20000 (call=1618, rc=1, cib-update=7535, confirmed=false) unknown error
> May  6 14:13:27 gwa lrmd: [27973]: ERROR: (raexecocf.c:execra:178) execl failed for /usr/lib/ocf/resource.d//heartbeat/Filesystem: Argument list too long
> May  6 14:13:27 gwa crmd: [24899]: info: process_lrm_event: LRM operation filesystem_monitor_20000 (call=1632, rc=1, cib-update=7536, confirmed=false) unknown error
> May  6 14:13:30 gwa lrmd: [27974]: ERROR: (raexecocf.c:execra:178) execl failed for /usr/lib/ocf/resource.d//pacemaker/ping: Argument list too long
> May  6 14:13:31 gwa lrmd: [27975]: ERROR: (raexeclsb.c:execra:267) execv failed for /etc/init.d/dnsmasq: Argument list too long
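
To see at a glance which agents were affected, the lrmd errors above can 
be tallied per agent path. This is only a minimal sketch: the function 
name count_exec_failures and the regular expression are illustrative and 
assume the exact lrmd message format shown in the excerpt above.

```python
import re
from collections import Counter

# Matches lrmd exec failures in the format seen in the excerpt, e.g.
#   lrmd: [27931]: ERROR: (raexecocf.c:execra:178) execl failed for <path>: Argument list too long
# Both execl (OCF agents) and execv (LSB init scripts) are covered.
EXEC_FAILURE = re.compile(
    r"lrmd: \[\d+\]: ERROR: .*? exec[lv] failed for (?P<path>\S+): Argument list too long"
)


def count_exec_failures(lines):
    """Return a Counter mapping each agent path to its number of failed execs."""
    counts = Counter()
    for line in lines:
        match = EXEC_FAILURE.search(line)
        if match:
            counts[match.group("path")] += 1
    return counts


def print_summary(log_path):
    """Print one line per agent, most frequently failing first."""
    with open(log_path, errors="replace") as log:
        for path, n in count_exec_failures(log).most_common():
            print(f"{n:3d}  {path}")
```

Calling print_summary("/var/log/syslog-20100507") would list every agent 
path together with how often its exec call failed.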

I have no idea what might be causing this error, and would be grateful 
if someone could explain how to prevent it from happening again.

I'll happily provide more information (logs, config) if needed.

Cheers,
Fabian