[Pacemaker] Possible error in RA invocation

Tue Feb 18 10:54:45 EST 2014

----- Original Message -----
> From: "Santiago Pérez" <santiago.perez at entertainment-solutions.eu>
> To: pacemaker at oss.clusterlabs.org
> Sent: Thursday, January 30, 2014 1:50:41 PM
> Subject: [Pacemaker] Possible error in RA invocation
> 
> Hi everyone,
> 
> I am running a two-node cluster which hosts two Xen VMs. We're using
> DRBD, but it's managed directly from Xen.
> 
> The configuration of one of these resources is as follows:
> 
> primitive xen-vm1 ocf:heartbeat:Xen
>          params xmfile="/etc/xen/vm1.cfg"
>          op monitor interval="30s"
>          op start interval="0" timeout="60s"
>          op stop interval="0" timeout="300s"
>          op migrate_from interval="0" timeout="240" ingerval="0"
>          op migrate_to interval="0" timeout="240"
>          meta allow-migrate="true" target-role="Started"
>          meta target-role="Started"
> 
> 
> I have a problem with the monitor operation. It seems to be working
> fine... until it doesn't. The cluster can be running for weeks without
> any failure, but sometimes the monitor operation fails with a really
> strange error from the resource agent. This is an excerpt of one of the
> failures:
> 
> Jan 28 14:40:20 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71]
> (pid 11756)
> Jan 28 14:40:20 xenhost1 lrmd: [3822]: info: operation monitor[71] on
> xen-vm1 for client 3825: pid 11756 exited with return code 0
> Jan 28 15:40:26 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71]
> (pid 18065)
> Jan 28 15:40:27 xenhost1 lrmd: [3822]: info: operation monitor[71] on
> xen-vm1 for client 3825: pid 18065 exited with return code 0
> Jan 28 16:40:32 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71]
> (pid 24373)
> Jan 28 16:40:32 xenhost1 lrmd: [3822]: info: operation monitor[71] on
> xen-vm1 for client 3825: pid 24373 exited with return code 0
> Jan 28 17:40:38 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71]
> (pid 30686)
> Jan 28 17:40:38 xenhost1 lrmd: [3822]: info: operation monitor[71] on
> xen-vm1 for client 3825: pid 30686 exited with return code 0
> Jan 28 18:40:44 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71]
> (pid 4593)
> Jan 28 18:40:44 xenhost1 lrmd: [3822]: info: operation monitor[71] on
> xen-vm1 for client 3825: pid 4593 exited with return code 0
> Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: RA output:
> (xen-vm1:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/Xen: 71: local:
> Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: RA output:
> (xen-vm1:monitor:stderr) en-list: bad variable name

This is weird. It looks almost as if your shell environment is broken. I'm not sure what is causing this.

-- Vossel
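
For readers trying to interpret the "local: en-list: bad variable name" message above: it has the shape of a shell error in which the `local` builtin received "en-list" as a separate argument and rejected it as a variable name. The following is only a minimal sketch of that mechanism, not a diagnosis of this cluster. The helpers `is_valid_name` and `count_fields` and the sample string "x en-list" are hypothetical and are not taken from the Xen RA or from the logs.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Xen RA): succeeds if $1 is a valid
# shell variable name (letters, digits, underscores; no leading digit).
is_valid_name() {
    case "$1" in
        ''|[0-9]*|*[!A-Za-z0-9_]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Hypothetical helper: prints how many fields an unquoted expansion of $1
# produces after the shell splits it on IFS.
count_fields() {
    set -- $1      # deliberately unquoted so that field splitting happens
    echo "$#"
}

# A hyphen is not allowed in a variable name, so a shell would refuse
# "en-list" if it ever reached `local` as an argument of its own.
is_valid_name "en-list" || echo "en-list: not a valid variable name"

# Illustrative input only: a value containing whitespace splits into two
# fields, so an unquoted `local var=$(command)` could pass two arguments
# to `local` in shells that field-split there (older dash releases do).
count_fields "x en-list"
```

If something like this were the cause, declaring the variable first (`local var`) and assigning on a separate line (`var=$(command)`) would avoid the splitting. Whether that applies to line 71 of this particular agent version is an assumption that would need to be checked against the installed script.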

> Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: RA output:
> (xen-vm1:monitor:stderr)
> Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: cancel_op: operation
> monitor[71] on xen-vm1 for client 3825, its parameters:
> crm_feature_set=[3.0.6] xmfile=[/etc/xen/vm1.cfg]
> CRM_meta_name=[monitor] CRM_meta_interval=[30000]
> CRM_meta_timeout=[20000]  cancelled
> Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 stop[72] (pid 6219)
>
> The machines are very low on resources, and this unnecessary migration
> is causing problems.
> 
> The systems are running Debian Wheezy with pacemaker 1.1.7-1 and
> resource-agents 3.9.2-5+deb7u1. I don't know yet if there's a problem
> with the Xen RA, the lrmd service itself or my configuration. I wasn't
> able to find any information related to this issue. Do you have any idea
> of what could be causing this? Any help will be appreciated.
> 
> Regards,
> Santiago