[Pacemaker] Occasional nonsensical resource agent errors, redux
Ken Gaillot
kjgaillo at gleim.com
Tue Nov 4 17:21:45 UTC 2014
On 11/04/2014 11:02 AM, Dejan Muhamedagic wrote:
>>>>>> On 1 Nov 2014, at 11:03 pm, Patrick Kane <pmk at wawd.com> wrote:
>>>>>>
>>>>>> Hi all:
>>>>>>
>>>>>> In July, list member Ken Gaillot reported occasional nonsensical resource agent errors using Pacemaker (http://oss.clusterlabs.org/pipermail/pacemaker/2014-July/022231.html).
>>>>>>
<snip>
>> I was hoping to have something useful before posting another update, but
>> since it's come up again, here's what we've found so far:
>>
>> * The most common manifestation is the "couldn't find command" error. In
>> various instances it "couldn't find" xm, ip or awk. However, we've seen
>> two other variations:
>>
>> lrmd: [3363]: info: RA output: (pan:monitor:stderr) en-destroy: bad
>> variable name
>>
>> and
>>
>> lrmd: [2145]: info: RA output: (ldap-ip:monitor:stderr)
>> /usr/lib/ocf/resource.d//heartbeat/IPaddr2: 1:
>> /usr/lib/ocf/resource.d//heartbeat/IPaddr2: : Permission denied
>>
>> The RA in the first case does not use the string "en-destroy" at all; it
>> does, however, call a command named "xen-destroy". That, to me, strongly
>> suggests memory corruption somewhere, whether in the RA, the shell, lrmd,
>> or a library used by one of those.
>
> Scary. Shell and lrmd are two obvious candidates. I assume that
> neither of them would segfault if something were trampling through the
> memory where a copy of the running script resides.
>
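Right. For what it's worth, the "bad variable name" wording itself is just
the shell complaining about an invalid identifier handed to "local".
Assuming /bin/sh is dash here (the Debian default), a quick hypothetical
check like

    dash -c 'f() { local en-destroy; }; f'

produces the same complaint about "en-destroy". So whatever got mangled,
the string the shell saw at that point really was "en-destroy" rather than
"xen-destroy".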
>> * I have not found any bugs in the RA or its included files.
>>
>> * I tried setting "debug: on" in corosync.conf, but that did not give
>> any additional useful information. The resource agent error is still the
>> first unusual message in the sequence. Here is an example, giving one
>> successful monitor run and then an occurrence of the issue (the nodes
>> are a pair of Xen dom0s including pisces, running two Xen domU resources
>> pan and nemesis):
>>
>> Sep 13 20:16:56 pisces lrmd: [3509]: debug: rsc:pan monitor[21] (pid 372)
>> Sep 13 20:16:56 pisces lrmd: [372]: debug: perform_ra_op: resetting
>> scheduler class to SCHED_OTHER
>> Sep 13 20:16:56 pisces lrmd: [3509]: debug: rsc:nemesis monitor[32] (pid
>> 409)
>> Sep 13 20:16:56 pisces lrmd: [409]: debug: perform_ra_op: resetting
>> scheduler class to SCHED_OTHER
>> Sep 13 20:16:56 pisces lrmd: [3509]: info: operation monitor[21] on pan
>> for client 3512: pid 372 exited with return code 0
>> Sep 13 20:16:57 pisces lrmd: [3509]: info: operation monitor[32] on
>> nemesis for client 3512: pid 409 exited with return code 0
>> Sep 13 20:17:06 pisces lrmd: [3509]: debug: rsc:pan monitor[21] (pid 455)
>> Sep 13 20:17:06 pisces lrmd: [455]: debug: perform_ra_op: resetting
>> scheduler class to SCHED_OTHER
>> Sep 13 20:17:07 pisces lrmd: [3509]: info: RA output:
>> (pan:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/Xen: 71: local:
>
> This "local" seems to be from ocf-binaries:have_binary():
>
> 71 local bin=`echo $1 | sed -e 's/ -.*//'`
Agreed; nothing unusual there, which reinforces my suspicion of memory corruption.
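For reference, the surrounding code in ocf-binaries looks roughly like this
(paraphrased from memory of the copy installed here, so treat it as a sketch;
only the line Dejan quoted is verbatim). It is also where the more common
"couldn't find command" message comes from:

    have_binary () {
        if [ "$OCF_TESTER_FAIL_HAVE_BINARY" = "1" ]; then
            false
        else
            # line 71: strip any trailing options, then look the command up
            local bin=`echo $1 | sed -e 's/ -.*//'`
            test -x "`which $bin 2>/dev/null`"
        fi
    }

    check_binary () {
        if ! have_binary "$1"; then
            # source of the "couldn't find command" errors we keep seeing
            ocf_log err "Setup problem: couldn't find command: $1"
            exit $OCF_ERR_INSTALLED
        fi
    }

Both functions are trivial, which again points away from a bug in the
agents themselves.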
>> Sep 13 20:17:07 pisces lrmd: [3509]: info: RA output:
>> (pan:monitor:stderr) en-destroy: bad variable name
>> Sep 13 20:17:07 pisces lrmd: [3509]: info: RA output: (pan:monitor:stderr)
>> Sep 13 20:17:07 pisces lrmd: [3509]: info: operation monitor[21] on pan
>> for client 3512: pid 455 exited with return code 2
>>
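(To be clear about the "debug: on" above: it is nothing more exotic than
the directive in corosync.conf's logging section, roughly

    logging {
            to_syslog: yes
            debug: on
            timestamp: on
    }

followed by a restart of the cluster stack, which is where the extra lrmd
debug lines in the excerpt come from.)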
>> * I tried reverting several security updates applied in the month or so
>> before we first saw the issue. Reverting the Debian kernel packages to
>> 3.2.57-3 and then 3.2.54-2 did not help, nor did reverting libxml2 to
>> 2.8.0+dfsg1-7+nmu2.
>
> I suppose that you restarted the cluster stack after each update :)
Yep, with a reboot for the kernel reverts. Some of the libxml2 reverts got
full reboots as well, because they were done along with other maintenance.
>> None of the other updates from that time
>> look like they could have any effect.
>>
>> * Regarding libxml2, I did find that Debian had backported an upstream
>> patch into its 2.8.0+dfsg1-7+nmu3 that introduced a memory corruption
>> bug, which upstream later corrected (the bug never made it into an
>> upstream release, but Debian had backported a specific changeset). I
>> submitted that as Debian Bug #765770, which was just fixed last week. I
>> haven't had a chance to apply that fix to the affected servers yet, but
>> as mentioned above, reverting to the libxml2 version from before the
>> introduced bug did not fix the issue.
>>
>> * I have not found a way to intentionally reproduce the issue. :-( We
>> have had 10 occurrences across three two-node clusters in five months.
>> Some of the nodes have had only one occurrence during that time, but one
>> pair gets most of them. With so much time between occurrences, it's hard
>> to do something like strace on lrmd, though that's probably a good way
>> forward if I script something to handle the output reasonably.
>
> Perhaps dumping core of both lrmd and the shell when this
> happens would help. Are the most affected nodes in any way
> significantly different from the others? By CIB size perhaps?
Actually, the most affected cluster is the simplest of the three. An overview of our setup:
Cluster #1 (with 6 of the 10 failures): Xen dom0s as nodes, two Xen
domUs as resources (DRBD is configured outside pacemaker, Xen handles
primary/secondary roles)
Cluster #2 (2/10 failures): Xen dom0s as nodes, DLM resource, CLVM
resource, DRBD resource, LVM volume group resource, 10 Xen domU resources
Cluster #3 (2/10 failures): Xen domUs as nodes, DNS daemon resource,
LDAP daemon resource, (2x) IPaddr2 resources
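On the core-dump idea: the RA shell is long gone by the time the error
reaches syslog, but grabbing lrmd at that moment should be doable with a
small log watcher. An untested sketch (log path and match strings are just
examples):

    #!/bin/sh
    # untested sketch: dump a core of the running lrmd as soon as one of
    # the suspicious error strings shows up in syslog (gcore ships with gdb)
    mkdir -p /var/crash
    tail -F -n 0 /var/log/syslog | while read -r line; do
        case $line in
        *"bad variable name"*|*"couldn't find command"*)
            gcore -o /var/crash/lrmd $(pidof lrmd)
            break
            ;;
        esac
    done

Catching the shells themselves would still need something attached up
front, e.g. strace -f -ff -tt -s 256 -o /var/crash/lrmd.trace -p $(pidof
lrmd), which is basically the strace idea above with the output split into
per-PID files so it stays manageable.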
>> * There does not seem to be any correlation with how long the node has
>> been up. Checking RAM usage of corosync and lrmd on all nodes over about
>> two weeks shows little to no change, so I don't suspect a leak. Most of
>> our errors have occurred in the Xen RA, but probably only because that's
>> the RA we use most; we've also seen it in IPaddr2.
>>
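For anyone who wants to repeat the RAM check mentioned above, a cron entry
along these lines is enough (a sketch of one way to do it, not necessarily
what we ran):

    # /etc/cron.d/cluster-mem: sample resident set sizes (kB) every 10 minutes
    */10 * * * * root (date; ps -C corosync,lrmd -o comm=,rss=) >> /var/tmp/cluster-mem.log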
>> * My next idea would be to compile and install the latest versions of at
>> least pacemaker and the resource agents. However, I am in the middle of
>> changing jobs and unfortunately do not have much time left for this. My
>> new job will give me plenty of time to spend on pacemaker ;-) so I may be
>> able to give updates later. Debian's "jessie" release freezes this week,
>> so I'm hoping I will have time to at least get a test cluster up and
>> running with the somewhat newer versions it carries (pacemaker 1.1.10,
>> corosync 1.4.6).
>
> Did you open a bug report with Debian?
Not for the resource agent issue. I'd have to narrow it down to a
particular package, and rule out (or reference) an upstream bug.
-- 
Ken Gaillot <kjgaillo at gleim.com>
Network Operations Center, Gleim Publications