[Pacemaker] Occasional nonsensical resource agent errors since Debian 3.2.57-3+deb7u1 kernel update
Ken Gaillot
kjgaillo at gleim.com
Sat Jul 12 13:42:57 UTC 2014
Hi,
We run multiple deployments of corosync+pacemaker on Debian "wheezy" for
high-availability of various resources. The configurations are unchanged
and ran without any issues for many months. However, since we applied
the Debian 3.2.57-3+deb7u1 kernel update in May, we have been getting
resource agent errors on rare occasions, with error messages that are
clearly incorrect.
The incidents have happened four times on two unrelated clusters:
* Our cluster hosts "talos" and "pomona" use pacemaker to manage a few
virtual IP addresses using the ocf:heartbeat:IPaddr2 resource agent. This
one has had two incidents. The first incident began with this error:
Jun 2 17:30:16 pomona lrmd: [2145]: info: RA output:
(ldap-ip:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/IPaddr2: 1:
/usr/lib/ocf/resource.d//heartbeat/IPaddr2: : Permission denied
The second incident began with this error:
Jul 12 08:36:15 talos IPaddr2[21294]: ERROR: Setup problem: couldn't
find command: ip
I can confidently say that neither the permissions of IPaddr2 nor the
location of the "ip" command changed at any point!
* Our cluster hosts "aries" and "taurus" use pacemaker in a more
complicated setup, managing Xen virtual machines on shared storage
utilizing DRBD and CLVM, using the resource agents
ocf:pacemaker:controld, ocf:gleim:clvmd (which is the stock clvmd
resource agent from a later pacemaker version than is included in
wheezy), ocf:heartbeat:LVM, ocf:linbit:drbd, and ocf:gleim:Xen (which is
the stock Xen resource agent with a trivial one-line change for a local
workaround).
This cluster has also had two incidents:
* The first began with:
Jun 16 10:38:15 aries lrmd: [3646]: info: RA output:
(jabber:monitor:stderr) /usr/lib/ocf/resource.d//gleim/Xen: 71: local:
en-list: bad variable name
There is no variable "en-list" in the resource agent; the closest string
in the file is "xen-list", which is a binary, not a variable, and is used
like this:
...
if have_binary xen-list; then
xen-list $1 2>/dev/null | grep -qs "State.*[-r][-b][-p]--" 2>/dev/null
...
* The second began with:
Jun 21 11:58:58 taurus Xen[9052]: ERROR: Setup problem: couldn't find
command: awk
Again, the location of "awk" has not changed.
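The claims above (agent permissions intact, "ip" and "awk" still in
place) can be re-checked on a node right after an incident with something
like the following. This is only a minimal sketch: the function name
check_agent_env is hypothetical and not part of resource-agents, and it
assumes it is run with the same PATH the agents see.

```shell
#!/bin/sh
# Hypothetical helper (not part of resource-agents): report whether an
# agent script is executable and whether the commands it needs resolve.
# Usage: check_agent_env AGENT_PATH COMMAND...
check_agent_env() {
    agent=$1
    shift
    status=0
    # The agent itself must be an executable file.
    if [ ! -x "$agent" ]; then
        echo "not executable: $agent"
        status=1
    fi
    # Each required command must be resolvable via the current PATH.
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "not found in PATH: $cmd"
            status=1
        fi
    done
    return $status
}
```

For example, "check_agent_env /usr/lib/ocf/resource.d/heartbeat/IPaddr2 ip"
prints nothing and returns 0 when both checks pass. A clean result
immediately after an incident would support the view that the errors do
not reflect the real state of the filesystem.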
We have no reason to suspect the kernel update other than timing, and
the fact that the incidents occur on unrelated clusters. We have since
upgraded to Debian's next update, 3.2.57-3+deb7u2, but the most recent
incident occurred after that. The original update included fixes for
these issues:
CVE-2014-0196
Jiri Slaby discovered a race condition in the pty layer, which could
lead to denial of service or privilege escalation.
CVE-2014-1737 / CVE-2014-1738
Matthew Daley discovered that missing input sanitising in the
FDRAWCMD ioctl and an information leak could result in privilege
escalation.
CVE-2014-2851
Incorrect reference counting in the ping_init_sock() function allows
denial of service or privilege escalation.
CVE-2014-3122
Incorrect locking of memory can result in local denial of service.
Given the odd error messages from the resource agent, I suspect it's a
memory corruption error of some sort. We've been unable to find anything
else useful in the logs, and we'll probably end up reverting to the
prior kernel version. But given the rarity of the issue, it would be a
long while before we could be confident that the revert had fixed it.
Is anyone else running pacemaker on Debian with 3.2.57-3+deb7u1 kernel
or later? Has anyone had any similar issues?
-- Ken Gaillot <kjgaillo at gleim.com>
Gleim NOC