[Pacemaker] Occasional nonsensical resource agent errors
Andrew Daugherity
adaugherity at tamu.edu
Tue Jul 15 20:31:38 CEST 2014
> Message: 1
> Date: Sat, 12 Jul 2014 09:42:57 -0400
> From: Ken Gaillot <kjgaillo at gleim.com>
> To: pacemaker at oss.clusterlabs.org
> Subject: [Pacemaker] Occasional nonsensical resource agent errors
> since Debian 3.2.57-3+deb7u1 kernel update
> Message-ID: <53C13B61.7080803 at gleim.com>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Hi,
>
> We run multiple deployments of corosync+pacemaker on Debian "wheezy" for
> high-availability of various resources. The configurations are unchanged
> and ran without any issues for many months. However, since we applied
> the Debian 3.2.57-3+deb7u1 kernel update in May, we have been getting
> resource agent errors on rare occasions, with error messages that are
> clearly incorrect.
>
>
> [....]
>
> Given the odd error messages from the resource agent, I suspect it's a
> memory corruption error of some sort. We've been unable to find anything
> else useful in the logs, and we'll probably end up reverting to the
> prior kernel version. But given the rarity of the issue, it would be a
> long while before we could be confident that fixed it.
>
> Is anyone else running pacemaker on Debian with 3.2.57-3+deb7u1 kernel
> or later? Has anyone had any similar issues?
Just curious: I see you're running Xen. Are you setting dom0_mem? I had similar issues with SLES 11 SP2 and SP3 (but not SP1 or earlier) that were apparently random memory corruption caused by a kernel bug. The corruption was mostly random, but I did eventually find a repeatable test case: checksum verification of a kernel build tree with mtree. On affected systems, a few files would usually fail to verify.
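For anyone who wants to try a similar check without mtree, here is a rough Python sketch of the same idea: record a checksum for every file in a tree, then re-read the tree and report files whose checksum no longer matches. This is a stand-in, not what I actually ran; the function names and the choice of SHA-256 are just assumptions for illustration.

```python
import hashlib
import os


def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(root):
    """Map each regular file under root (by relative path) to its digest."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and special files; only regular files are hashed.
            if os.path.isfile(full) and not os.path.islink(full):
                result[os.path.relpath(full, root)] = hash_file(full)
    return result


def verify(root, expected):
    """Return the sorted relative paths that are missing or whose digest changed."""
    current = snapshot(root)
    return sorted(path for path, digest in expected.items()
                  if current.get(path) != digest)
```

Take one snapshot() of a tree that nothing is writing to (a finished kernel build tree, say), then call verify() against it repeatedly. On a healthy system the result should be an empty list every time.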
I had been setting dom0_mem=768M, which was a good balance between maximizing the memory available for VMs and keeping enough for services in Dom0 (including pacemaker/corosync). I set the node attributes for pacemaker utilization to 1GB less than physical RAM, leaving 256M available for Xen overhead, etc. Raising dom0_mem to 2048M (or not setting it at all) was a sufficient workaround to avoid the bug, and I have now finally received a fixed kernel from Novell support.
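To spell out the arithmetic behind those numbers, here is a small Python sketch. The 32GB host in the usage note is a made-up example, and the function and parameter names are hypothetical; only the 768M and 1GB figures come from my setup.

```python
def memory_budget(physical_mb, dom0_mb=768, reserve_mb=1024):
    """Split host RAM (in MB) into the VM budget and the leftover Xen overhead.

    reserve_mb is what is held back from the pacemaker utilization
    attribute; it has to cover Dom0 plus the hypervisor's own overhead.
    """
    if dom0_mb > reserve_mb:
        raise ValueError("reserve must be at least as large as dom0_mem")
    guest_mb = physical_mb - reserve_mb   # value for the utilization attribute
    overhead_mb = reserve_mb - dom0_mb    # what remains for Xen overhead, etc.
    return guest_mb, overhead_mb
```

For a hypothetical 32GB host, memory_budget(32768) gives (31744, 256): 31744M offered to VMs and 256M left for Xen. With dom0_mem raised to 2048M, the 1GB reserve would no longer cover Dom0, so the sketch raises an error unless reserve_mb is increased as well.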
Note: this fix has not yet made it into any official updates for SLES 11 -- Novell/SUSE say it will be in the next kernel version, whenever that happens. Recent openSUSE kernels are also affected (and have yet to be fixed).
-Andrew