[Pacemaker] Occasional nonsensical resource agent errors

Tue Jul 15 20:31:38 CEST 2014

> Message: 1
> Date: Sat, 12 Jul 2014 09:42:57 -0400
> From: Ken Gaillot <kjgaillo at gleim.com>
> To: pacemaker at oss.clusterlabs.org
> Subject: [Pacemaker] Occasional nonsensical resource agent errors
> 	since Debian 3.2.57-3+deb7u1 kernel update
> Message-ID: <53C13B61.7080803 at gleim.com>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
> 
> Hi,
> 
> We run multiple deployments of corosync+pacemaker on Debian "wheezy" for 
> high-availability of various resources. The configurations are unchanged 
> and ran without any issues for many months. However, since we applied 
> the Debian 3.2.57-3+deb7u1 kernel update in May, we have been getting 
> resource agent errors on rare occasions, with error messages that are 
> clearly incorrect.
> 
> 
> [....]
> 
> Given the odd error messages from the resource agent, I suspect it's a 
> memory corruption error of some sort. We've been unable to find anything 
> else useful in the logs, and we'll probably end up reverting to the 
> prior kernel version. But given the rarity of the issue, it would be a 
> long while before we could be confident that fixed it.
> 
> Is anyone else running pacemaker on Debian with 3.2.57-3+deb7u1 kernel 
> or later? Has anyone had any similar issues?

Just curious, I see you're running Xen; are you setting dom0_mem?  I had similar issues with SLES 11 SP2 and SP3 (but not <= SP1) that were apparently caused by random memory corruption due to a kernel bug.  The failures were mostly random, but I did eventually find a repeatable test case: checksum verification of a kernel build tree with mtree.  On affected systems there would usually be a few files that failed to verify.
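
For anyone who wants to try a similar check without mtree, here is a rough Python sketch of the same idea: record a checksum for every file in a tree, then re-read the tree and report files whose checksum no longer matches. This is a stand-in for the mtree run, not the exact procedure used; the function names and the choice of SHA-256 are my assumptions.

```python
import hashlib
import os


def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file under root (by relative path) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files; only regular files are hashed.
            if os.path.isfile(path) and not os.path.islink(path):
                manifest[os.path.relpath(path, root)] = hash_file(path)
    return manifest


def verify_manifest(root, manifest):
    """Return the sorted relative paths that are missing or hash differently."""
    mismatches = []
    for rel_path, expected in sorted(manifest.items()):
        try:
            actual = hash_file(os.path.join(root, rel_path))
        except OSError:
            actual = None  # unreadable or missing counts as a failure
        if actual != expected:
            mismatches.append(rel_path)
    return mismatches
```

Build the manifest once on a tree that is not being modified, then call verify_manifest repeatedly. On a healthy system each pass should return an empty list; any file reported without having been changed is a candidate sign of corruption.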

I had been setting dom0_mem=768M, as that was a good balance between maximizing memory available for VMs and keeping enough for services in Dom0 (including pacemaker/corosync).  I set the node attributes for pacemaker utilization to 1GB less than physical RAM, leaving 256M available for Xen overhead, etc.  Raising dom0_mem to 2048M (or not setting it at all) was a sufficient workaround to avoid the bug, and I have since received a fixed kernel from Novell support.
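
To spell out the arithmetic behind those numbers, here is a small sketch. The helper name and the 16384 MB of physical RAM in the example are made up for illustration; only the 768M dom0_mem value and the 1GB reservation come from my setup.

```python
def dom0_memory_budget(physical_mb, dom0_mb, reserved_mb=1024):
    """Split the memory held back from VMs into Dom0 and Xen headroom.

    physical_mb: total RAM in the host (hypothetical input)
    dom0_mb:     value passed to Xen as dom0_mem
    reserved_mb: amount subtracted from physical RAM when setting the
                 pacemaker utilization attribute (1GB in the setup above)
    """
    if dom0_mb > reserved_mb:
        raise ValueError("dom0_mem does not fit inside the reserved amount")
    return {
        # Memory advertised to pacemaker as usable by VMs on this node.
        "vm_utilization_mb": physical_mb - reserved_mb,
        # What is left of the reservation for Xen overhead, etc.
        "xen_headroom_mb": reserved_mb - dom0_mb,
    }


# Example with an assumed 16 GB host and dom0_mem=768M.
budget = dom0_memory_budget(physical_mb=16384, dom0_mb=768)
print(budget)  # {'vm_utilization_mb': 15360, 'xen_headroom_mb': 256}
```

Note that with dom0_mem raised to 2048M the 1GB reservation no longer covers Dom0, so the helper raises an error unless reserved_mb is increased as well.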

Note: this fix has not yet made it into any official updates for SLES 11 -- Novell/SUSE say it will be in the next kernel version, whenever that happens.  Recent openSUSE kernels are also affected (and have yet to be fixed).

-Andrew