[Pacemaker] stonith-ng message in /var/log/messages

Wed Sep 29 17:57:13 EDT 2010

Ron Kerry <rkerry at ...> writes:
> I am seeing the following sequence of messages with every monitor interval
> for my stonith resource.
> 
> Sep 28 10:44:01 genesis stonith-ng: [9493]: ERROR: run_stonith_agent: No timeout set for stonith operation monitor with device fence_legacy
> Sep 28 10:44:01 genesis stonith: l2network device OK.
> 
> It is unclear to me what this ERROR means as the resource itself says
> everything is fine. There is a monitor timeout set in the resource
> definition.
> 
> Distribution is SLES11SP1  (SLE11SP1-HAE).
> cluster-glue 1.0.6-0.3.7

I'm seeing the same problem ever since the latest update rollup from Novell (the
"sleshasp1-ha-update-201009" patch).  Example:
Sep 29 16:28:35 imsxen3 stonith-ng: [5182]: ERROR: run_stonith_agent: No timeout set for stonith operation monitor with device fence_legacy
Sep 29 16:28:36 imsxen3 stonith: external/ipmi device OK.

I downgraded the cluster-glue package (and a couple of others, so that RPM
dependencies were still satisfied) on one machine, and the messages went away on
that machine, while they're still there on the others.

To clarify -- the "no timeout set" error is logged on the machine the stonith
resource is currently running on, each time the monitor operation fires.  On the
machine I downgraded cluster-glue on, there are no such errors for any stonith
resource running on that server.
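
For anyone who wants to compare nodes the same way, here is a rough sketch (not part of my original setup) that tallies the "No timeout set" errors per host from syslog-style lines. The function name count_timeout_errors and the regular expression are my own assumptions, based only on the log lines quoted in this thread; other syslog formats may need a different pattern.

```python
import re
from collections import Counter

# Assumed line shape, taken from the messages quoted in this thread:
# "Sep 29 16:28:35 imsxen3 stonith-ng: [5182]: ERROR: run_stonith_agent: No timeout set ..."
PATTERN = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(?P<host>\S+)\s+stonith-ng:.*"
    r"ERROR: run_stonith_agent: No timeout set"
)


def count_timeout_errors(lines):
    """Return a Counter mapping hostname -> number of 'No timeout set' errors."""
    counts = Counter()
    for line in lines:
        match = PATTERN.match(line)
        if match:
            counts[match.group("host")] += 1
    return counts


# Sample input copied from the log excerpts in this thread.
sample = [
    "Sep 29 16:28:35 imsxen3 stonith-ng: [5182]: ERROR: run_stonith_agent: "
    "No timeout set for stonith operation monitor with device fence_legacy",
    "Sep 29 16:28:36 imsxen3 stonith: external/ipmi device OK.",
]
for host, n in sorted(count_timeout_errors(sample).items()):
    print(f"{host}: {n}")
```

To run it against a real log, pass an open file object (for example /var/log/messages, which usually needs root to read) instead of the sample list.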

My stonith definitions (in "crm configure" format) are like this:
primitive stonith-imsxen1 stonith:external/ipmi \
	meta target-role="Started" \
	operations $id="stonith-imsxen2-operations" \
	op monitor interval="300" timeout="15" start-delay="15" \
	params hostname="imsxen1" ipaddr="10.95.12.51" userid="stonith" passwd="XXXX" interface="lanplus"
and similarly for stonith-imsxen2 and stonith-imsxen3.  (Node names are
imsxen[123].)
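
To double-check that the monitor op really carries a timeout in the configuration, something like the following sketch could be run over the text of "crm configure show". It is a deliberately simplified parser of my own (the function name monitor_timeouts is hypothetical), not how the crm shell or stonith-ng reads the configuration, and it only handles definitions shaped like the one above.

```python
import re


def monitor_timeouts(crm_text):
    """Map each primitive id to the timeout of its monitor op (None if unset)."""
    # Join backslash-continued lines so each primitive is one logical line.
    joined = re.sub(r"\\\s*\n\s*", " ", crm_text)
    result = {}
    for line in joined.splitlines():
        prim = re.match(r"\s*primitive\s+(\S+)", line)
        if not prim:
            continue
        # Capture the attributes of "op monitor" up to the next keyword.
        op = re.search(
            r"\bop\s+monitor\b(.*?)(?=\s(?:op|params|meta|operations)\s|$)", line
        )
        timeout = re.search(r'\btimeout="?(\d+\w*)"?', op.group(1)) if op else None
        result[prim.group(1)] = timeout.group(1) if timeout else None
    return result


# Trimmed-down version of the definition shown above.
example = (
    'primitive stonith-imsxen1 stonith:external/ipmi \\\n'
    '\tmeta target-role="Started" \\\n'
    '\top monitor interval="300" timeout="15" start-delay="15" \\\n'
    '\tparams hostname="imsxen1" interface="lanplus"\n'
)
print(monitor_timeouts(example))
```

A value of None for a primitive would mean no timeout was found on its monitor op; for the definition above the sketch finds the 15 second timeout, which is why the error message is surprising.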

STONITH works properly, aside from the annoying messages with the latest version.

Here is the RPM version comparison:
v | SLE11-HAE-SP1-Updates | cluster-glue   | 1.0.5-0.5.1  | 1.0.6-0.3.7  | x86_64
v | SLE11-HAE-SP1-Updates | libglue2       | 1.0.5-0.5.1  | 1.0.6-0.3.7  | x86_64
v | SLE11-HAE-SP1-Updates | libpacemaker3  | 1.1.2-0.2.1  | 1.1.2-0.6.1  | x86_64
v | SLE11-HAE-SP1-Updates | pacemaker      | 1.1.2-0.2.1  | 1.1.2-0.6.1  | x86_64
v | SLE11-HAE-SP1-Updates | pacemaker-mgmt | 2.0.0-0.2.19 | 2.0.0-0.3.10 | x86_64

I intentionally rolled back the cluster-glue package, and the others were rolled
back to satisfy dependencies.  According to the RPM changelog, the "good"
version of cluster-glue (1.0.5-0.5.1) is from Upstream version cs: 6cf2e36df9f4,
while the newer one is from cs: a146a145a3e.

While it's possible this is a problem with Novell's builds, I don't think that
is likely, since there are no local patches in the RPM spec file.