[Pacemaker] On recovery of failed node, pengine fails to correctly monitor 'dirty' resources

Mon Aug 11 20:06:09 UTC 2014

----- Original Message -----
> Greetings,
> 
> We are using pacemaker and cman in a two-node cluster with no-quorum-policy:
> ignore and stonith-enabled: false on a CentOS 6 system (pacemaker-related
> RPM versions are listed below). We are seeing some bizarre (to us) behavior
> when a node is fully lost (e.g. reboot -nf). Here's the scenario we have:
> 
> 1) Fail a resource named "some-resource" started with the
> ocf:heartbeat:anything script (or others) on node01 (in our case, it's a
> master/slave resource we're pulling observations from, but it can happen on
> normal ones).
> 2) Wait for the resource to recover.
> 3) Fail node02 (reboot -nf, or power loss)
> 4) When node02 recovers, we see in /var/log/messages:
> - Quorum is recovered
> - Sending flush op to all hosts for master-some-resource,
> last-failure-some-resource, probe_complete(true),
> fail-count-some-resource(1)
> - pengine Processing failed op monitor for some-resource on node01: unknown
> error (1)
> * After adding a simple "`date` called with $@ >> /tmp/log.rsc", we do not
> see the resource agent being called at this time, on either node.
> * Sometimes, we see other operations happen that are also not being sent to
> the RA, including stop/start.
> * The resource is actually happily running on node01 throughout this whole
> process, so there's no reason we should be seeing this failure here.
> * This issue is only seen on resources that had not yet been cleaned up.
> Resources that were 'clean' when both nodes were last online do not have
> this issue.
> 
> We noticed this originally because we are using the ClusterMon RA to report
> on different types of errors, and this is giving us false positives. Any
> thoughts on configuration issues we could be having, or if this sounds like
> a bug in pacemaker somewhere?

This is likely a bug in whatever resource agent you are using. There's no way
for us to know for sure without logs.
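
If it helps when gathering those logs, the trace line you describe adding to
the agent could be wrapped in a small helper so every invocation is recorded
the same way. This is only a minimal sketch: the function name trace_call and
the TRACE_LOG variable are hypothetical, and the default path is simply the
/tmp/log.rsc from your message.

```shell
#!/bin/sh
# Hypothetical helper, not part of any shipped resource agent.
# TRACE_LOG defaults to the log path mentioned in the original message.
TRACE_LOG="${TRACE_LOG:-/tmp/log.rsc}"

trace_call() {
    # Append one line per invocation: a timestamp followed by the
    # arguments the agent was called with (e.g. "monitor", "start").
    printf '%s called with %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$TRACE_LOG"
}
```

Calling trace_call "$@" near the top of the agent, before it dispatches on the
requested action, would then show whether a monitor really reached the agent
at the time the pengine message appears.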

-- Vossel

> 
> Thanks!
> 
> ----
> Versions:
> ccs-0.16.2-69.el6_5.1.x86_64
> clusterlib-3.0.12.1-59.el6_5.2.x86_64
> cman-3.0.12.1-59.el6_5.2.x86_64
> corosync-1.4.1-17.el6_5.1.x86_64
> corosynclib-1.4.1-17.el6_5.1.x86_64
> fence-virt-0.2.3-15.el6.x86_64
> libqb-0.16.0-2.el6.x86_64
> modcluster-0.16.2-28.el6.x86_64
> openais-1.1.1-7.el6.x86_64
> openaislib-1.1.1-7.el6.x86_64
> pacemaker-1.1.10-14.el6_5.3.x86_64
> pacemaker-cli-1.1.10-14.el6_5.3.x86_64
> pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64
> pacemaker-libs-1.1.10-14.el6_5.3.x86_64
> pcs-0.9.90-2.el6.centos.3.noarch
> resource-agents-3.9.2-40.el6_5.7.x86_64
> ricci-0.16.2-69.el6_5.1.x86_64
> 