[Pacemaker] Info on failcount automatic reset

Tue Jun 24 23:28:16 UTC 2014

On 20 Jun 2014, at 11:29 pm, Gianluca Cecchi <gianluca.cecchi at gmail.com> wrote:

> Hello,
> when the monitor action for a resource times out, I think its failcount is incremented by 1, correct?
> If so, and the next monitor action succeeds, does the failcount automatically reset to zero, or does it stay at 1?
> In the latter case, is there any way to configure the cluster to reset it automatically when the next scheduled monitor completes OK? Or is it a job for the administrator to watch the failcount (e.g. in crm_mon output) and, after checking that all is OK, clean up the resource to reset the failcount value?
> 
> I ask because on a SLES 11 SP2 cluster, from which I only got the logs, I see messages like these:
> 
> Jun 15 00:01:18 node2 pengine: [4330]: notice: common_apply_stickiness: my_resource can fail 1 more times on node2 before being forced off
> ...
> Jun 15 03:38:42 node2 lrmd: [4328]: WARN: my_resource:monitor process (PID 27120) timed out (try 1).  Killing with signal SIGTERM (15).
> Jun 15 03:38:42 node2 lrmd: [4328]: WARN: operation monitor[29] on my_resource for client 4331: pid 27120 timed out
> Jun 15 03:38:42 node2 crmd: [4331]: ERROR: process_lrm_event: LRM operation my_resource_monitor_30000 (29) Timed Out (timeout=60000ms)
> Jun 15 03:38:42 node2 crmd: [4331]: info: process_graph_event: Detected action my_resource_monitor_30000 from a different transition: 40696 vs. 51755
> Jun 15 03:38:42 node2 crmd: [4331]: WARN: update_failcount: Updating failcount for my_resource on node2 after failed monitor: rc=-2 (update=value++, time=1402796322)
> ...
> Jun 15 03:38:42 node2 attrd: [4329]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-my_resource (3)
> ..
> Jun 15 03:38:42 node2 attrd: [4329]: notice: attrd_perform_update: Sent update 52: fail-count-my_resource=3
> ..
> Jun 15 03:38:42 node2 pengine: [4330]: WARN: common_apply_stickiness: Forcing my_resource away from node2 after 3 failures (max=3)
> 
> 
> So it seems that at midnight the resource already had a failcount of 2 (perhaps caused by problems that happened weeks ago?), and then at 03:38 a monitor timed out and the resource was relocated...
> 
> pacemaker is at 1.1.6-1.27.26
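
To see when the failcount actually changed, the attrd updates can be pulled out of the logs. The following is only a rough sketch: the regular expression is written against the log lines quoted above, the function name is made up for illustration, and other Pacemaker versions may format these messages differently.

```python
import re

# Assumption: the pattern below is modelled on the attrd line quoted above,
#   "... attrd_perform_update: Sent update 52: fail-count-my_resource=3"
# It is not an official log format specification.
FAILCOUNT_RE = re.compile(
    r"(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<node>\S+)\s+attrd:.*"
    r"fail-count-(?P<rsc>[^=\s]+)=(?P<count>\d+)"
)


def failcount_updates(lines):
    """Return (timestamp, node, resource, failcount) for each attrd update."""
    updates = []
    for line in lines:
        m = FAILCOUNT_RE.search(line)
        if m:
            updates.append(
                (m["stamp"], m["node"], m["rsc"], int(m["count"]))
            )
    return updates


# Two of the lines quoted above; only the second carries "name=value".
SAMPLE = [
    "Jun 15 03:38:42 node2 attrd: [4329]: notice: attrd_trigger_update: "
    "Sending flush op to all hosts for: fail-count-my_resource (3)",
    "Jun 15 03:38:42 node2 attrd: [4329]: notice: attrd_perform_update: "
    "Sent update 52: fail-count-my_resource=3",
]

if __name__ == "__main__":
    for update in failcount_updates(SAMPLE):
        print(update)
```

Run over the full log history, this would show at which dates the earlier two failures were recorded, assuming those logs are still available.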

I don't think the automatic reset was part of 1.1.6.
The documentation you're referring to is probably SLES12 specific.

> and I see this list message that seems related:
> http://oss.clusterlabs.org/pipermail/pacemaker/2012-August/015076.html
> 
> Is it perhaps only a matter of setting the meta attribute
> failure-timeout
> as explained in the High Availability Guide:
> https://www.suse.com/documentation/sle_ha/singlehtml/book_sleha/book_sleha.html#sec.ha.config.hawk.rsc
> 
> in particular
> 5.3.6. Specifying Resource Failover Nodes
> ...
> 4. If you want to automatically expire the failcount for a resource, add the failure-timeout meta attribute to the resource as described in Procedure 5.4: Adding Primitive Resources, Step 7 and enter a Value for the failure-timeout.
> ..
> ?
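
As an illustration of the behaviour being asked about, here is a toy model of how the failcount, the threshold of 3 seen in the logs (max=3), and failure-timeout interact. It is a simplified sketch, not Pacemaker code: the class and method names are invented, and it assumes a version where failure-timeout is honoured. A real cluster only re-evaluates expiry when the policy engine runs, not continuously.

```python
class FailCount:
    """Toy model of a per-node resource failcount (hypothetical names)."""

    def __init__(self, migration_threshold, failure_timeout=None):
        self.migration_threshold = migration_threshold
        self.failure_timeout = failure_timeout  # seconds, None = never expire
        self.count = 0
        self.last_failure = None

    def _expired(self, now):
        # Failures expire only if a timeout is set and enough time has
        # passed since the most recent failure.
        return (
            self.failure_timeout is not None
            and self.last_failure is not None
            and now - self.last_failure >= self.failure_timeout
        )

    def record_failure(self, now):
        if self._expired(now):
            self.count = 0
        self.count += 1
        self.last_failure = now

    def record_success(self, now):
        # A successful monitor does not touch the failcount.
        pass

    def effective(self, now):
        return 0 if self._expired(now) else self.count

    def forced_away(self, now):
        return self.effective(now) >= self.migration_threshold


if __name__ == "__main__":
    # Without failure-timeout: old failures persist until cleaned up.
    fc = FailCount(migration_threshold=3)
    fc.record_failure(now=0)
    fc.record_failure(now=100)
    fc.record_success(now=130)
    fc.record_failure(now=1402796322)  # timestamp from the log above
    print(fc.effective(1402796322), fc.forced_away(1402796322))

    # With failure-timeout=600: failures lapse after 600 quiet seconds.
    fc2 = FailCount(migration_threshold=3, failure_timeout=600)
    fc2.record_failure(now=0)
    fc2.record_failure(now=10)
    print(fc2.effective(100), fc2.effective(700))
```

In the first case the third failure pushes the resource off the node regardless of how old the first two are, which matches the log excerpt; in the second, the two early failures no longer count once the timeout has elapsed.
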
> 
> Thanks in advance,
> Gianluca
> 
