[Pacemaker] Info on failcount automatic reset

Gianluca Cecchi gianluca.cecchi at gmail.com
Fri Jun 20 13:29:28 UTC 2014


Hello,
when the monitor action for a resource times out, I think its failcount is
incremented by 1, correct?
If so, and the next monitor action succeeds, does the failcount value
automatically reset to zero, or does it stay at 1?
In the latter case, is there any way to configure the cluster to
automatically reset it when the following scheduled monitor completes OK?
Or is it the administrator's job to watch the failcount (e.g. in crm_mon
output) and then clean up the resource after checking that all is OK,
thereby resetting the failcount value?
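
For what it's worth, this is roughly how I check and reset it by hand
today (just a sketch, assuming crmsh and the standard Pacemaker tools; the
resource and node names are the ones from the logs below):

# show per-resource failcounts alongside the cluster status
crm_mon -1 -f

# after verifying the resource is actually healthy again, clear its
# failed operations (and with them the failcount) on that node
crm resource cleanup my_resource node2

Still, I would prefer that the cluster handle this on its own once the
resource is healthy again.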

I ask because on a SLES 11 SP2 cluster, from which I only have the logs, I
see these kinds of messages:

Jun 15 00:01:18 node2 pengine: [4330]: notice: common_apply_stickiness:
my_resource can fail 1 more times on node2 before being forced off
...
Jun 15 03:38:42 node2 lrmd: [4328]: WARN: my_resource:monitor process (PID
27120) timed out (try 1).  Killing with signal SIGTERM (15).
Jun 15 03:38:42 node2 lrmd: [4328]: WARN: operation monitor[29] on
my_resource for client 4331: pid 27120 timed out
Jun 15 03:38:42 node2 crmd: [4331]: ERROR: process_lrm_event: LRM operation
my_resource_monitor_30000 (29) Timed Out (timeout=60000ms)
Jun 15 03:38:42 node2 crmd: [4331]: info: process_graph_event: Detected
action my_resource_monitor_30000 from a different transition: 40696 vs.
51755
Jun 15 03:38:42 node2 crmd: [4331]: WARN: update_failcount: Updating
failcount for my_resource on node2 after failed monitor: rc=-2
(update=value++, time=1402796322)
...
Jun 15 03:38:42 node2 attrd: [4329]: notice: attrd_trigger_update: Sending
flush op to all hosts for: fail-count-my_resource (3)
..
Jun 15 03:38:42 node2 attrd: [4329]: notice: attrd_perform_update: Sent
update 52: fail-count-my_resource=3
..
Jun 15 03:38:42 node2 pengine: [4330]: WARN: common_apply_stickiness:
Forcing my_resource away from node2 after 3 failures (max=3)


So it seems that at midnight the resource already had a failcount of 2
(perhaps caused by problems that happened weeks ago..?), and then at 03:38
its monitor timed out and the resource was relocated...
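
If I read the "max=3" correctly, the resource has migration-threshold=3,
so I assume the configuration contains something along these lines (only a
sketch; the resource agent is a placeholder, while the monitor values match
the 30000ms interval and 60000ms timeout in the logs):

primitive my_resource ocf:heartbeat:Dummy \
    op monitor interval="30s" timeout="60s" \
    meta migration-threshold="3"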

Pacemaker is at 1.1.6-1.27.26, and I see this list message that seems
related:
http://oss.clusterlabs.org/pipermail/pacemaker/2012-August/015076.html

Is it perhaps only a matter of setting the failure-timeout meta parameter,
as explained in the High Availability Guide:
https://www.suse.com/documentation/sle_ha/singlehtml/book_sleha/book_sleha.html#sec.ha.config.hawk.rsc

in particular
5.3.6. Specifying Resource Failover Nodes
...
4. If you want to automatically expire the failcount for a resource, add
the failure-timeout meta attribute to the resource as described in
Procedure 5.4: Adding Primitive Resources, Step 7 and enter a Value for the
failure-timeout.
..
?
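
In other words, something like this (just a sketch, assuming crmsh; the
10-minute value is only an example):

# let recorded failures expire automatically after 600 seconds
crm resource meta my_resource set failure-timeout 600

so that the primitive would end up with, say,
meta migration-threshold="3" failure-timeout="600" in its configuration.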

Thanks in advance,
Gianluca