[ClusterLabs] Antw: Coming in Pacemaker 1.1.17: Per-operation fail counts

Ken Gaillot kgaillot at redhat.com
Tue Apr 4 10:10:20 EDT 2017


On 04/04/2017 01:18 AM, Ulrich Windl wrote:
>>>> Ken Gaillot <kgaillot at redhat.com> wrote on 03.04.2017 at 17:00 in message
> <ae3a7cf4-2ef7-4c4f-ae3f-39f473ed6c01 at redhat.com>:
>> Hi all,
>>
>> Pacemaker 1.1.17 will have a significant change in how it tracks
>> resource failures, though the change will be mostly invisible to users.
>>
>> Previously, Pacemaker tracked a single count of failures per resource --
>> for example, start failures and monitor failures for a given resource
>> were added together.
> 
> That is "per resource operation", not "per resource" ;-)

I mean that there was only a single number to count failures for a given
resource; before this change, failures were not remembered separately by
operation.

>> In a thread on this list last year[1], we discussed adding some new
>> failure handling options that would require tracking failures for each
>> operation type.
> 
> So was the existing set of operation failures restricted to start/stop/monitor? What about master/slave resources, which have two monitor operations?

No, both previously and with the new changes, all operation failures are
counted (well, except metadata!). The only change is whether they are
remembered per resource or per operation.

>> Pacemaker 1.1.17 will include this tracking, in preparation for adding
>> the new options in a future release.
>>
>> Whereas previously, failure counts were stored in node attributes like
>> "fail-count-myrsc", they will now be stored in multiple node attributes
>> like "fail-count-myrsc#start_0" and "fail-count-myrsc#monitor_10000"
>> (the number distinguishes monitors with different intervals).
> 
> Wouldn't it be thinkable to store it as a (transient) resource attribute, either local to a node (LRM) or including the node attribute (CRM)?

Failures are specific to the node the failure occurred on, so it makes
sense to store them as transient node attributes.

So, to be more precise, we previously recorded failures per
node+resource combination, and now we record them per
node+resource+operation+interval combination.
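
Since these are plain transient node attributes, the usual attribute tools
can read them. A rough illustration (the resource name "myrsc", node name
"node1", and the 10-second monitor below are placeholders, not anything
from a real configuration):

    # total fail count for the resource, summed across operations --
    # same meaning as before this change
    crm_failcount -r myrsc --query

    # one per-operation counter, read directly as a transient node attribute
    attrd_updater -Q -N node1 -n 'fail-count-myrsc#monitor_10000'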

>> Actual cluster behavior will be unchanged in this release (and
>> backward-compatible); the cluster will sum the per-operation fail counts
>> when checking against options such as migration-threshold.
>>
>> The part that will be visible to the user in this release is that the
>> crm_failcount and crm_resource --cleanup tools will now be able to
>> handle individual per-operation fail counts if desired, though by
>> default they will still affect the total fail count for the resource.
> 
> Another thing to think about would be "fail count" vs. "fail rate": currently there is a fail count and a reset interval, from which a failure rate can be derived. Many users may simply require that a resource shouldn't fail again and again, but with long uptimes (and an operator who forgets to reset fail counters), occasional failures (say, once in two weeks) shouldn't prevent a resource from running.

Yes, we discussed that a bit in the earlier thread. Tracking a failure
rate instead would be too much of an incompatible change, and it would add
considerable complexity.

Failure clearing hasn't changed -- failures can only be cleared by
manual commands, the failure-timeout option, or a restart of cluster
services on a node.

For the example you mentioned, a high failure-timeout is the best answer
we have. You could set a failure-timeout of 24 hours, and if the
resource went 24 hours without any failures, any older failures would be
forgotten.
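
As a sketch of that approach (the resource name "myrsc" and the 24-hour
value are just this example, and the long-form crm_resource options are
from memory, so double-check them against your version):

    # forget failures automatically once the resource has gone 24 hours
    # without a new one
    crm_resource --resource myrsc --meta \
        --set-parameter failure-timeout --parameter-value 24h

    # or clear the resource's fail count and failure history on demand
    crm_resource --cleanup --resource myrsc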

>> As an example, if "myrsc" has one start failure and one monitor failure,
>> "crm_failcount -r myrsc --query" will still show 2, but now you can also
>> say "crm_failcount -r myrsc --query --operation start" which will show 1.
> 
> Would accumulated monitor failures ever prevent a resource from starting, or would they force a stop of the resource?

As of this release, failure recovery behavior has not changed. All
operation failures are added together to produce a single fail count per
resource, as was recorded before. The only thing that changed is how
they're recorded.

Failure recovery is controlled by the resource's migration-threshold and
the operation's on-fail. By default, on-fail=restart and
migration-threshold=INFINITY, so a monitor failure would result in
1,000,000 restarts before being banned from the failing node.
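
If you want a lower ceiling than the default, migration-threshold is a
per-resource meta-attribute; something like this (with "myrsc" and the
value 3 as placeholders) would ban the resource from a node after three
failures there:

    # allow at most 3 failures on a node before moving the resource away
    crm_resource --resource myrsc --meta \
        --set-parameter migration-threshold --parameter-value 3

on-fail, on the other hand, is set per operation in the resource's op
definitions (e.g. on-fail="restart" on the monitor), so different
operations on the same resource can be handled differently.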

> Regards,
> Ulrich
> 
>>
>> Additionally, crm_failcount --delete previously only reset the
>> resource's fail count, but it now behaves identically to crm_resource
>> --cleanup (resetting the fail count and clearing the failure history).
>>
>> Special note for pgsql users: Older versions of common pgsql resource
>> agents relied on a behavior of crm_failcount that is now rejected. While
>> the impact is limited, users are recommended to make sure they have the
>> latest version of their pgsql resource agent before upgrading to
>> pacemaker 1.1.17.
>>
>> [1] http://lists.clusterlabs.org/pipermail/users/2016-September/004096.html 



