[Pacemaker] clear failcount when monitor is successful?

Wed Apr 24 07:40:16 EDT 2013

On 24-04-13 13:24, Lars Marowsky-Bree wrote:
> On 2013-04-24T10:37:24, Johan Huysmans <johan.huysmans at inuits.be> wrote:
>
>> --> start situation
>> * scope=status  name=fail-count-d_tomcat value=0
>> * depending resource group running on node
>> * crm_mon shows everything ok
>>
>> --> a failure occurs
>> * scope=status  name=fail-count-d_tomcat value=1
>> * depending resource group stopping on node
>> * crm_mon shows failure
>>
>> --> After 30s (= failure-timeout)
>> * scope=status  name=fail-count-d_tomcat value=1
>> * depending resource group not running on node
>> * crm_mon shows NO failure !!!!!
> This, by itself, is not necessarily surprising. The property
> "cluster-recheck-interval" defines how often the PE gets re-run, and
> defaults to 15 minutes.
>
> This is not dynamically adjusted based on failure-timeouts, and if this
> feature becomes more widely used, there probably should be a "better"
> way to handle/trigger these while still avoiding swamping the cluster
> with empty transitions etc.
>
> In short: right now, if you want a failure-timeout of 30s to be
> meaningful, you need to set cluster-recheck-interval to something
> shorter.
>
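To illustrate the timing described above, here is a rough sketch in Python (not Pacemaker code). It assumes the PE ran at the moment of the failure and afterwards runs only on the recheck timer, with no other cluster events in between. The function name and the simplified model are my own assumptions, purely for illustration.

```python
def seconds_until_failure_cleared(failure_timeout_s: int, recheck_interval_s: int) -> int:
    """Seconds after a failure until an expired failcount is actually acted on.

    Simplified model (an assumption, not Pacemaker's real logic): the policy
    engine ran at the time of the failure and then runs again only every
    recheck_interval_s seconds, with no other cluster events in between.
    """
    if failure_timeout_s <= 0 or recheck_interval_s <= 0:
        raise ValueError("both intervals must be positive")
    # Ceiling division: number of recheck ticks until the timeout has elapsed.
    ticks = -(-failure_timeout_s // recheck_interval_s)
    return ticks * recheck_interval_s


if __name__ == "__main__":
    # failure-timeout=30s with the default 15 minute recheck interval
    print(seconds_until_failure_cleared(30, 15 * 60))
    # failure-timeout=30s with a recheck interval shorter than the timeout
    print(seconds_until_failure_cleared(30, 20))
```

Under this model a 30s failure-timeout only takes effect at the next recheck, so with the default interval the expiry is noticed up to 15 minutes after the failure unless something else changes in the cluster first.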
>> --> After something changes in the cluster or the recheck interval
>> * scope=status  name=fail-count-d_tomcat value=0
>> * depending resource group can run on node
>> * crm_mon shows no failure
>> * BUT my resource is still monitored and failing!
> I'm not sure I perfectly get what you're saying here with the last
> sentence. Did the cluster try to restart it, and it failed again, yet
> the failure was ignored this time around?
The cluster didn't stop or restart my cloned resource, but it is still
monitoring it, which is expected since I configured on-fail to block.
I can see that the monitor action of my OCF resource agent is executed
every 15s (= monitor interval) and that it is still failing (returning
$OCF_ERR_GENERIC).
>
>> I find it disturbing that a resource with a failing monitor has a 0
>> failcount, shows ok in crm_mon and allows the dependent resources
>> to run.
> Yes, if I got that right, that would be a problem - please create a
> hb_/crm_report and open a bug.

Ok, will create a crm_report containing my tests.
>
>
>
> Regards,
>      Lars
>