[Pacemaker] clear failcount when monitor is successful?

Wed Apr 24 13:34:45 UTC 2013

Bug and crm_report created: http://bugs.clusterlabs.org/show_bug.cgi?id=5021

Regards,
Johan

On 24-04-13 13:40, Johan Huysmans wrote:
>
> On 24-04-13 13:24, Lars Marowsky-Bree wrote:
>> On 2013-04-24T10:37:24, Johan Huysmans <johan.huysmans at inuits.be> wrote:
>>
>>> --> start situation
>>> * scope=status  name=fail-count-d_tomcat value=0
>>> * depending resource group running on node
>>> * crm_mon shows everything ok
>>>
>>> --> a failure occurs
>>> * scope=status  name=fail-count-d_tomcat value=1
>>> * depending resource group stopping on node
>>> * crm_mon shows failure
>>>
>>> --> After 30s (= failure-timeout)
>>> * scope=status  name=fail-count-d_tomcat value=1
>>> * depending resource group not running on node
>>> * crm_mon shows NO failure !!!!!
>> This, by itself, is not necessarily surprising. The property
>> "cluster-recheck-interval" defines how often the PE gets re-run, and
>> defaults to 15 minutes.
>>
>> This is not dynamically adjusted based on failure-timeouts, and if this
>> feature becomes more widely used, there probably should be a "better"
>> way to handle/trigger these while still avoiding swamping the cluster
>> with empty transitions etc.
>>
>> In short: right now, if you want a failure-timeout of 30s to be
>> meaningful, you need to set cluster-recheck-interval to something
>> shorter.
>>
>>> --> After something changes in the cluster or the recheck interval
>>> * scope=status  name=fail-count-d_tomcat value=0
>>> * depending resource group can run on node
>>> * crm_mon shows no failure
>>> * BUT my resource is still monitored and failing!
>> I'm not sure I perfectly get what you're saying here with the last
>> sentence. Did the cluster try to restart it, and it failed again, yet
>> the failure was ignored this time around?
> The cluster didn't stop or restart my cloned resource, but it is still
> monitoring it, which is expected because I configured on-fail=block.
> I see that the monitor action of my OCF resource agent is executed
> every 15s (= monitor interval), and that it is still failing
> (returning $OCF_ERR_GENERIC).
>>
>>> I find it disturbing that a resource with a failing monitor has a
>>> failcount of 0, shows as ok in crm_mon, and allows the dependent
>>> resources to run.
>> Yes, if I got that right, that would be a problem - please create a
>> hb_/crm_report and open a bug.
>
> Ok, will create a crm_report containing my tests.
>>
>>
>>
>> Regards,
>>      Lars
>>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
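
To illustrate Lars's point about failure-timeout and cluster-recheck-interval, here is a rough timing sketch in Python. It is a toy model, not Pacemaker code. It assumes that the recheck timer restarts at the transition triggered by the failure and that nothing else happens in the cluster in the meantime, so an expired failure is only noticed at the first periodic PE run at or after the failure-timeout. The function name is made up for this example.

```python
def seconds_until_expiry_noticed(failure_timeout_s: int, recheck_interval_s: int) -> int:
    """Toy model: seconds after a failure until the PE notices that it expired.

    Assumptions (not taken from Pacemaker's source):
      * the recheck timer restarts at the transition caused by the failure;
      * no other cluster event triggers a PE run in between;
      * the PE therefore runs at recheck_interval_s, 2 * recheck_interval_s, ...
    """
    if failure_timeout_s <= 0 or recheck_interval_s <= 0:
        raise ValueError("both values must be positive")
    # Number of periodic PE runs needed to reach the failure-timeout
    # (integer ceiling division, avoids floating point).
    runs = -(-failure_timeout_s // recheck_interval_s)
    return runs * recheck_interval_s


if __name__ == "__main__":
    # failure-timeout=30s with the default 15 minute recheck interval
    print(seconds_until_expiry_noticed(30, 15 * 60))
    # failure-timeout=30s with a shorter recheck interval of 20s
    print(seconds_until_expiry_noticed(30, 20))
```

Under these assumptions, a 30s failure-timeout combined with the default 15 minute recheck interval is only acted upon after about 15 minutes, unless some other cluster event triggers a PE run earlier. This is why a shorter cluster-recheck-interval is needed for the timeout to be meaningful.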