[ClusterLabs] clearing failed actions

Ken Gaillot kgaillot at redhat.com
Wed Jun 7 16:14:05 EDT 2017


On 06/01/2017 02:44 PM, Attila Megyeri wrote:
> Ken,
> 
> I noticed something strange, this might be the issue.
> 
> In some cases, even the manual cleanup does not work.
> 
> I have a failed action of resource "A" on node "a". DC is node "b".
> 
> e.g.
> 	Failed actions:
>     jboss_imssrv1_monitor_10000 (node=ctims1, call=108, rc=1, status=complete, last-rc-change=Thu Jun  1 14:13:36 2017
> 
> 
> When I attempt to do a "crm resource cleanup A" from node "b", nothing happens. Basically the lrmd on "a" is not notified that it should monitor the resource.
> 
> 
> When I execute a "crm resource cleanup A" command on node "a" (where the operation failed), the failed action is cleared properly.
> 
> Why could this be happening?
> Which component should be responsible for this? pengine, crmd, lrmd?

The crm shell will send commands to attrd (to clear fail counts) and to
crmd (to clear the resource history); those changes are then recorded in
the CIB.

I'm not sure how crm shell implements it, but crm_resource sends
individual messages to each node when cleaning up a resource without
specifying a particular node. You could check the pacemaker log on each
node to see whether attrd and crmd are receiving those commands, and
what they do in response.
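
For example, something along these lines, using the resource and node
names from your "Failed actions" output above (the log file location
varies by distro and logging configuration, so adjust the path):

    # run the cleanup from the DC, but name the node explicitly
    crm_resource --cleanup --resource jboss_imssrv1 --node ctims1

    # then, on each node, see what attrd/crmd logged about it
    grep jboss_imssrv1 /var/log/pacemaker.log | grep -iE "attrd|crmd"

If the explicitly targeted cleanup works from the DC while the
unqualified "crm resource cleanup" does not, that narrows it down to how
the per-node requests are built or delivered.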


>> -----Original Message-----
>> From: Attila Megyeri [mailto:amegyeri at minerva-soft.com]
>> Sent: Thursday, June 1, 2017 6:57 PM
>> To: kgaillot at redhat.com; Cluster Labs - All topics related to open-source
>> clustering welcomed <users at clusterlabs.org>
>> Subject: Re: [ClusterLabs] clearing failed actions
>>
>> thanks Ken,
>>
>>
>>
>>
>>
>>> -----Original Message-----
>>> From: Ken Gaillot [mailto:kgaillot at redhat.com]
>>> Sent: Thursday, June 1, 2017 12:04 AM
>>> To: users at clusterlabs.org
>>> Subject: Re: [ClusterLabs] clearing failed actions
>>>
>>> On 05/31/2017 12:17 PM, Ken Gaillot wrote:
>>>> On 05/30/2017 02:50 PM, Attila Megyeri wrote:
>>>>> Hi Ken,
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Ken Gaillot [mailto:kgaillot at redhat.com]
>>>>>> Sent: Tuesday, May 30, 2017 4:32 PM
>>>>>> To: users at clusterlabs.org
>>>>>> Subject: Re: [ClusterLabs] clearing failed actions
>>>>>>
>>>>>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Shouldn't the
>>>>>>>
>>>>>>> cluster-recheck-interval="2m"
>>>>>>>
>>>>>>> property instruct pacemaker to recheck the cluster every 2 minutes
>>>>>>> and clean the failcounts?
>>>>>>
>>>>>> It instructs pacemaker to recalculate whether any actions need to be
>>>>>> taken (including expiring any failcounts appropriately).
>>>>>>
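
As a sanity check, assuming crm shell is in use, something like the
following should show whether both settings actually made it into the
CIB (cib-bootstrap-options is the usual id of the cluster property set;
substitute your resource name, e.g. jboss_admin2 from your later output):

    crm configure show cib-bootstrap-options   # should list cluster-recheck-interval
    crm configure show jboss_admin2            # should list the meta attributes
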
>>>>>>> At the primitive level I also have a
>>>>>>>
>>>>>>> migration-threshold="30" failure-timeout="2m"
>>>>>>>
>>>>>>> but whenever I have a failure, it remains there forever.
>>>>>>>
>>>>>>> What could be causing this?
>>>>>>>
>>>>>>> thanks,
>>>>>>>
>>>>>>> Attila
>>>>>> Is it a single old failure, or a recurring failure? The failure timeout
>>>>>> works in a somewhat nonintuitive way. Old failures are not individually
>>>>>> expired. Instead, all failures of a resource are simultaneously cleared
>>>>>> if all of them are older than the failure-timeout. So if something keeps
>>>>>> failing repeatedly (more frequently than the failure-timeout), none of
>>>>>> the failures will be cleared.
>>>>>>
>>>>>> If it's not a repeating failure, something odd is going on.
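
A quick worked example of that rule with failure-timeout="2m": failures
recorded at 14:00:00 and 14:01:00 both become older than two minutes at
14:03:00 and are cleared together at the next recheck after that point;
but if a third failure lands at 14:02:30, nothing is cleared until all
three are at least two minutes old.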
>>>>>
>>>>> It is not a repeating failure. Let's say that a resource fails for whatever
>>>>> action; it will remain in the failed actions (crm_mon -Af) until I issue a
>>>>> "crm resource cleanup <resource name>", even after days or weeks, even
>>>>> though I see in the logs that the cluster is rechecked every 120 seconds.
>>>>>
>>>>> How could I troubleshoot this issue?
>>>>>
>>>>> thanks!
>>>>
>>>>
>>>> Ah, I see what you're saying. That's expected behavior.
>>>>
>>>> The failure-timeout applies to the failure *count* (which is used for
>>>> checking against migration-threshold), not the failure *history* (which
>>>> is used for the status display).
>>>>
>>>> The idea is to have it no longer affect the cluster behavior, but still
>>>> allow an administrator to know that it happened. That's why a manual
>>>> cleanup is required to clear the history.
>>>
>>> Hmm, I'm wrong there ... failure-timeout does expire the failure history
>>> used for status display.
>>>
>>> It works with the current versions. It's possible 1.1.10 had issues with
>>> that.
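
Either way, you can look at the two pieces separately. Something like

    crm_failcount --query --resource jboss_admin2 --node ctadmin2

(option spellings may differ slightly between versions) queries the
fail-count attribute that migration-threshold and failure-timeout operate
on, while the "Failed actions" section of crm_mon -Af, backed by the
last_failure entries in the CIB status section, is the operation history
used for the display.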
>>>
>>
>> Well, if nothing helps I will try to upgrade to a more recent version...
>>
>>
>>
>>> Check the status to see which node is DC, and look at the pacemaker log
>>> there after the failure occurred. There should be a message about the
>>> failcount expiring. You can also look at the live CIB and search for
>>> last_failure to see what is used for the display.
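
Concretely, something along these lines (the log file location varies by
distro and logging setup, so adjust the path):

    crm_mon -1 | grep "Current DC"              # which node is DC
    grep jboss_admin2 /var/log/pacemaker.log    # messages mentioning the resource, incl. any failcount expiry
    cibadmin --query | grep -i last_failure     # the stored failure entries the display is built from
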
>> [AM]
>>
>> In the pacemaker log, I see the following at every recheck interval:
>>
>> Jun 01 16:54:08 [8700] ctabsws2    pengine:  warning: unpack_rsc_op:
>> Processing failed op start for jboss_admin2 on ctadmin2: unknown error (1)
>>
>> If I check the CIB for the failure I see:
>>
>> <nvpair id="status-168362322-last-failure-jboss_admin2"
>>         name="last-failure-jboss_admin2" value="1496326649"/>
>>
>> <lrm_rsc_op id="jboss_admin2_last_failure_0" operation_key="jboss_admin2_start_0"
>>             operation="start" crm-debug-origin="do_update_resource"
>>             crm_feature_set="3.0.7"
>>             transition-key="73:4:0:0a88f6e6-4ed1-4b53-88ad-3c568ca3daa8"
>>             transition-magic="2:1;73:4:0:0a88f6e6-4ed1-4b53-88ad-3c568ca3daa8"
>>             call-id="114" rc-code="1" op-status="2" interval="0"
>>             last-run="1496326469" last-rc-change="1496326469"
>>             exec-time="180001" queue-time="0"
>>             op-digest="8ec02bcea0bab86f4a7e9e27c23bc88b"/>
>>
>>
>> I really have no clue why this isn't cleared...
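
A couple of notes on that output, for what they're worth: if I'm reading
the op status codes right, op-status="2" means the start operation timed
out (exec-time="180001" ms matches a 180s timeout), and rc-code="1" is the
generic "unknown error" shown in the pengine warning. The warning at every
recheck just means the policy engine re-reads that stored failed op from
the status section each time it runs. If your cibadmin supports --xpath,
you can pull that one entry directly to watch whether it ever expires:

    cibadmin --query --xpath "//lrm_rsc_op[@id='jboss_admin2_last_failure_0']"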



