[ClusterLabs] clearing failed actions

Attila Megyeri amegyeri at minerva-soft.com
Thu Jun 1 12:57:09 EDT 2017


Thanks Ken,

> -----Original Message-----
> From: Ken Gaillot [mailto:kgaillot at redhat.com]
> Sent: Thursday, June 1, 2017 12:04 AM
> To: users at clusterlabs.org
> Subject: Re: [ClusterLabs] clearing failed actions
> 
> On 05/31/2017 12:17 PM, Ken Gaillot wrote:
> > On 05/30/2017 02:50 PM, Attila Megyeri wrote:
> >> Hi Ken,
> >>
> >>
> >>> -----Original Message-----
> >>> From: Ken Gaillot [mailto:kgaillot at redhat.com]
> >>> Sent: Tuesday, May 30, 2017 4:32 PM
> >>> To: users at clusterlabs.org
> >>> Subject: Re: [ClusterLabs] clearing failed actions
> >>>
> >>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> >>>> Hi,
> >>>>
> >>>> Shouldn't the
> >>>>
> >>>> cluster-recheck-interval="2m"
> >>>>
> >>>> property instruct pacemaker to recheck the cluster every 2 minutes
> >>>> and clean the failcounts?
> >>>
> >>> It instructs pacemaker to recalculate whether any actions need to be
> >>> taken (including expiring any failcounts appropriately).
> >>>
> >>>> At the primitive level I also have
> >>>>
> >>>> migration-threshold="30" failure-timeout="2m"
> >>>>
> >>>> but whenever I have a failure, it remains there forever.
> >>>>
> >>>> What could be causing this?
> >>>>
> >>>> thanks,
> >>>> Attila
> >>> Is it a single old failure, or a recurring failure? The failure timeout
> >>> works in a somewhat nonintuitive way. Old failures are not individually
> >>> expired. Instead, all failures of a resource are simultaneously cleared
> >>> if all of them are older than the failure-timeout. So if something keeps
> >>> failing repeatedly (more frequently than the failure-timeout), none of
> >>> the failures will be cleared.
> >>>
> >>> If it's not a repeating failure, something odd is going on.
> >>
> >> It is not a repeating failure. Let's say that a resource fails for whatever
> >> action: it will remain in the failed actions (crm_mon -Af) until I issue a
> >> "crm resource cleanup <resource name>", even after days or weeks, even
> >> though I see in the logs that the cluster is rechecked every 120 seconds.
> >>
> >> How could I troubleshoot this issue?
> >>
> >> thanks!
> >
> >
> > Ah, I see what you're saying. That's expected behavior.
> >
> > The failure-timeout applies to the failure *count* (which is used for
> > checking against migration-threshold), not the failure *history* (which
> > is used for the status display).
> >
> > The idea is to have it no longer affect the cluster behavior, but still
> > allow an administrator to know that it happened. That's why a manual
> > cleanup is required to clear the history.
> 
> Hmm, I'm wrong there ... failure-timeout does expire the failure history
> used for status display.
> 
> It works in current versions; it's possible 1.1.10 had issues with that.
> 

Well, if nothing helps I will try to upgrade to a more recent version.
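
For reference, the relevant bits of the configuration look roughly like this
(a crm shell sketch; the agent and the operation values here are illustrative,
not the exact config):

  crm configure property cluster-recheck-interval="2m"
  crm configure primitive jboss_admin2 ocf:heartbeat:jboss \
        meta migration-threshold="30" failure-timeout="2m" \
        op monitor interval="30s" timeout="60s"

And to restate Ken's point above with numbers: with failure-timeout="2m", a
resource that keeps failing every 90 seconds would never have its failures
expired, because at no point are *all* of them older than two minutes.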



> Check the status to see which node is DC, and look at the pacemaker log
> there after the failure occurred. There should be a message about the
> failcount expiring. You can also look at the live CIB and search for
> last_failure to see what is used for the display.
[AM] 
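To gather the info below, I ran roughly the following on the DC (a sketch;
the pacemaker log location varies by distro and configuration):

  crm_mon -1 | grep "Current DC"             # identify the DC node
  grep unpack_rsc_op /var/log/pacemaker.log  # or wherever pacemaker logs
  cibadmin --query | grep -i "last.failure"  # matches last_failure/last-failure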

In the pacemaker log, I see the following line at every recheck interval:

Jun 01 16:54:08 [8700] ctabsws2    pengine:  warning: unpack_rsc_op:    Processing failed op start for jboss_admin2 on ctadmin2: unknown error (1)

If I check the CIB for the failure, I see:

<nvpair id="status-168362322-last-failure-jboss_admin2" name="last-failure-jboss_admin2" value="1496326649"/>
            <lrm_rsc_op id="jboss_admin2_last_failure_0" operation_key="jboss_admin2_start_0" operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.0.7" transition-key="73:4:0:0a88f6e6-4ed1-4b53-88ad-3c568ca3daa8" transition-magic="2:1;73:4:0:0a88f6e6-4ed1-4b53-88ad-3c568ca3daa8" call-id="114" rc-code="1" op-status="2" interval="0" last-run="1496326469" last-rc-change="1496326469" exec-time="180001" queue-time="0" op-digest="8ec02bcea0bab86f4a7e9e27c23bc88b"/>
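
A quick sanity check of those timestamps: last-rc-change="1496326469" plus
exec-time="180001" (ms, i.e. the start operation ran into its 180 s timeout,
op-status="2") gives 1496326649, which matches the last-failure value above.
With failure-timeout="2m", this entry should therefore have been eligible to
expire from roughly 1496326649 + 120 = 1496326769 onwards, i.e. within minutes
of the failure.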


I really have no clue why this isn't cleared...
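
In case it is useful for troubleshooting, the underlying attributes can also
be queried (and deleted) directly, with something like the following (a
sketch; option syntax may vary between pacemaker versions):

  crm_failcount -G -r jboss_admin2 -N ctadmin2      # query the fail count
  crm_attribute -t status -N ctadmin2 \
        -n last-failure-jboss_admin2 -G             # query last-failure time
  crm resource cleanup jboss_admin2                 # the manual cleanup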



