[Pacemaker] Expired fail-count doesn't get cleaned up.
Jane Du (jadu)
jadu at cisco.com
Thu Aug 16 01:15:27 CEST 2012
Hi Andrew,
Can you please let me know what the next version number will be?
And where can I find the release information?
Thanks,
Jane
-----Original Message-----
From: Andrew Beekhof [mailto:andrew at beekhof.net]
Sent: Wednesday, August 15, 2012 3:02 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Expired fail-count doesn't get cleaned up.
On Wed, Aug 15, 2012 at 3:01 PM, Jane Du (jadu) <jadu at cisco.com> wrote:
> Hi Andrew:
>
> The fail-count is not cleaned up in my setup after 2s.
Based on some more information from David, it appears that there was
one more bug lurking that prevented cleanup.
This has now been fixed for the next version.
> This is a two-node configuration and only one node is brought up.
> When the resource is killed, I expect it to come back up after 2s even when no peer
> is available.
> The configuration and logs follow. Please shed some light.
>
> crm_mon -f
> ============
> Last updated: Tue Aug 14 21:13:42 2012
> Stack: Heartbeat
> Current DC: vm-mgmt-1 (8acb717d-2c8b-4484-8165-efeeb75dc949) - partition with quorum
> Version: 1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
> 1 Nodes configured, unknown expected votes
> 1 Resources configured.
> ============
>
> Online: [ vm-mgmt-1 ]
> Migration summary:
> * Node vm-mgmt-1:
>    mhm_ra: migration-threshold=1 fail-count=1 last-failure='Tue Aug 14 21:13:37 2012'
>
>
> crm configure show:
> node $id="8acb717d-2c8b-4484-8165-efeeb75dc949" vm-mgmt-1
> primitive mhm_ra ocf:heartbeat:anything \
> params binfile="mhm_ra.bin" workdir="/usr/bin/" pidfile="/var/run/mhm_ra.pid" \
> op monitor on-fail="restart" interval="1s"
> property $id="cib-bootstrap-options" \
> dc-version="1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
> cluster-infrastructure="Heartbeat" \
> stonith-enabled="false"
> rsc_defaults $id="rsc-options" \
> resource-stickiness="100" \
> migration-threshold="1" \
> failure-timeout="2s"
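(Aside, not part of the original configuration: even with failure-timeout set,
an expired fail-count is only acted on the next time the policy engine runs.
A minimal sketch, assuming the crm shell and the standard cluster-recheck-interval
cluster property, of making that re-evaluation happen more often than the
15-minute default:

  crm configure property cluster-recheck-interval="2min"

With the default recheck interval, a 2s failure-timeout will not be noticed
anywhere near that quickly on its own.)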
>
> Kill the mhm_ra process; after two seconds:
> [root at vm-mgmt-1 ~]# crm_failcount -G -U vm-mgmt-1 -r mhm_ra
> scope=status name=fail-count-mhm_ra value=1
> [root at vm-mgmt-1 ~]# crm_failcount -G -U vm-mgmt-1 -r mhm_ra
> scope=status name=fail-count-mhm_ra value=1
> [root at vm-mgmt-1 ~]# crm_failcount -G -U vm-mgmt-1 -r mhm_ra
> scope=status name=fail-count-mhm_ra value=1
> [root at vm-mgmt-1 ~]#
> [root at vm-mgmt-1 ~]# date
> Tue Aug 14 21:17:09 PDT 2012
> [root at vm-mgmt-1 ~]# crm_failcount -G -U vm-mgmt-1 -r mhm_ra
>
> scope=status name=fail-count-mhm_ra value=1
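(If the attribute stays stuck like this, it can also be cleared by hand; a
minimal sketch using the resource cleanup mentioned later in this thread --
crm_failcount's --delete option should work as well:

  # clear the stored failure history for the resource on this node
  crm resource cleanup mhm_ra vm-mgmt-1
  # confirm the fail-count attribute is gone
  crm_failcount -G -U vm-mgmt-1 -r mhm_ra
)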
> Aug 14 21:13:36 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process
> Aug 14 21:13:36 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
> Aug 14 21:13:36 vm-mgmt-1 crmd: [28547]: info: process_lrm_event: LRM operation mhm_ra_monitor_1000 (call=3, rc=1, cib-update=51, confirmed=false) unknown error
> Aug 14 21:13:36 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...
> Aug 14 21:13:37 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process
> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: process_graph_event: Detected action mhm_ra_monitor_1000 from a different transition: 1 vs. 2
> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: abort_transition_graph: process_graph_event:473 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=mhm_ra_monitor_1000, magic=0:1;6:1:0:14c32c8c-ee5e-485c-8b17-9468a96ac127, cib=0.7.26) : Old event
> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: WARN: update_failcount: Updating failcount for mhm_ra on vm-mgmt-1 after failed monitor: rc=1 (update=value++, time=1345004017)
> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: find_hash_entry: Creating hash entry for fail-count-mhm_ra
> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_local_callback: Expanded fail-count-mhm_ra=value++ to 1
> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-mhm_ra (1)
> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke: Query 52: Requesting the current CIB: S_POLICY_ENGINE
> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...
> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_perform_update: Sent update 31: fail-count-mhm_ra=1
> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: find_hash_entry: Creating hash entry for last-failure-mhm_ra
> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-mhm_ra (1345004017)
> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_perform_update: Sent update 34: last-failure-mhm_ra=1345004017
> Aug 14 21:13:38 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process
> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke_callback: Invoking the PE: query=52, ref=pe_calc-dc-1345004018-21, seq=1, quorate=1
> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: abort_transition_graph: te_update_diff:149 - Triggered transition abort (complete=1, tag=nvpair, id=status-8acb717d-2c8b-4484-8165-efeeb75dc949-fail-count-mhm_ra, magic=NA, cib=0.7.27) : Transient attribute: update
> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: abort_transition_graph: te_update_diff:149 - Triggered transition abort (complete=1, tag=nvpair, id=status-8acb717d-2c8b-4484-8165-efeeb75dc949-last-failure-mhm_ra, magic=NA, cib=0.7.28) : Transient attribute: update
> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke: Query 53: Requesting the current CIB: S_POLICY_ENGINE
> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke: Query 54: Requesting the current CIB: S_POLICY_ENGINE
> Aug 14 21:13:38 vm-mgmt-1 pengine: [21194]: notice: unpack_rsc_op: Operation mhm_ra_monitor_0 found resource mhm_ra active on vm-mgmt-1
> Aug 14 21:13:38 vm-mgmt-1 pengine: [21194]: notice: unpack_rsc_op: Ignoring expired failure mhm_ra_monitor_1000 (rc=1, magic=0:1;6:1:0:14c32c8c-ee5e-485c-8b17-9468a96ac127) on vm-mgmt-1
> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...
> Aug 14 21:13:38 vm-mgmt-1 pengine: [21194]: notice: native_print: mhm_ra (ocf::heartbeat:anything): Started vm-mgmt-1
> Aug 14 21:13:38 vm-mgmt-1 pengine: [21194]: notice: LogActions: Leave mhm_ra (Started vm-mgmt-1)
> Aug 14 21:13:39 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process
> Aug 14 21:13:39 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
> Aug 14 21:13:39 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke_callback: Invoking the PE: query=54, ref=pe_calc-dc-1345004019-22, seq=1, quorate=1
> Aug 14 21:13:39 vm-mgmt-1 crmd: [28547]: info: handle_response: pe_calc calculation pe_calc-dc-1345004018-21 is obsolete
> Aug 14 21:13:39 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...
> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: unpack_rsc_op: Operation mhm_ra_monitor_0 found resource mhm_ra active on vm-mgmt-1
> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: unpack_rsc_op: Ignoring expired failure mhm_ra_monitor_1000 (rc=1, magic=0:1;6:1:0:14c32c8c-ee5e-485c-8b17-9468a96ac127) on vm-mgmt-1
> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: native_print: mhm_ra (ocf::heartbeat:anything): Started vm-mgmt-1
> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: WARN: common_apply_stickiness: Forcing mhm_ra away from vm-mgmt-1 after 1 failures (max=1)
> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: LogActions: Stop mhm_ra (vm-mgmt-1) ***** when the peer is not available, stickiness should not take effect, right? Jane ***
No.
> Aug 14 21:13:40 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process
> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry ****** why does this keep getting logged when stonith is not enabled? Jane ***
Because something is very very wrong. The process should be there
even if stonith is not enabled.
> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: info: unpack_graph: Unpacked transition 4: 2 actions in 2 synapses
> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: info: do_te_invoke: Processing graph 4 (ref=pe_calc-dc-1345004019-22) derived from /var/lib/pengine/pe-input-4.bz2
> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...
>
>
> Thanks,
>
> Jane
>
> -----Original Message-----
> From: Andrew Beekhof [mailto:andrew at beekhof.net]
> Sent: Monday, August 13, 2012 7:48 PM
> To: mario.penners at gmail.com; The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Expired fail-count doesn't get cleaned up.
>
> On Tue, Aug 14, 2012 at 10:15 AM, Mario Penners <mario.penners at gmail.com> wrote:
>> Hi David,
>>
>> I understand the failcount only gets reset after a probe is run. So you
>> need to give it a "crm resource reprobe" for the expiry timer to be
>> evaluated.
>
> Not for recent versions of 1.1
>
>>
>> However, I do NOT know when the probes are run (I only see them in my logs
>> after failover or start/stop actions, not on a regular
>> basis).
>
> When the machine first comes up, and when the admin calls "crm resource reprobe"
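(For completeness, a hedged example of triggering that by hand, using only
commands already named in this thread:

  crm resource reprobe    # re-probe resource state on all nodes
  crm_mon -f              # check whether the fail-count has now been cleared
)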
>
>> So one might have to schedule "crm resource reprobe"
>> as a cron job, or define it as its own resource?
>>
>> Cheers!
>>
>> On Tue, 2012-07-31 at 05:36 -0400, David Coulson wrote:
>>> I'm running RHEL6 with the tech preview of Pacemaker that it ships with. I have
>>> a number of resources with failure-timeout="60", which most of
>>> the time does what it is supposed to do.
>>>
>>> Last night a resource that was part of a clone failed. While the
>>> resource recovered, the fail-count never got cleaned up, and around
>>> every second the DC logged the pengine message below. I manually did a
>>> resource cleanup, and it seems happy now. Is there something I should be
>>> looking for in the logs to indicate that it 'missed' expiring this?
>>>
>>> Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
>>>
>>> Migration summary:
>>> * Node dresproddns01:
>>> re-openfire-lsb:0: migration-threshold=1000000 fail-count=1
>>> last-failure='Mon Jul 30 21:57:53 2012'
>>> * Node dresproddns02:
>>>
>>>
>>> Jul 31 05:32:34 dresproddns02 pengine: [2860]: notice: get_failcount:
>>> Failcount for cl-openfire on dresproddns01 has expired (limit was 60s)
>>>
>>>
_______________________________________________
Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org