[Pacemaker] Expired fail-count doesn't get cleaned up.

Wed Aug 15 07:01:38 CEST 2012

Hi Andrew:

The fail-count is not cleaned up in my setup after 2s.
This is a two nodes configuration and only one node is bought up.
When the resource is killed, I expect it comes up again after 2s when no peer 
is available. 
The configuration and log are followed. Please shade some light.

crm_mon -f
============
Last updated: Tue Aug 14 21:13:42 2012
Stack: Heartbeat
Current DC: vm-mgmt-1 (8acb717d-2c8b-4484-8165-efeeb75dc949) - partition with qu
orum
Version: 1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
1 Nodes configured, unknown expected votes
1 Resources configured.
============

Online: [ vm-mgmt-1 ]
Migration summary:
* Node vm-mgmt-1:
   mhm_ra: migration-threshold=1 fail-count=1 last-failure='Tue Aug 14 21:13:37
2012

Crm configure show:
node $id="8acb717d-2c8b-4484-8165-efeeb75dc949" vm-mgmt-1
primitive mhm_ra ocf:heartbeat:anything \
        params binfile="mhm_ra.bin" workdir="/usr/bin/" pidfile="/var/run/mhm_ra.pid" \
        op monitor on-fail="restart" interval="1s"
property $id="cib-bootstrap-options" \
        dc-version="1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
        cluster-infrastructure="Heartbeat" \
        stonith-enabled="false"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100" \
        migration-threshold="1" \
        failure-timeout="2s"

kill the mhm_ra, after two seconds:
 [root at vm-mgmt-1 ~]# crm_failcount -G -U vm-mgmt-1 -r mhm_ra
scope=status  name=fail-count-mhm_ra value=1
[root at vm-mgmt-1 ~]# crm_failcount -G -U vm-mgmt-1 -r mhm_ra
scope=status  name=fail-count-mhm_ra value=1
[root at vm-mgmt-1 ~]# crm_failcount -G -U vm-mgmt-1 -r mhm_ra
scope=status  name=fail-count-mhm_ra value=1
[root at vm-mgmt-1 ~]#
[root at vm-mgmt-1 ~]# date
Tue Aug 14 21:17:09 PDT 2012
[root at vm-mgmt-1 ~]# crm_failcount -G -U vm-mgmt-1 -r mhm_ra

scope=status  name=fail-count-mhm_ra value=1
Aug 14 21:13:36 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process 
Aug 14 21:13:36 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
Aug 14 21:13:36 vm-mgmt-1 crmd: [28547]: info: process_lrm_event: LRM operation mhm_ra_monitor_1000 (call=3, rc=1, cib-update=51, confirmed=false) unknown error
Aug 14 21:13:36 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...
Aug 14 21:13:37 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process 
Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: process_graph_event: Detected action mhm_ra_monitor_1000 from a different transition: 1 vs. 2
Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: abort_transition_graph: process_graph_event:473 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=mhm_ra_monitor_1000, magic=0:1;6:1:0:14c32c8c-ee5e-485c-8b17-9468a96ac127, cib=0.7.26) : Old event
Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: WARN: update_failcount: Updating failcount for mhm_ra on vm-mgmt-1 after failed monitor: rc=1 (update=value++, time=1345004017)
Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: find_hash_entry: Creating hash entry for fail-count-mhm_ra
Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_local_callback: Expanded fail-count-mhm_ra=value++ to 1
Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-mhm_ra (1)
Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke: Query 52: Requesting the current CIB: S_POLICY_ENGINE
Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...
Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_perform_update: Sent update 31: fail-count-mhm_ra=1
Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: find_hash_entry: Creating hash entry for last-failure-mhm_ra
Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-mhm_ra (1345004017)
Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_perform_update: Sent update 34: last-failure-mhm_ra=1345004017
Aug 14 21:13:38 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process 
Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke_callback: Invoking the PE: query=52, ref=pe_calc-dc-1345004018-21, seq=1, quorate=1
Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: abort_transition_graph: te_update_diff:149 - Triggered transition abort (complete=1, tag=nvpair, id=status-8acb717d-2c8b-4484-8165-efeeb75dc949-fail-count-mhm_ra, magic=NA, cib=0.7.27) : Transient attribute: update
Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: abort_transition_graph: te_update_diff:149 - Triggered transition abort (complete=1, tag=nvpair, id=status-8acb717d-2c8b-4484-8165-efeeb75dc949-last-failure-mhm_ra, magic=NA, cib=0.7.28) : Transient attribute: update
Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke: Query 53: Requesting the current CIB: S_POLICY_ENGINE
Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke: Query 54: Requesting the current CIB: S_POLICY_ENGINE
Aug 14 21:13:38 vm-mgmt-1 pengine: [21194]: notice: unpack_rsc_op: Operation mhm_ra_monitor_0 found resource mhm_ra active on vm-mgmt-1
Aug 14 21:13:38 vm-mgmt-1 pengine: [21194]: notice: unpack_rsc_op: Ignoring expired failure mhm_ra_monitor_1000 (rc=1, magic=0:1;6:1:0:14c32c8c-ee5e-485c-8b17-9468a96ac127) on vm-mgmt-1
Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...
Aug 14 21:13:38 vm-mgmt-1 pengine: [21194]: notice: native_print: mhm_ra	(ocf::heartbeat:anything):	Started vm-mgmt-1
Aug 14 21:13:38 vm-mgmt-1 pengine: [21194]: notice: LogActions: Leave   mhm_ra	(Started vm-mgmt-1)
Aug 14 21:13:39 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process 
Aug 14 21:13:39 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
Aug 14 21:13:39 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke_callback: Invoking the PE: query=54, ref=pe_calc-dc-1345004019-22, seq=1, quorate=1
Aug 14 21:13:39 vm-mgmt-1 crmd: [28547]: info: handle_response: pe_calc calculation pe_calc-dc-1345004018-21 is obsolete
Aug 14 21:13:39 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...
Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: unpack_rsc_op: Operation mhm_ra_monitor_0 found resource mhm_ra active on vm-mgmt-1
Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: unpack_rsc_op: Ignoring expired failure mhm_ra_monitor_1000 (rc=1, magic=0:1;6:1:0:14c32c8c-ee5e-485c-8b17-9468a96ac127) on vm-mgmt-1
Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: native_print: mhm_ra	(ocf::heartbeat:anything):	Started vm-mgmt-1
Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: WARN: common_apply_stickiness: Forcing mhm_ra away from vm-mgmt-1 after 1 failures (max=1)
Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: LogActions: Stop    mhm_ra	(vm-mgmt-1)  ***** when the peer is not available, stickiess should not take effect, right? Jane***
Aug 14 21:13:40 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process 
Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry ****** why this is kept logged when no stonith enabled. Jane ***
Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: info: unpack_graph: Unpacked transition 4: 2 actions in 2 synapses
Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: info: do_te_invoke: Processing graph 4 (ref=pe_calc-dc-1345004019-22) derived from /var/lib/pengine/pe-input-4.bz2
Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...

Thanks,

Jane

-----Original Message-----
From: Andrew Beekhof [mailto:andrew at beekhof.net] 
Sent: Monday, August 13, 2012 7:48 PM
To: mario.penners at gmail.com; The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Expired fail-count doesn't get cleaned up.

On Tue, Aug 14, 2012 at 10:15 AM, Mario Penners <mario.penners at gmail.com> wrote:
> Hi David,
>
> I understand the failcount only gets reset after a probe is run.  So you
> need to give it a "crm resource reprobe"  for the expiry timer to be
> evaluated.

Not for recent versions of 1.1

>
> However: I do NOT know, when the probes are run (I see them in my logs
> only after failover or start/stop actions are taking place, but not on a
> regular basis).

When the machine first comes up, and when the admin calls "crm resource reprobe"

> So one might have to schedule the crm resource reprobe
> as a cron or define it as an own resource?
>
> Cheers!
>
> On Tue, 2012-07-31 at 05:36 -0400, David Coulson wrote:
>> I'm running RHEL6 with the tech preview of pacemaker it ships with. I've
>> a number of resources which have a failure-timeout="60", which most of
>> the time does what it is supposed to.
>>
>> Last night a resource failed, which was part of a clone - While the
>> resource recovered, the fail-count log never got cleaned up. Around
>> every second the DC logged the pengine message below. I manually did a
>> resource cleanup, and it seems happy now. Is there something I should be
>> looking for in the logs to indicate that it 'missed' expiring this?
>>
>> Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
>>
>> Migration summary:
>> * Node dresproddns01:
>>     re-openfire-lsb:0: migration-threshold=1000000 fail-count=1
>> last-failure='Mon Jul 30 21:57:53 2012'
>> * Node dresproddns02:
>>
>>
>> Jul 31 05:32:34 dresproddns02 pengine: [2860]: notice: get_failcount:
>> Failcount for cl-openfire on dresproddns01 has expired (limit was 60s)
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

_______________________________________________
Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org