[Pacemaker] Expired fail-count doesn't get cleaned up.

Wed Aug 15 19:26:23 EDT 2012

On Thu, Aug 16, 2012 at 9:15 AM, Jane Du (jadu) <jadu at cisco.com> wrote:
> Hi Andrew,
>
> Can you please let me know what the next version number is?

1.1.8

> What is the link I can find the release information?

It hasn't been released yet

> Thanks,
> Jane
>
>
> -----Original Message-----
> From: Andrew Beekhof [mailto:andrew at beekhof.net]
> Sent: Wednesday, August 15, 2012 3:02 AM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Expired fail-count doesn't get cleaned up.
>
> On Wed, Aug 15, 2012 at 3:01 PM, Jane Du (jadu) <jadu at cisco.com> wrote:
>> Hi Andrew:
>>
>> The fail-count is not cleaned up in my setup after 2s.
>
> Based on some more information from David, it appears that there was
> one more bug lurking that prevented cleanup.
> This has now been fixed for the next version.
>
>> This is a two nodes configuration and only one node is bought up.
>> When the resource is killed, I expect it comes up again after 2s when no peer
>> is available.
>> The configuration and log are followed. Please shade some light.
>>
>> crm_mon -f
>> ============
>> Last updated: Tue Aug 14 21:13:42 2012
>> Stack: Heartbeat
>> Current DC: vm-mgmt-1 (8acb717d-2c8b-4484-8165-efeeb75dc949) - partition with qu
>> orum
>> Version: 1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
>> 1 Nodes configured, unknown expected votes
>> 1 Resources configured.
>> ============
>>
>> Online: [ vm-mgmt-1 ]
>> Migration summary:
>> * Node vm-mgmt-1:
>>    mhm_ra: migration-threshold=1 fail-count=1 last-failure='Tue Aug 14 21:13:37
>> 2012
>>
>>
>> Crm configure show:
>> node $id="8acb717d-2c8b-4484-8165-efeeb75dc949" vm-mgmt-1
>> primitive mhm_ra ocf:heartbeat:anything \
>>         params binfile="mhm_ra.bin" workdir="/usr/bin/" pidfile="/var/run/mhm_ra.pid" \
>>         op monitor on-fail="restart" interval="1s"
>> property $id="cib-bootstrap-options" \
>>         dc-version="1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
>>         cluster-infrastructure="Heartbeat" \
>>         stonith-enabled="false"
>> rsc_defaults $id="rsc-options" \
>>         resource-stickiness="100" \
>>         migration-threshold="1" \
>>         failure-timeout="2s"
>>
>> kill the mhm_ra, after two seconds:
>>  [root at vm-mgmt-1 ~]# crm_failcount -G -U vm-mgmt-1 -r mhm_ra
>> scope=status  name=fail-count-mhm_ra value=1
>> [root at vm-mgmt-1 ~]# crm_failcount -G -U vm-mgmt-1 -r mhm_ra
>> scope=status  name=fail-count-mhm_ra value=1
>> [root at vm-mgmt-1 ~]# crm_failcount -G -U vm-mgmt-1 -r mhm_ra
>> scope=status  name=fail-count-mhm_ra value=1
>> [root at vm-mgmt-1 ~]#
>> [root at vm-mgmt-1 ~]# date
>> Tue Aug 14 21:17:09 PDT 2012
>> [root at vm-mgmt-1 ~]# crm_failcount -G -U vm-mgmt-1 -r mhm_ra
>>
>> scope=status  name=fail-count-mhm_ra value=1
>> Aug 14 21:13:36 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process
>> Aug 14 21:13:36 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
>> Aug 14 21:13:36 vm-mgmt-1 crmd: [28547]: info: process_lrm_event: LRM operation mhm_ra_monitor_1000 (call=3, rc=1, cib-update=51, confirmed=false) unknown error
>> Aug 14 21:13:36 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...
>> Aug 14 21:13:37 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process
>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: process_graph_event: Detected action mhm_ra_monitor_1000 from a different transition: 1 vs. 2
>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: abort_transition_graph: process_graph_event:473 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=mhm_ra_monitor_1000, magic=0:1;6:1:0:14c32c8c-ee5e-485c-8b17-9468a96ac127, cib=0.7.26) : Old event
>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: WARN: update_failcount: Updating failcount for mhm_ra on vm-mgmt-1 after failed monitor: rc=1 (update=value++, time=1345004017)
>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: find_hash_entry: Creating hash entry for fail-count-mhm_ra
>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_local_callback: Expanded fail-count-mhm_ra=value++ to 1
>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-mhm_ra (1)
>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke: Query 52: Requesting the current CIB: S_POLICY_ENGINE
>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...
>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_perform_update: Sent update 31: fail-count-mhm_ra=1
>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: find_hash_entry: Creating hash entry for last-failure-mhm_ra
>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-mhm_ra (1345004017)
>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_perform_update: Sent update 34: last-failure-mhm_ra=1345004017
>> Aug 14 21:13:38 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process
>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke_callback: Invoking the PE: query=52, ref=pe_calc-dc-1345004018-21, seq=1, quorate=1
>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: abort_transition_graph: te_update_diff:149 - Triggered transition abort (complete=1, tag=nvpair, id=status-8acb717d-2c8b-4484-8165-efeeb75dc949-fail-count-mhm_ra, magic=NA, cib=0.7.27) : Transient attribute: update
>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: abort_transition_graph: te_update_diff:149 - Triggered transition abort (complete=1, tag=nvpair, id=status-8acb717d-2c8b-4484-8165-efeeb75dc949-last-failure-mhm_ra, magic=NA, cib=0.7.28) : Transient attribute: update
>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke: Query 53: Requesting the current CIB: S_POLICY_ENGINE
>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke: Query 54: Requesting the current CIB: S_POLICY_ENGINE
>> Aug 14 21:13:38 vm-mgmt-1 pengine: [21194]: notice: unpack_rsc_op: Operation mhm_ra_monitor_0 found resource mhm_ra active on vm-mgmt-1
>> Aug 14 21:13:38 vm-mgmt-1 pengine: [21194]: notice: unpack_rsc_op: Ignoring expired failure mhm_ra_monitor_1000 (rc=1, magic=0:1;6:1:0:14c32c8c-ee5e-485c-8b17-9468a96ac127) on vm-mgmt-1
>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...
>> Aug 14 21:13:38 vm-mgmt-1 pengine: [21194]: notice: native_print: mhm_ra        (ocf::heartbeat:anything):      Started vm-mgmt-1
>> Aug 14 21:13:38 vm-mgmt-1 pengine: [21194]: notice: LogActions: Leave   mhm_ra  (Started vm-mgmt-1)
>> Aug 14 21:13:39 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process
>> Aug 14 21:13:39 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
>> Aug 14 21:13:39 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke_callback: Invoking the PE: query=54, ref=pe_calc-dc-1345004019-22, seq=1, quorate=1
>> Aug 14 21:13:39 vm-mgmt-1 crmd: [28547]: info: handle_response: pe_calc calculation pe_calc-dc-1345004018-21 is obsolete
>> Aug 14 21:13:39 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...
>> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: unpack_rsc_op: Operation mhm_ra_monitor_0 found resource mhm_ra active on vm-mgmt-1
>> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: unpack_rsc_op: Ignoring expired failure mhm_ra_monitor_1000 (rc=1, magic=0:1;6:1:0:14c32c8c-ee5e-485c-8b17-9468a96ac127) on vm-mgmt-1
>> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: native_print: mhm_ra        (ocf::heartbeat:anything):      Started vm-mgmt-1
>> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: WARN: common_apply_stickiness: Forcing mhm_ra away from vm-mgmt-1 after 1 failures (max=1)
>> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: LogActions: Stop    mhm_ra  (vm-mgmt-1)  ***** when the peer is not available, stickiess should not take effect, right? Jane***
>
> No.
>
>> Aug 14 21:13:40 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process
>> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry ****** why this is kept logged when no stonith enabled. Jane ***
>
> Because something is very very wrong.  The process should be there
> even if stonith is not enabled.
>
>> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
>> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: info: unpack_graph: Unpacked transition 4: 2 actions in 2 synapses
>> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: info: do_te_invoke: Processing graph 4 (ref=pe_calc-dc-1345004019-22) derived from /var/lib/pengine/pe-input-4.bz2
>> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...
>>
>>
>> Thanks,
>>
>> Jane
>>
>> -----Original Message-----
>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
>> Sent: Monday, August 13, 2012 7:48 PM
>> To: mario.penners at gmail.com; The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Expired fail-count doesn't get cleaned up.
>>
>> On Tue, Aug 14, 2012 at 10:15 AM, Mario Penners <mario.penners at gmail.com> wrote:
>>> Hi David,
>>>
>>> I understand the failcount only gets reset after a probe is run.  So you
>>> need to give it a "crm resource reprobe"  for the expiry timer to be
>>> evaluated.
>>
>> Not for recent versions of 1.1
>>
>>>
>>> However: I do NOT know, when the probes are run (I see them in my logs
>>> only after failover or start/stop actions are taking place, but not on a
>>> regular basis).
>>
>> When the machine first comes up, and when the admin calls "crm resource reprobe"
>>
>>> So one might have to schedule the crm resource reprobe
>>> as a cron or define it as an own resource?
>>>
>>> Cheers!
>>>
>>> On Tue, 2012-07-31 at 05:36 -0400, David Coulson wrote:
>>>> I'm running RHEL6 with the tech preview of pacemaker it ships with. I've
>>>> a number of resources which have a failure-timeout="60", which most of
>>>> the time does what it is supposed to.
>>>>
>>>> Last night a resource failed, which was part of a clone - While the
>>>> resource recovered, the fail-count log never got cleaned up. Around
>>>> every second the DC logged the pengine message below. I manually did a
>>>> resource cleanup, and it seems happy now. Is there something I should be
>>>> looking for in the logs to indicate that it 'missed' expiring this?
>>>>
>>>> Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
>>>>
>>>> Migration summary:
>>>> * Node dresproddns01:
>>>>     re-openfire-lsb:0: migration-threshold=1000000 fail-count=1
>>>> last-failure='Mon Jul 30 21:57:53 2012'
>>>> * Node dresproddns02:
>>>>
>>>>
>>>> Jul 31 05:32:34 dresproddns02 pengine: [2860]: notice: get_failcount:
>>>> Failcount for cl-openfire on dresproddns01 has expired (limit was 60s)
>>>>
>>>>
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
>>>
>>>
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org