[Pacemaker] Expired fail-count doesn't get cleaned up.

Thu Aug 16 23:42:03 UTC 2012

On Thu, Aug 16, 2012 at 6:20 PM, Jane Du (jadu) <jadu at cisco.com> wrote:
> Hi Andrew,
>
> Can you please help to explain these logs:
>
> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: native_print: mhm_ra        (ocf::heartbeat:anything):      Started vm-mgmt-1
> WARN: common_apply_stickiness: Forcing mhm_ra away from vm-mgmt-1 after 1 failures (max=1)
>>> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: LogActions: Stop    mhm_ra  (vm-mgmt-1)
>
> Although the failure count is not expired, it seems that the resource got restarted after 2s. But It is stopped due to stickiness configured.

No, unless it failed with 7 (OCF_STOPPED), failed resources are
considered running until a successful stop operation

>
> When the peer is not available, stickiness should not take effect, right?
>
> Thanks,
>
> Jane
>
> -----Original Message-----
> From: Andrew Beekhof [mailto:andrew at beekhof.net]
> Sent: Wednesday, August 15, 2012 4:26 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Expired fail-count doesn't get cleaned up.
>
> On Thu, Aug 16, 2012 at 9:15 AM, Jane Du (jadu) <jadu at cisco.com> wrote:
>> Hi Andrew,
>>
>> Can you please let me know what the next version number is?
>
> 1.1.8
>
>> What is the link I can find the release information?
>
> It hasn't been released yet
>
>> Thanks,
>> Jane
>>
>>
>> -----Original Message-----
>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
>> Sent: Wednesday, August 15, 2012 3:02 AM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Expired fail-count doesn't get cleaned up.
>>
>> On Wed, Aug 15, 2012 at 3:01 PM, Jane Du (jadu) <jadu at cisco.com> wrote:
>>> Hi Andrew:
>>>
>>> The fail-count is not cleaned up in my setup after 2s.
>>
>> Based on some more information from David, it appears that there was
>> one more bug lurking that prevented cleanup.
>> This has now been fixed for the next version.
>>
>>> This is a two nodes configuration and only one node is bought up.
>>> When the resource is killed, I expect it comes up again after 2s when no peer
>>> is available.
>>> The configuration and log are followed. Please shade some light.
>>>
>>> crm_mon -f
>>> ============
>>> Last updated: Tue Aug 14 21:13:42 2012
>>> Stack: Heartbeat
>>> Current DC: vm-mgmt-1 (8acb717d-2c8b-4484-8165-efeeb75dc949) - partition with qu
>>> orum
>>> Version: 1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
>>> 1 Nodes configured, unknown expected votes
>>> 1 Resources configured.
>>> ============
>>>
>>> Online: [ vm-mgmt-1 ]
>>> Migration summary:
>>> * Node vm-mgmt-1:
>>>    mhm_ra: migration-threshold=1 fail-count=1 last-failure='Tue Aug 14 21:13:37
>>> 2012
>>>
>>>
>>> Crm configure show:
>>> node $id="8acb717d-2c8b-4484-8165-efeeb75dc949" vm-mgmt-1
>>> primitive mhm_ra ocf:heartbeat:anything \
>>>         params binfile="mhm_ra.bin" workdir="/usr/bin/" pidfile="/var/run/mhm_ra.pid" \
>>>         op monitor on-fail="restart" interval="1s"
>>> property $id="cib-bootstrap-options" \
>>>         dc-version="1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
>>>         cluster-infrastructure="Heartbeat" \
>>>         stonith-enabled="false"
>>> rsc_defaults $id="rsc-options" \
>>>         resource-stickiness="100" \
>>>         migration-threshold="1" \
>>>         failure-timeout="2s"
>>>
>>> kill the mhm_ra, after two seconds:
>>>  [root at vm-mgmt-1 ~]# crm_failcount -G -U vm-mgmt-1 -r mhm_ra
>>> scope=status  name=fail-count-mhm_ra value=1
>>> [root at vm-mgmt-1 ~]# crm_failcount -G -U vm-mgmt-1 -r mhm_ra
>>> scope=status  name=fail-count-mhm_ra value=1
>>> [root at vm-mgmt-1 ~]# crm_failcount -G -U vm-mgmt-1 -r mhm_ra
>>> scope=status  name=fail-count-mhm_ra value=1
>>> [root at vm-mgmt-1 ~]#
>>> [root at vm-mgmt-1 ~]# date
>>> Tue Aug 14 21:17:09 PDT 2012
>>> [root at vm-mgmt-1 ~]# crm_failcount -G -U vm-mgmt-1 -r mhm_ra
>>>
>>> scope=status  name=fail-count-mhm_ra value=1
>>> Aug 14 21:13:36 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process
>>> Aug 14 21:13:36 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
>>> Aug 14 21:13:36 vm-mgmt-1 crmd: [28547]: info: process_lrm_event: LRM operation mhm_ra_monitor_1000 (call=3, rc=1, cib-update=51, confirmed=false) unknown error
>>> Aug 14 21:13:36 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...
>>> Aug 14 21:13:37 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process
>>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
>>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: process_graph_event: Detected action mhm_ra_monitor_1000 from a different transition: 1 vs. 2
>>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: abort_transition_graph: process_graph_event:473 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=mhm_ra_monitor_1000, magic=0:1;6:1:0:14c32c8c-ee5e-485c-8b17-9468a96ac127, cib=0.7.26) : Old event
>>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: WARN: update_failcount: Updating failcount for mhm_ra on vm-mgmt-1 after failed monitor: rc=1 (update=value++, time=1345004017)
>>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: find_hash_entry: Creating hash entry for fail-count-mhm_ra
>>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_local_callback: Expanded fail-count-mhm_ra=value++ to 1
>>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-mhm_ra (1)
>>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
>>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke: Query 52: Requesting the current CIB: S_POLICY_ENGINE
>>> Aug 14 21:13:37 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...
>>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_perform_update: Sent update 31: fail-count-mhm_ra=1
>>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: find_hash_entry: Creating hash entry for last-failure-mhm_ra
>>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-mhm_ra (1345004017)
>>> Aug 14 21:13:37 vm-mgmt-1 attrd: [28546]: info: attrd_perform_update: Sent update 34: last-failure-mhm_ra=1345004017
>>> Aug 14 21:13:38 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process
>>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
>>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke_callback: Invoking the PE: query=52, ref=pe_calc-dc-1345004018-21, seq=1, quorate=1
>>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: abort_transition_graph: te_update_diff:149 - Triggered transition abort (complete=1, tag=nvpair, id=status-8acb717d-2c8b-4484-8165-efeeb75dc949-fail-count-mhm_ra, magic=NA, cib=0.7.27) : Transient attribute: update
>>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: abort_transition_graph: te_update_diff:149 - Triggered transition abort (complete=1, tag=nvpair, id=status-8acb717d-2c8b-4484-8165-efeeb75dc949-last-failure-mhm_ra, magic=NA, cib=0.7.28) : Transient attribute: update
>>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke: Query 53: Requesting the current CIB: S_POLICY_ENGINE
>>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke: Query 54: Requesting the current CIB: S_POLICY_ENGINE
>>> Aug 14 21:13:38 vm-mgmt-1 pengine: [21194]: notice: unpack_rsc_op: Operation mhm_ra_monitor_0 found resource mhm_ra active on vm-mgmt-1
>>> Aug 14 21:13:38 vm-mgmt-1 pengine: [21194]: notice: unpack_rsc_op: Ignoring expired failure mhm_ra_monitor_1000 (rc=1, magic=0:1;6:1:0:14c32c8c-ee5e-485c-8b17-9468a96ac127) on vm-mgmt-1
>>> Aug 14 21:13:38 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...
>>> Aug 14 21:13:38 vm-mgmt-1 pengine: [21194]: notice: native_print: mhm_ra        (ocf::heartbeat:anything):      Started vm-mgmt-1
>>> Aug 14 21:13:38 vm-mgmt-1 pengine: [21194]: notice: LogActions: Leave   mhm_ra  (Started vm-mgmt-1)
>>> Aug 14 21:13:39 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process
>>> Aug 14 21:13:39 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
>>> Aug 14 21:13:39 vm-mgmt-1 crmd: [28547]: info: do_pe_invoke_callback: Invoking the PE: query=54, ref=pe_calc-dc-1345004019-22, seq=1, quorate=1
>>> Aug 14 21:13:39 vm-mgmt-1 crmd: [28547]: info: handle_response: pe_calc calculation pe_calc-dc-1345004018-21 is obsolete
>>> Aug 14 21:13:39 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...
>>> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: unpack_rsc_op: Operation mhm_ra_monitor_0 found resource mhm_ra active on vm-mgmt-1
>>> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: unpack_rsc_op: Ignoring expired failure mhm_ra_monitor_1000 (rc=1, magic=0:1;6:1:0:14c32c8c-ee5e-485c-8b17-9468a96ac127) on vm-mgmt-1
>>> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: native_print: mhm_ra        (ocf::heartbeat:anything):      Started vm-mgmt-1
>>> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: WARN: common_apply_stickiness: Forcing mhm_ra away from vm-mgmt-1 after 1 failures (max=1)
>>> Aug 14 21:13:39 vm-mgmt-1 pengine: [21194]: notice: LogActions: Stop    mhm_ra  (vm-mgmt-1)  ***** when the peer is not available, stickiess should not take effect, right? Jane***
>>
>> No.
>>
>>> Aug 14 21:13:40 vm-mgmt-1 lrmd: [28544]: info: RA output: (mhm_ra:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/anything: line 60: kill: (24612) - No such process
>>> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry ****** why this is kept logged when no stonith enabled. Jane ***
>>
>> Because something is very very wrong.  The process should be there
>> even if stonith is not enabled.
>>
>>> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
>>> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: info: unpack_graph: Unpacked transition 4: 2 actions in 2 synapses
>>> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: info: do_te_invoke: Processing graph 4 (ref=pe_calc-dc-1345004019-22) derived from /var/lib/pengine/pe-input-4.bz2
>>> Aug 14 21:13:40 vm-mgmt-1 crmd: [28547]: info: te_connect_stonith: Attempting connection to fencing daemon...
>>>
>>>
>>> Thanks,
>>>
>>> Jane
>>>
>>> -----Original Message-----
>>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
>>> Sent: Monday, August 13, 2012 7:48 PM
>>> To: mario.penners at gmail.com; The Pacemaker cluster resource manager
>>> Subject: Re: [Pacemaker] Expired fail-count doesn't get cleaned up.
>>>
>>> On Tue, Aug 14, 2012 at 10:15 AM, Mario Penners <mario.penners at gmail.com> wrote:
>>>> Hi David,
>>>>
>>>> I understand the failcount only gets reset after a probe is run.  So you
>>>> need to give it a "crm resource reprobe"  for the expiry timer to be
>>>> evaluated.
>>>
>>> Not for recent versions of 1.1
>>>
>>>>
>>>> However: I do NOT know, when the probes are run (I see them in my logs
>>>> only after failover or start/stop actions are taking place, but not on a
>>>> regular basis).
>>>
>>> When the machine first comes up, and when the admin calls "crm resource reprobe"
>>>
>>>> So one might have to schedule the crm resource reprobe
>>>> as a cron or define it as an own resource?
>>>>
>>>> Cheers!
>>>>
>>>> On Tue, 2012-07-31 at 05:36 -0400, David Coulson wrote:
>>>>> I'm running RHEL6 with the tech preview of pacemaker it ships with. I've
>>>>> a number of resources which have a failure-timeout="60", which most of
>>>>> the time does what it is supposed to.
>>>>>
>>>>> Last night a resource failed, which was part of a clone - While the
>>>>> resource recovered, the fail-count log never got cleaned up. Around
>>>>> every second the DC logged the pengine message below. I manually did a
>>>>> resource cleanup, and it seems happy now. Is there something I should be
>>>>> looking for in the logs to indicate that it 'missed' expiring this?
>>>>>
>>>>> Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
>>>>>
>>>>> Migration summary:
>>>>> * Node dresproddns01:
>>>>>     re-openfire-lsb:0: migration-threshold=1000000 fail-count=1
>>>>> last-failure='Mon Jul 30 21:57:53 2012'
>>>>> * Node dresproddns02:
>>>>>
>>>>>
>>>>> Jul 31 05:32:34 dresproddns02 pengine: [2860]: notice: get_failcount:
>>>>> Failcount for cl-openfire on dresproddns01 has expired (limit was 60s)
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>
>>>>> Project Home: http://www.clusterlabs.org
>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>> Bugs: http://bugs.clusterlabs.org
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org