[ClusterLabs] Failcount not resetting to zero after failure-timeout

Pritam Kharat pritam.kharat at oneconvergence.com
Thu Nov 26 04:26:09 UTC 2015


Hi Guys,

We are facing this issue repeatedly. The fail count is not being reset to
zero, and because of this some of the resources are not being started on any
node. Could someone please tell us what the cause might be?
Help is appreciated :-)
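For reference, a minimal sketch of the configuration described below and of
inspecting/clearing the failcount by hand (resource and node names are taken
from the logs quoted further down; the crmsh/crm_failcount/crm_resource
invocations are standard Pacemaker CLI, but exact option names vary between
Pacemaker versions, so treat this as an assumption to adapt, not our exact
commands):

```shell
# Sketch, assuming a Pacemaker 1.1-era cluster with the crmsh shell.
# Resource/node names below come from the quoted logs in this thread.

# Meta attributes as described in the thread: migrate away after 5
# failures, and let the failcount expire after 120 seconds.
crm configure primitive oc-service-manager upstart:oc-service-manager \
    meta migration-threshold=5 failure-timeout=120s

# Query the current failcount for the resource on a given node.
crm_failcount --query --resource oc-service-manager --node sc-node-1

# Workaround while the automatic expiry is not firing: manually clear
# the failcount and failed-operation history for the resource.
crm_resource --cleanup --resource oc-service-manager --node sc-node-1
```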

On Mon, Nov 23, 2015 at 11:06 AM, Pritam Kharat <
pritam.kharat at oneconvergence.com> wrote:

> Could someone please reply?
>
> On Thu, Nov 19, 2015 at 10:28 PM, Pritam Kharat <
> pritam.kharat at oneconvergence.com> wrote:
>
>>
>> Hi All,
>>
>> I have a 2-node HA setup. I have added migration_threshold=5 and
>> failure-timeout=120s for my resources. When the migration threshold of 5 is
>> reached, the resources are migrated to the other node. But we once observed
>> that the fail-count was not reset to zero after 2 minutes. The setup stayed
>> in the same state for almost 3 hours, but the fail-count still did not reset
>> to zero.
>>
>> Then I tried the same test again but could not reproduce it. When I
>> compared the logs of the success scenario with those of the failed
>> scenario, I found that pengine did not take any action to clear the
>> failcount.
>>
>>
>>
>> Success logs
>> *Nov 19 15:27:08 [16409] sc-node-1    pengine:   notice: unpack_rsc_op:
>>  Clearing expired failcount for oc-service-manager on sc-node-1*
>> Nov 19 15:27:08 [16409] sc-node-1    pengine:     info: get_failcount_full:
>>     oc-service-manager has failed 5 times on sc-node-1
>> Nov 19 15:27:08 [16409] sc-node-1    pengine:   notice: unpack_rsc_op:
>>  Clearing expired failcount for oc-service-manager on sc-node-1
>> Nov 19 15:27:08 [16409] sc-node-1    pengine:   notice: unpack_rsc_op:
>>  Re-initiated expired calculated failure oc-service-manager_last_failure_0
>> (rc=7, magic=0:7;3:145:0:258ae879-832f-4126-a7d7-e57bd3fdcdb1) on
>> sc-node-1
>>
>>
>> Failure logs
>> Nov 04 22:23:39 [6831] sc-HA2    pengine:  warning: unpack_rsc_op:
>>  Processing failed op monitor for oc-service-manager on sc-HA1: not
>> running (7)
>> Nov 04 22:23:39 [6831] sc-HA2    pengine:     info: native_print:
>> oc-service-manager      (upstart:oc-service-manager):   Started sc-HA2
>> *Nov 04 22:23:39 [6831] sc-HA2    pengine:     info: get_failcount_full:
>>         oc-service-manager has failed 5 times on sc-HA1*
>> Nov 04 22:23:39 [6831] sc-HA2    pengine:  warning: common_apply_stickiness:
>>    Forcing oc-service-manager away from sc-HA1 after 5 failures (max=5)
>> Nov 04 22:23:39 [6831] sc-HA2    pengine:     info: rsc_merge_weights:
>>  oc-service-manager: Rolling back scores from oc-fw-agent
>> Nov 04 22:23:39 [6831] sc-HA2    pengine:     info: LogActions:
>> Leave   oc-service-manager      (Started sc-HA2)
>>
>>
>> What might be the reason that, in the failure case, this action did not
>> take place?
>> *notice: unpack_rsc_op:  Clearing expired failcount for
>> oc-service-manager *
>>
>>
>> --
>> Thanks and Regards,
>> Pritam Kharat.
>>
>
>
>
> --
> Thanks and Regards,
> Pritam Kharat.
>



-- 
Thanks and Regards,
Pritam Kharat.