[ClusterLabs] Pacemaker shows false status of a resource and doesn't react to OCF_NOT_RUNNING rc.

Tue Jan 19 17:02:52 UTC 2016

Just in case, this is the monitor function from the resource agent:
ra_monitor() {
#   ocf_log info "$RA: [monitor]"
    systemctl status ${service}
    rc=$?
    if [ "$rc" -eq "0" ]; then
        return $OCF_SUCCESS
    fi

    ocf_log warn "$RA: [monitor] : got rc=$rc"
    return $OCF_NOT_RUNNING
}
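
For comparison, here is a minimal sketch (not the agent in use here) of a monitor that separates "not running" from other failures. The names status_rc_to_ocf and ra_monitor_sketch are hypothetical, and the mapping assumes that "systemctl status" follows the LSB convention of exit code 3 for a unit that is not running. The fallback values are the standard OCF return codes, used only if the OCF shell functions have not been sourced.

```shell
# Fallbacks for the standard OCF return codes (normally provided by ocf-shellfuncs).
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"
: "${OCF_NOT_RUNNING:=7}"

# Hypothetical helper: translate a `systemctl status` exit code into an OCF code.
status_rc_to_ocf() {
    case "$1" in
        0) echo "$OCF_SUCCESS" ;;       # unit is active
        3) echo "$OCF_NOT_RUNNING" ;;   # LSB "program is not running" (assumed)
        *) echo "$OCF_ERR_GENERIC" ;;   # anything else is reported as a generic error
    esac
}

# Hypothetical monitor built on the helper; ${service} is set by the agent as before.
ra_monitor_sketch() {
    systemctl status "${service}" >/dev/null 2>&1
    rc=$?
    return "$(status_rc_to_ocf "$rc")"
}
```

With this shape, an unexpected exit code from systemctl is reported as an error rather than as a clean "not running". Whether that distinction matters for the behaviour described in this thread is not something the sketch settles.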

Thank you,
Kostia

On Tue, Jan 19, 2016 at 6:30 PM, Kostiantyn Ponomarenko <
konstantin.ponomarenko at gmail.com> wrote:

> The resource that wasn't running, but was reported as running, is
> "adminServer".
>
> Here is a brief chronological description:
>
> [Jan 19 23:42:16] Pacemaker triggers its monitor function for the first
> time at line #1107. (Those log lines come from its Resource Agent.)
> [Jan 19 23:42:16] Then Pacemaker starts the resource - line #1191.
> [Jan 19 11:42:53] The first failure is reported by monitor operation at
> line #1543.
> [Jan 19 11:42:53] The fail-count is set, but I don't see any attempt by
> Pacemaker to "start" the resource - according to the logs, the start
> function is not called - line #1553.
> [Jan 19 12:27:56] Then adminServer's monitor operation keeps returning
> $OCF_NOT_RUNNING - starts at line #1860.
> [Jan 19 12:57:53] Then the expired failcount is cleared at line #1969.
> [Jan 19 12:57:53] Another call of the monitor function happens at line
> #2038.
> [Jan 19 12:57:53] I assume that line #2046 means "not running" (?).
> [Jan 19 12:57:53] The "stop" function is called - line #2150
> [Jan 19 12:57:53] The "start" function is called and the resource is
> successfully started - line #2164
>
>
> The time change occurred while the cluster was starting; I see this from
> "journalctl --since="2016-01-19" --until="2016-01-20"":
>
> Jan 19 23:10:39 A2-2U12-302-LS ntpd[2210]: 0.0.0.0 c61c 0c clock_step -43193.793349 s
> Jan 19 11:10:45 A2-2U12-302-LS ntpd[2210]: 0.0.0.0 c614 04 freq_mode
> Jan 19 11:10:45 A2-2U12-302-LS systemd[1]: Time has been changed
>
> I am attaching corosync.log.
>
> Thank you,
> Kostia
>
> On Tue, Jan 19, 2016 at 5:17 PM, Bogdan Dobrelya <bdobrelia at mirantis.com>
> wrote:
>
>> On 19.01.2016 16:13, Ken Gaillot wrote:
>> > On 01/19/2016 06:49 AM, Kostiantyn Ponomarenko wrote:
>> >> One of the resources in my cluster is not actually running, but "crm_mon"
>> >> shows it with the "Started" status.
>> >> Its resource agent's monitor function returns "$OCF_NOT_RUNNING", but
>> >> Pacemaker doesn't react to this at all - crm_mon shows the resource as
>> >> Started.
>> >> I couldn't find an explanation for this behavior, so I suppose it is a
>> >> bug, is it?
>> >
>> > That is unexpected. Can you post the configuration and logs from around
>> > the time of the issue?
>> >
>>
>> Oh, sorry, I forgot to mention the related thread [0]. That is exactly
>> the case I reported there. It looks the same, so I thought you had just
>> updated my thread :)
>>
>> Perhaps the two threads could be merged.
>>
>> [0] http://clusterlabs.org/pipermail/users/2016-January/002035.html
>>
>> >
>> > _______________________________________________
>> > Users mailing list: Users at clusterlabs.org
>> > http://clusterlabs.org/mailman/listinfo/users
>> >
>> > Project Home: http://www.clusterlabs.org
>> > Getting started:
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> > Bugs: http://bugs.clusterlabs.org
>> >
>>
>>
>> --
>> Best regards,
>> Bogdan Dobrelya,
>> Irc #bogdando
>>
>
>