[Pacemaker] monitor on-fail=ignore not restarting when resource reported as stopped

Fri Dec 6 11:21:02 EST 2013

 

------------------------------------------------------------------------
*From: *Michael Schwartzkopff <ms at sys4.de>
*Sent: * 2013-12-06 11:16:17 E
*To: *pacemaker at oss.clusterlabs.org
*Subject: *Re: [Pacemaker] monitor on-fail=ignore not restarting when
resource reported as stopped

> On Friday, 6 December 2013, 11:02:11, you wrote:
>> ------------------------------------------------------------------------
>> *From: *Michael Schwartzkopff <ms at sys4.de>
>> *Sent: * 2013-12-06 10:50:19 E
>> *To: *The Pacemaker cluster resource manager <pacemaker at oss.clusterlabs.org>
>> *Subject: *Re: [Pacemaker] monitor on-fail=ignore not restarting when
>> resource reported as stopped
>>
>>> On Friday, 6 December 2013, 10:11:07, Patrick Hemmer wrote:
>>>> I have a resource which updates DNS records (Amazon's Route53). When it
>>>> performs its `monitor` action, it can sometimes fail because of issues
>>>> with Amazon's API. So I want failures to be ignored for the monitor
>>>> action, and so I set `op monitor on-fail=ignore`. However now when the
>>>> monitor action comes back as 'stopped', pacemaker does nothing. In my
>>>> opinion a "stopped" return code should not be a failure condition, and
>>>> thus the `on-fail=ignore` should not apply. It basically makes the
>>>> monitor option completely useless. It won't do anything on failure, it
>>>> won't do anything on stopped, so you might as well not have a monitor
>>>> action at all.
>>>>
>>>> If this is a bug I can create a bug report, just not sure if this is
>>>> deliberate or not.
>>> This is not a bug but expected behaviour. A monitoring operation for a
>>> started resource interprets everything besides "Started" as a failure,
>>> including the case where your resource is stopped.
>>>
>>> And you told the resource to ignore failures.
>>>
>>> It would be better to improve your resource agent to detect error
>>> conditions. It could read the state it should be in from pacemaker and
>>> compare it with the reality.
>> It does detect the error condition. It then returns with
>> $OCF_ERR_GENERIC. This is the only possible way to respond. It's also
>> the right way. If the script got an error trying to query the status, it
>> doesn't know if it's really running or not. If it's not running,
>> returning $OCF_SUCCESS would be a lie. If it is running, returning
>> $OCF_NOT_RUNNING would be a lie.
>>
>> The monitor action can also be called by pacemaker even when the
>> resource is not running (i.e., prior to starting it, or when pacemaker
>> first starts up). Thus returning $OCF_SUCCESS on error is not appropriate.
> So where is the problem? If the script returns "ERROR" then pacemaker has to
> act accordingly.
If the script returns "ERROR", `on-fail=ignore` should make pacemaker do
nothing. Amazon's API failed, and we just need to retry later.
If the script returns "STOPPED", this isn't an error. The script queried
the resource, found it was stopped, and reported it as stopped.
Pacemaker should act accordingly and start it back up.
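
To make the three outcomes concrete, here is a minimal sketch of the kind
of monitor logic I mean. It is not my actual agent: `route53_monitor` and
`query_dns_record` are hypothetical names, and the Route53 call is stubbed
out with the made-up variables `FAKE_API_FAIL` and `FAKE_RECORD_VALUE` so
the snippet runs on its own. The numeric return codes are the standard OCF
values that ocf-shellfuncs normally provides.

```shell
#!/bin/sh
# Standard OCF return codes; a real agent would source ocf-shellfuncs
# instead of defining them here.
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"
: "${OCF_NOT_RUNNING:=7}"

# Hypothetical stand-in for the Route53 lookup.
# Prints the record's current value (nothing if the record is absent)
# and exits non-zero only when the API call itself failed.
query_dns_record() {
    if [ "${FAKE_API_FAIL:-0}" = "1" ]; then
        return 1
    fi
    printf '%s\n' "${FAKE_RECORD_VALUE:-}"
}

# Monitor action: $1 is the value the DNS record should have.
route53_monitor() {
    expected="$1"
    if ! actual=$(query_dns_record); then
        # The API failed, so the real state is unknown.
        return "$OCF_ERR_GENERIC"
    fi
    if [ "$actual" = "$expected" ]; then
        # The record is present and correct: the resource is running.
        return "$OCF_SUCCESS"
    fi
    # The query worked and the record is missing or wrong: cleanly stopped.
    return "$OCF_NOT_RUNNING"
}
```

With `on-fail=ignore` I would expect only the first case (the generic
error) to be ignored, and the last case to trigger a start.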

>
> Kind regards,
>
> Michael Schwartzkopff
