[Pacemaker] Pacemaker stop behaviour when underlying resource is unavailable

Andrew Beekhof andrew at beekhof.net
Tue Dec 18 01:18:24 EST 2012


On Tue, Dec 18, 2012 at 4:24 PM, pavan tc <pavan.tc at gmail.com> wrote:
> [..]
>
>
>> > The idea is to make sure that stop does not fail when the underlying
>> > resource goes away.
>> > (Otherwise I see that the resource gets to an unmanaged state)
>> > Also, the expectation is that when the resource comes back, it joins the
>> > cluster without much fuss.
>> >
>> > What I see is that pacemaker calls stop twice
>>
>> That would not be expected. Bug?
>
>
> Are you referring to stop getting called twice?

Correct
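
For reference, OCF semantics expect stop to be idempotent: reporting
success when the resource is already gone is what keeps the node out of
the unmanaged state. A minimal sketch, with hypothetical helper names
(my_res_running, kill_my_res); the OCF_* codes come from ocf-shellfuncs:

    my_res_stop() {
        # Resource already gone: stop must still report success,
        # otherwise Pacemaker treats the stop as failed and the
        # resource ends up unmanaged on this node.
        if ! my_res_running; then
            return $OCF_SUCCESS
        fi
        kill_my_res || return $OCF_ERR_GENERIC
        return $OCF_SUCCESS
    }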

> If yes, I will confirm the behaviour once more and raise a bug.
>
>>
>>
>> > and if it finds that stop returns success, it does not continue with
>> > monitor any more. I also do not see an attempt to start.
>>
>> Anywhere?  Or just on the same node?
>>
>
> On the same node. The resource does get promoted on the other node.
> My expectation was that if I kept returning OCF_NOT_RUNNING from monitor,
> it should attempt a start-stop-monitor cycle until the resource came back.
> It seems this is not what the cluster manager does?

Not always; it very much depends on the constraints you've defined and
on settings like migration-threshold.
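
For illustration only, a sketch of where migration-threshold lives
(resource name and values hypothetical), via the crm shell:

    # migration-threshold caps how many failures are tolerated
    # before the resource is moved away from this node;
    # failure-timeout lets the failcount expire so it can return
    crm configure primitive my_res ocf:pacemaker:Dummy \
        op monitor interval=10s \
        meta migration-threshold=3 failure-timeout=60s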

>
>> >
>> > Is there a way to keep the monitor going in such circumstances?
>>
>> Not really. You can define a recurring monitor for the Stopped role
>> though.
>
>
> I did not want to go there if I could achieve it via the usual mechanisms.

If you want to monitor a resource on a node that it's not running on,
that _is_ the usual mechanism.
The thing is that it's an unusual thing to want to do.
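
A sketch of what that looks like (resource name hypothetical); note
that each monitor operation needs a distinct interval:

    # the extra Stopped-role monitor keeps polling on nodes where
    # the resource is supposed to be stopped
    crm configure primitive my_res ocf:pacemaker:Dummy \
        op monitor interval=10s \
        op monitor interval=11s role=Stopped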

> If that is not possible, I will explore this option in more detail.
>
>> But why would it come back?  You _really_ should not be starting
>> services outside of the cluster - not least of all because we've
>> probably started it somewhere else in the meantime.
>
>
> Even if we started the resource elsewhere, we are running in degraded mode.

Not on the node for which you returned "stopped".
There you are just flat-out not running at all.
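
To make that concrete, a monitor sketch (helper name hypothetical)
showing the return code in question:

    my_res_monitor() {
        if my_res_running; then
            return $OCF_SUCCESS
        fi
        # OCF_NOT_RUNNING tells Pacemaker the resource is cleanly
        # stopped here; without a Stopped-role monitor it will not
        # keep polling this node.
        return $OCF_NOT_RUNNING
    }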

> (My bad, I did not mention this is a _two-node_ multi-state resource).
> We would like to come back to the available mode as early as possible and
> with the least amount of manual intervention with the cluster.

Normally I wouldn't expect any manual intervention either, but I
really can't comment further without seeing logs and configs.
