[Pacemaker] Monitor ops do not get cancelled

Tue Sep 28 09:37:02 UTC 2010

On Thu, Sep 23, 2010 at 8:49 PM, Phil Armstrong <pma at sgi.com> wrote:
> I posted earlier asking for help because I had a primitive whose monitor
> operation was not getting canceled at the time that a manual relocation was
> performed. I updated pacemaker (as was suggested) to pacemaker-1.1.2-0.6.1
> which is the latest I could find for an IA64 platform without having to
> build from source. If anyone knows of a later IA64 binary version I would
> appreciate that information.

1.1.3 came out the other day.
Which distro are you using?

>
> The monitor problem persisted after the upgrade, though the error messages I
> was seeing earlier were no longer present. They were apparently unrelated.
> Painful trial and error led me to the conclusion that it was the
> primitive's start-op timeout and monitor-op start-delay values. When I had
> these values set at 480s, the monitor-op did not get canceled for a manual
> relocation and so would get rescheduled after the relocation only to find
> the resource not operational (it had been relocated) and thus set the
> fail-count to non-zero, fencing the resource. If I set the values to 240s,
> everything went smoothly and the monitor-op was canceled.
>
> As a test, I changed a different primitive's values to 480s and that
> primitive then displayed the failing behavior.
>
> If anyone knows why this might be the case (perhaps there are rules I am
> unaware of that prohibit larger values) I would appreciate the information.
> If not, I guess I should file a bug.
>
> Thanks for any help in advance.

Hmmmm, which version of cluster-glue do you have?
This sounds like it might be related to

dejan ()	High: LRM: lrmd: don't allow cancelled operations to get back
to the repeating op list (lf#2417) CS: fc141b7e1e19 On: 2010-06-10

which first appeared in cluster-glue 1.0.6 IIRC
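
A quick way to check an installed version against that release is sketched below. It is only an illustration: the helper names are made up here, the version string is assumed to come from your package manager, and 1.0.6 is the release recalled above (IIRC), not a confirmed boundary.

```python
def parse_version(text):
    """Turn '1.0.5-0.5.1' into (1, 0, 5), ignoring any package release suffix."""
    upstream = text.split("-")[0]
    return tuple(int(part) for part in upstream.split("."))


def has_lrmd_cancel_fix(installed, first_fixed="1.0.6"):
    """True if the installed cluster-glue is at or past the assumed fix release."""
    return parse_version(installed) >= parse_version(first_fixed)


if __name__ == "__main__":
    # Hypothetical version strings, for illustration only.
    for version in ("1.0.5-0.5.1", "1.0.6", "1.0.10"):
        print(version, has_lrmd_cancel_fix(version))
```

Comparing integer tuples rather than raw strings keeps 1.0.10 from sorting before 1.0.6.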