[Pacemaker] monitor operation stopped running

Fri Dec 17 09:56:53 UTC 2010

On Thu, 16 Dec 2010 08:27:51 +0100, Andrew Beekhof wrote:

> On Wed, Dec 15, 2010 at 8:30 AM, Chris Picton wrote:

>> Why would a resource cleanup remove the resource from the lrm, even
>> though it is still running correctly,
> 
> That's what cleanup does.
> What is supposed to happen next however, is that the cluster runs a
> non-recurring monitor operation to re-determine the current state of the
> cluster and go from there.
> Also, any recurring actions should have been cancelled at the point the
> resource was removed from the lrm.
> 
> What versions of pacemaker and cluster-glue do you have?  Distro?
> 

I am using the ClusterLabs RPMs:
pacemaker-1.0.9.1-1.15.el5
cluster-glue-1.0.6-1.6.el5

I see the following in the output of crm_mon -rf1t (I'm only showing the
resources which have operations with rc != 0):
* Node sbc-tpna2-06.ecntelecoms.za.net:  pingd=100
   megaswitch:5: migration-threshold=1000000
    + (53) probe: last-rc-change='Fri Nov 26 09:17:38 2010' last-run='Fri Nov 26 09:17:38 2010' exec-time=30ms queue-time=0ms rc=1 (unknown error)
    + (55) stop: last-rc-change='Fri Nov 26 09:17:41 2010' last-run='Fri Nov 26 09:17:41 2010' exec-time=20ms queue-time=0ms rc=0 (ok)
    + (56) start: last-rc-change='Fri Nov 26 09:17:42 2010' last-run='Fri Nov 26 09:17:42 2010' exec-time=1040ms queue-time=0ms rc=0 (ok)
    + (57) monitor: interval=8000ms last-rc-change='Fri Nov 26 09:17:44 2010' last-run='Fri Nov 26 09:17:44 2010' exec-time=260ms queue-time=0ms rc=0 (ok)
* Node sbc-tpna2-05.ecntelecoms.za.net:  pingd=100
   megaswitch:4: migration-threshold=1000000
    + (58) probe: last-rc-change='Fri Nov 26 09:17:38 2010' last-run='Fri Nov 26 09:17:38 2010' exec-time=30ms queue-time=0ms rc=1 (unknown error)
    + (60) stop: last-rc-change='Fri Nov 26 09:17:41 2010' last-run='Fri Nov 26 09:17:41 2010' exec-time=20ms queue-time=0ms rc=0 (ok)
    + (61) start: last-rc-change='Fri Nov 26 09:17:42 2010' last-run='Fri Nov 26 09:17:42 2010' exec-time=1040ms queue-time=0ms rc=0 (ok)
    + (62) monitor: interval=8000ms last-rc-change='Fri Nov 26 09:17:44 2010' last-run='Fri Nov 26 09:17:44 2010' exec-time=260ms queue-time=0ms rc=0 (ok)
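
For anyone who wants to pick the non-zero entries out of a longer
operation history, here is a minimal sketch in Python. It assumes the
history has been saved as text with one operation per line (mail clients
tend to wrap these lines), and that the layout matches what my
pacemaker 1.0.9 prints; the regular expressions and the failed_ops()
helper are my own illustration, not part of pacemaker.

```python
import re

# Two nodes' worth of sample lines copied from the history above, unwrapped.
SAMPLE = """\
* Node sbc-tpna2-06.ecntelecoms.za.net:  pingd=100
   megaswitch:5: migration-threshold=1000000
    + (53) probe: last-rc-change='Fri Nov 26 09:17:38 2010' last-run='Fri Nov 26 09:17:38 2010' exec-time=30ms queue-time=0ms rc=1 (unknown error)
    + (55) stop: last-rc-change='Fri Nov 26 09:17:41 2010' last-run='Fri Nov 26 09:17:41 2010' exec-time=20ms queue-time=0ms rc=0 (ok)
* Node sbc-tpna2-05.ecntelecoms.za.net:  pingd=100
   megaswitch:4: migration-threshold=1000000
    + (58) probe: last-rc-change='Fri Nov 26 09:17:38 2010' last-run='Fri Nov 26 09:17:38 2010' exec-time=30ms queue-time=0ms rc=1 (unknown error)
"""

# Assumed line layouts (hypothetical patterns, based only on the output above).
NODE_RE = re.compile(r"^\* Node (\S+):")
RSC_RE = re.compile(r"^\s+(\S+): migration-threshold=")
OP_RE = re.compile(r"^\s+\+ \((\d+)\) (\w+):.*\brc=(\d+) \((.*)\)\s*$")


def failed_ops(text):
    """Return (node, resource, call_id, operation, rc, reason) for rc != 0."""
    node = rsc = None
    failures = []
    for line in text.splitlines():
        m = NODE_RE.match(line)
        if m:
            node = m.group(1)
            continue
        m = RSC_RE.match(line)
        if m:
            rsc = m.group(1)
            continue
        m = OP_RE.match(line)
        if m and int(m.group(3)) != 0:
            failures.append(
                (node, rsc, int(m.group(1)), m.group(2), int(m.group(3)), m.group(4))
            )
    return failures


if __name__ == "__main__":
    for failure in failed_ops(SAMPLE):
        print(failure)
```

On the sample, only the two probe entries (call ids 53 and 58) are
reported; the stop entry with rc=0 is skipped.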

Would the probe operations having rc=1 affect the result of the
'non-recurring monitor operation'?

I am not 100% sure why the errors are there - the log on the server for 
that day shows:
----
Nov 26 09:17:39 sbc-tpna2-06 crmd: [29893]: info: do_lrm_rsc_op: Performing key=36:2184:7:c83a06e0-913e-4546-92e5-19f784dcaf5c op=megaswitch:5_monitor_0 )
Nov 26 09:17:39 sbc-tpna2-06 lrmd: [29890]: info: rsc:megaswitch:5:53: probe
Nov 26 09:17:39 sbc-tpna2-06 lrmd: [29890]: WARN: Managed megaswitch:5:monitor process 24823 exited with return code 1.
Nov 26 09:17:39 sbc-tpna2-06 lrmd: [29890]: WARN: Managed megaswitch:5:monitor process 24823 exited with return code 1.
Nov 26 09:17:39 sbc-tpna2-06 crmd: [29893]: info: process_lrm_event: LRM operation megaswitch:5_monitor_0 (call=53, rc=1, cib-update=68, confirmed=true) unknown error
----

If the failed probes are affecting it, how would I clear them, so that
pacemaker sees everything as OK?

Thanks for the help

Chris