[Pacemaker] monitor operation stopped running

Wed Dec 15 07:30:47 UTC 2010

On Tue, 14 Dec 2010 18:55:06 +0100, Dejan Muhamedagic wrote:

> Hi,
> 
> On Tue, Dec 14, 2010 at 12:16:22PM +0200, Chris Picton wrote:
>> Hi
>> 
>> I have noticed this happening a few times on several of my clusters.
>> The monitor operation for some resources stops running, and thus
>> resource failures are not detected.  If I edit the cib, and change
>> something regarding the resource (generally I change the monitor
>> interval), the resource starts monitoring again, detects the failure
>> and restarts correctly.
>> 
>> I am using pacemaker 1.0.9 live, and 1.0.10 in test.
>> 
>> This has happened with both clone and non-clone resources.
>> 
>> I have attached a log which shows the behaviour.  I have a resource
>> (megaswitch) running cloned over 6 nodes.
>> 
>> Until 06:48:22, the monitor is running correctly (the app logs the
>> "Deleting context for MONTEST-" line when the monitor is run).  After
>> that, the monitor is not run again on this node.
>> 
>> I have the logs for the other nodes, if they are needed to try and
>> debug this.
> 
> Nov 28 06:48:26 sbc-tpna2-01 crmd: [4863]: info: do_lrm_invoke: Removing resource megaswitch:3 from the LRM
> Nov 28 06:48:26 sbc-tpna2-01 crmd: [4863]: info: do_lrm_invoke: Resource 'megaswitch:3' deleted for 19511_crm_resource on sbc-tpna2-06.ecntelecoms.za.net
> Nov 28 06:48:26 sbc-tpna2-01 crmd: [4863]: info: notify_deleted: Notifying 19511_crm_resource on sbc-tpna2-06.ecntelecoms.za.net that megaswitch:3 was deleted
> 
> Somebody/something on sbc-tpna2-06.ecntelecoms.za.net ran crm_resource
> (or perhaps the crm shell) and removed megaswitch from LRM. Any
> suspicious cron jobs over there?

on sbc-tpna2-06
---------------
Nov 28 06:48:19 sbc-tpna2-06 crm_resource: [19476]: info: Invoked: 
crm_resource -C -r group_megaswitch:0 -H sbc-tpna2-01.ecntelecoms.za.net 
Nov 28 06:48:21 sbc-tpna2-06 crm_resource: [19482]: info: Invoked: 
crm_resource -C -r group_megaswitch:1 -H sbc-tpna2-01.ecntelecoms.za.net 
Nov 28 06:48:24 sbc-tpna2-06 crm_resource: [19506]: info: Invoked: 
crm_resource -C -r group_megaswitch:2 -H sbc-tpna2-01.ecntelecoms.za.net 
Nov 28 06:48:24 sbc-tpna2-06 crmd: [29893]: ERROR: send_msg_via_ipc: 
Unknown Sub-system (19482_crm_resource)... discarding message.
Nov 28 06:48:24 sbc-tpna2-06 crmd: [29893]: ERROR: send_msg_via_ipc: 
Unknown Sub-system (19482_crm_resource)... discarding message.
Nov 28 06:48:26 sbc-tpna2-06 crm_resource: [19511]: info: Invoked: 
crm_resource -C -r group_megaswitch:3 -H sbc-tpna2-01.ecntelecoms.za.net 
Nov 28 06:48:27 sbc-tpna2-06 cib: [19512]: info: write_cib_contents: 
Archived previous version as /var/lib/heartbeat/crm/cib-21.raw
Nov 28 06:48:27 sbc-tpna2-06 cib: [19512]: info: write_cib_contents: 
Wrote version 0.232.0 of the CIB to disk (digest: 
6aaa4d35d37a179b8f42c7045220690a)
Nov 28 06:48:27 sbc-tpna2-06 cib: [19512]: info: retrieveCib: Reading 
cluster configuration from: /var/lib/heartbeat/crm/cib.tmgWhm (digest: 
/var/lib/heartbeat/crm/cib.NqXOtl)
Nov 28 06:48:27 sbc-tpna2-06 cib: [29889]: info: Managed 
write_cib_contents process 19512 exited with return code 0.
Nov 28 06:48:27 sbc-tpna2-06 attrd: [29892]: info: attrd_ha_callback: 
flush message from sbc-tpna2-01.ecntelecoms.za.net
Nov 28 06:48:27 sbc-tpna2-06 cib: [19527]: info: write_cib_contents: 
Archived previous version as /var/lib/heartbeat/crm/cib-22.raw
Nov 28 06:48:27 sbc-tpna2-06 cib: [19527]: info: write_cib_contents: 
Wrote version 0.233.0 of the CIB to disk (digest: 
8e39a0b125878ab28f8bed81789f5a59)
Nov 28 06:48:27 sbc-tpna2-06 cib: [19527]: info: retrieveCib: Reading 
cluster configuration from: /var/lib/heartbeat/crm/cib.mwt8EZ (digest: 
/var/lib/heartbeat/crm/cib.hZ74d0)
Nov 28 06:48:27 sbc-tpna2-06 cib: [29889]: info: Managed 
write_cib_contents process 19527 exited with return code 0.
Nov 28 06:48:28 sbc-tpna2-06 crm_resource: [19528]: info: Invoked: 
crm_resource -C -r group_megaswitch:4 -H sbc-tpna2-01.ecntelecoms.za.net 
Nov 28 06:48:30 sbc-tpna2-06 crm_resource: [19534]: info: Invoked: 
crm_resource -C -r group_megaswitch:5 -H sbc-tpna2-01.ecntelecoms.za.net 


It looks like a 'crm resource cleanup megaswitch-clone' command was 
executed.
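
To see when and from which node such cleanups were issued, the syslog can be scanned for 'crm_resource -C' invocations. The following is only a minimal sketch: it assumes each syslog entry sits on a single line (the entries above are wrapped by the mail client) and that the format matches the "Invoked:" lines shown above. The name find_cleanups is made up for this example and is not part of Pacemaker.

```python
import re

# Matches an unwrapped syslog entry such as:
#   Nov 28 06:48:19 sbc-tpna2-06 crm_resource: [19476]: info: Invoked: crm_resource -C -r group_megaswitch:0 -H <node>
INVOKED_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) crm_resource: "
    r"\[(?P<pid>\d+)\]: info: Invoked: (?P<cmd>crm_resource\b.*)$"
)


def _value_after(args, flag):
    """Return the argument following `flag`, or None if it is absent."""
    if flag in args:
        pos = args.index(flag) + 1
        if pos < len(args):
            return args[pos]
    return None


def find_cleanups(lines):
    """Return (timestamp, host, pid, resource, target_node) per 'crm_resource -C' call."""
    hits = []
    for line in lines:
        match = INVOKED_RE.match(line.strip())
        if not match:
            continue
        args = match.group("cmd").split()
        if "-C" not in args:
            continue  # some other crm_resource invocation, not a cleanup
        hits.append((
            match.group("stamp"),
            match.group("host"),
            int(match.group("pid")),
            _value_after(args, "-r"),
            _value_after(args, "-H"),
        ))
    return hits
```

Feeding it the lines of a node's syslog file (the path depends on the distribution) gives one tuple per cleanup call. The PID in each tuple can then be matched against the "deleted for <pid>_crm_resource" entries that crmd logs on the target node.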

The other nodes all log similar entries:
---
sbc-tpna2-05.ecntelecoms.za.net.16.small:Nov 28 06:49:17 sbc-tpna2-05 
crmd: [30350]: info: do_lrm_invoke: Removing resource megaswitch:4 from 
the LRM
sbc-tpna2-05.ecntelecoms.za.net.16.small-Nov 28 06:49:17 sbc-tpna2-05 
crmd: [30350]: info: do_lrm_invoke: Resource 'megaswitch:4' deleted for 
19697_crm_resource on sbc-tpna2-06.ecntelecoms.za.net
sbc-tpna2-05.ecntelecoms.za.net.16.small-Nov 28 06:49:17 sbc-tpna2-05 
crmd: [30350]: info: notify_deleted: Notifying 19697_crm_resource on 
sbc-tpna2-06.ecntelecoms.za.net that megaswitch:4 was deleted
--


So I have 2 questions:
1) Why would a resource cleanup remove the resource from the LRM, even 
though it is still running correctly and the monitor operations are 
succeeding?
2) How can I programmatically detect and fix this state, so I can get a 
cron job in place for now to 'fix' it?
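
For question 2, one possible detection check is sketched below. It is not a confirmed method and rests on assumptions: that the status section returned by 'cibadmin -Q' holds one lrm_resource element per resource under each node_state, and that its lrm_rsc_op children carry operation, interval, call-id and rc-code attributes. The function names are made up for this example. A resource is reported when its most recent start or probe succeeded but no recurring monitor is recorded for it.

```python
import subprocess
import xml.etree.ElementTree as ET


def _as_int(value, default):
    """Parse an integer attribute, falling back to `default` if missing or malformed."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default


def resources_without_monitor(cib_xml):
    """Return sorted (node, resource) pairs that look active but have no recurring monitor."""
    root = ET.fromstring(cib_xml)
    missing = []
    for node_state in root.iter("node_state"):
        node = node_state.get("uname") or node_state.get("id") or "?"
        for rsc in node_state.iter("lrm_resource"):
            ops = rsc.findall("lrm_rsc_op")
            # Recurring monitors have a non-zero interval (assumed attribute name).
            recurring = [op for op in ops
                         if op.get("operation") == "monitor"
                         and _as_int(op.get("interval"), 0) > 0]
            # start, stop and probe (monitor with interval 0) decide whether
            # the cluster believes the resource is running on this node.
            state_ops = [op for op in ops
                         if op.get("operation") in ("start", "stop")
                         or (op.get("operation") == "monitor"
                             and _as_int(op.get("interval"), 0) == 0)]
            if not state_ops:
                continue
            last = max(state_ops, key=lambda op: _as_int(op.get("call-id"), -1))
            active = last.get("operation") != "stop" and last.get("rc-code") == "0"
            if active and not recurring:
                missing.append((node, rsc.get("id")))
    return sorted(missing)


def fetch_cib():
    """Query the live CIB; must run on a cluster node with cibadmin available."""
    return subprocess.check_output(["cibadmin", "-Q"]).decode()
```

From cron, resources_without_monitor(fetch_cib()) would list the candidates. A resource with no monitor operation configured at all is reported as well, so the result needs filtering against the configuration. The sketch only detects the state. The only remedy known from this thread is the CIB edit described at the top (changing the monitor interval).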

Thanks for the help

Chris