[Pacemaker] Cluster frozen after "crm resource cleanup"

Tue Jun 8 15:28:20 UTC 2010

This morning, I wanted to do a "cleanup" on a "ping" resource (which
at the time was in a "started" state but had a fail-count of 3. After
that operation, the cluster didn't do any monitor operations and
refused to do anything else. Below is a short excerpt of the messages
file from the first node, the full logs are available as:

http://www.incertum.net/~cite/messages.mudslide1
http://www.incertum.net/~cite/messages.mudslide2

What happened there, and how do I recover?

Cheers
Stefan

Jun  8 09:14:12 mudslide1 crm_resource: [9954]: info: Invoked: crm_resource --meta -r cl-GW_ping -g clone-max 
Jun  8 09:14:12 mudslide1 cibadmin: [9955]: info: Invoked: cibadmin -Ql -o nodes 
Jun  8 09:14:12 mudslide1 crm_resource: [9956]: info: Invoked: crm_resource -C -r prim-GW_ping:0 -H mudslide1 
Jun  8 09:14:12 mudslide1 crmd: [1448]: info: do_lrm_invoke: Removing resource prim-GW_ping:0 from the LRM
Jun  8 09:14:12 mudslide1 crmd: [1448]: info: send_direct_ack: ACK'ing resource op prim-GW_ping:0_delete_60000 from 0:0:crm-resource-9956: lrm_invoke-lrmd-1275981252-56
Jun  8 09:14:12 mudslide1 crmd: [1448]: info: lrm_remove_deleted_op: Removing op prim-GW_ping:0_monitor_20000:287 for deleted resource prim-GW_ping:0
Jun  8 09:14:12 mudslide1 cib: [9958]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-63.raw
Jun  8 09:14:12 mudslide1 cib: [9958]: info: write_cib_contents: Wrote version 0.68.0 of the CIB to disk (digest: 386d9b04ef43976a2041310aed2ad2c5)
Jun  8 09:14:12 mudslide1 cib: [9958]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.ctdEhc (digest: /var/lib/heartbeat/crm/cib.g5GfPa)
Jun  8 09:14:12 mudslide1 crmd: [1448]: info: do_lrm_rsc_op: Performing key=30:607:7:55efa5bc-7399-4a63-94b5-51e3f6f583d4 op=prim-GW_ping:0_monitor_0 )
Jun  8 09:14:12 mudslide1 lrmd: [1445]: notice: on_msg_perform_op: resource prim-GW_ping:0 is frozen, no ops can run.
Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: do_lrm_rsc_op: Operation monitor on prim-GW_ping:0 failed: -1
Jun  8 09:14:12 mudslide1 crmd: [1448]: WARN: do_log: FSA: Input I_FAIL from do_lrm_rsc_op() received in state S_NOT_DC
Jun  8 09:14:12 mudslide1 crmd: [1448]: info: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_FAIL cause=C_FSA_INTERNAL origin=do_lrm_rsc_op ]
Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Jun  8 09:14:12 mudslide1 crmd: [1448]: info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: verify_stopped: Resource prim-MySQL_db_test was active at shutdown.  You may ignore this error if it is unmanaged.
Jun  8 09:14:12 mudslide1 crmd: [1448]: notice: ghash_print_pending_for_rsc: Recurring action prim-MySQL_db_test:312 (prim-MySQL_db_test_monitor_60000) incomplete at shutdown
Jun  8 09:14:12 mudslide1 crmd: [1448]: notice: ghash_print_pending_for_rsc: Recurring action prim-MySQL_db_test:313 (prim-MySQL_db_test_monitor_120000) incomplete at shutdown
Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: verify_stopped: Resource prim-MySQL_db_syslog was active at shutdown.  You may ignore this error if it is unmanaged.
Jun  8 09:14:12 mudslide1 crmd: [1448]: notice: ghash_print_pending_for_rsc: Recurring action prim-MySQL_db_syslog:310 (prim-MySQL_db_syslog_monitor_60000) incomplete at shutdown
Jun  8 09:14:12 mudslide1 crmd: [1448]: notice: ghash_print_pending_for_rsc: Recurring action prim-MySQL_db_syslog:311 (prim-MySQL_db_syslog_monitor_120000) incomplete at shutdown
Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: verify_stopped: Resource prim-MySQL_db_nagios was active at shutdown.  You may ignore this error if it is unmanaged.
Jun  8 09:14:12 mudslide1 crmd: [1448]: notice: ghash_print_pending_for_rsc: Recurring action prim-MySQL_db_nagios:316 (prim-MySQL_db_nagios_monitor_60000) incomplete at shutdown
Jun  8 09:14:12 mudslide1 crmd: [1448]: notice: ghash_print_pending_for_rsc: Recurring action prim-MySQL_db_nagios:317 (prim-MySQL_db_nagios_monitor_120000) incomplete at shutdown
Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: verify_stopped: Resource prim-MySQL_db_fax was active at shutdown.  You may ignore this error if it is unmanaged.
Jun  8 09:14:12 mudslide1 crmd: [1448]: notice: ghash_print_pending_for_rsc: Recurring action prim-MySQL_db_fax:319 (prim-MySQL_db_fax_monitor_120000) incomplete at shutdown
Jun  8 09:14:12 mudslide1 crmd: [1448]: notice: ghash_print_pending_for_rsc: Recurring action prim-MySQL_db_fax:318 (prim-MySQL_db_fax_monitor_60000) incomplete at shutdown
Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: verify_stopped: Resource prim-DRBD_r0:0 was active at shutdown.  You may ignore this error if it is unmanaged.
Jun  8 09:14:12 mudslide1 crmd: [1448]: notice: ghash_print_pending_for_rsc: Recurring action prim-DRBD_r0:0:303 (prim-DRBD_r0:0_monitor_29000) incomplete at shutdown
Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: verify_stopped: Resource prim-MySQL_db_wiki was active at shutdown.  You may ignore this error if it is unmanaged.
Jun  8 09:14:12 mudslide1 crmd: [1448]: notice: ghash_print_pending_for_rsc: Recurring action prim-MySQL_db_wiki:314 (prim-MySQL_db_wiki_monitor_60000) incomplete at shutdown
Jun  8 09:14:12 mudslide1 crmd: [1448]: notice: ghash_print_pending_for_rsc: Recurring action prim-MySQL_db_wiki:315 (prim-MySQL_db_wiki_monitor_120000) incomplete at shutdown
Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: verify_stopped: Resource prim-GW_ping:0 was active at shutdown.  You may ignore this error if it is unmanaged.
Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: verify_stopped: Resource prim-Failover_IP was active at shutdown.  You may ignore this error if it is unmanaged.
Jun  8 09:14:12 mudslide1 crmd: [1448]: notice: ghash_print_pending_for_rsc: Recurring action prim-Failover_IP:304 (prim-Failover_IP_monitor_60000) incomplete at shutdown
Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: verify_stopped: Resource prim-FS_shared was active at shutdown.  You may ignore this error if it is unmanaged.
Jun  8 09:14:12 mudslide1 crmd: [1448]: info: do_lrm_control: Disconnected from the LRM
Jun  8 09:14:12 mudslide1 crmd: [1448]: info: do_ha_control: Disconnected from OpenAIS
Jun  8 09:14:12 mudslide1 crmd: [1448]: info: do_cib_control: Disconnecting CIB
Jun  8 09:14:12 mudslide1 crmd: [1448]: info: crmd_cib_connection_destroy: Connection to the CIB terminated...
Jun  8 09:14:12 mudslide1 crmd: [1448]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: do_exit: Could not recover from internal error
Jun  8 09:14:12 mudslide1 crmd: [1448]: info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
Jun  8 09:14:12 mudslide1 crmd: [1448]: info: do_exit: [crmd] stopped (2)
Jun  8 09:14:12 mudslide1 cib: [1444]: WARN: send_ipc_message: IPC Channel to 1448 is not connected
Jun  8 09:14:12 mudslide1 cib: [1444]: WARN: send_via_callback_channel: Delivery of reply to client 1448/36aac1d1-2aac-478d-accf-7f224b16f451 failed
Jun  8 09:14:12 mudslide1 cib: [1444]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
Jun  8 09:14:12 mudslide1 corosync[1433]:   [pcmk  ] info: pcmk_ipc_exit: Client crmd (conn=0x95a5008, async-conn=0x95a5008) left
Jun  8 09:14:13 mudslide1 corosync[1433]:   [pcmk  ] ERROR: pcmk_wait_dispatch: Child process crmd exited (pid=1448, rc=2)
Jun  8 09:14:13 mudslide1 corosync[1433]:   [pcmk  ] notice: pcmk_wait_dispatch: Respawning failed child process: crmd
Jun  8 09:14:13 mudslide1 corosync[1433]:   [pcmk  ] info: spawn_child: Forked child 9985 for process crmd
Jun  8 09:14:13 mudslide1 crmd: [9985]: info: Invoked: /usr/lib/heartbeat/crmd 
Jun  8 09:14:13 mudslide1 crmd: [9985]: info: main: CRM Hg Version: d3fa20fc76c7947d6de66db7e52526dc6bd7d782
Jun  8 09:14:13 mudslide1 crmd: [9985]: info: crmd_init: Starting crmd
Jun  8 09:14:13 mudslide1 crmd: [9985]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Jun  8 09:14:13 mudslide1 crmd: [9985]: info: do_cib_control: CIB connection established
Jun  8 09:14:13 mudslide1 crmd: [9985]: info: crm_cluster_connect: Connecting to OpenAIS
Jun  8 09:14:13 mudslide1 crmd: [9985]: info: init_ais_connection: Creating connection to our AIS plugin
Jun  8 09:14:13 mudslide1 crmd: [9985]: info: init_ais_connection: AIS connection established
Jun  8 09:14:13 mudslide1 corosync[1433]:   [pcmk  ] info: pcmk_ipc: Recorded connection 0x95a5008 for crmd/9985
Jun  8 09:14:13 mudslide1 corosync[1433]:   [pcmk  ] info: pcmk_ipc: Sending membership update 24 to crmd