[Pacemaker] Cluster frozen after "crm resource cleanup"

Dejan Muhamedagic dejanmm at fastmail.fm
Tue Jun 8 12:17:18 EDT 2010


Hi,

On Tue, Jun 08, 2010 at 05:28:20PM +0200, Stefan Foerster wrote:
> This morning, I wanted to do a "cleanup" on a "ping" resource (which
> at the time was in a "started" state but had a fail-count of 3). After
> that operation, the cluster didn't do any monitor operations and
> refused to do anything else. Below is a short excerpt of the messages
> file from the first node, the full logs are available as:
> 
> http://www.incertum.net/~cite/messages.mudslide1
> http://www.incertum.net/~cite/messages.mudslide2
> 
> What happened there,

This seems to be the same (or a similar) issue as the one
reported just a few days ago here:
http://www.gossamer-threads.com/lists/linuxha/pacemaker/64628

In this case crmd restarted as well. The other node had problems
with the bad input:

Jun  8 09:14:14 mudslide2 crmd: [4501]: WARN: log_data_element: do_lrm_invoke: bad input <create_request_adv origin="send_lrm_rsc_op" t="crmd" version="3.0.1" subt="request" reference="lrm_delete-crm_resource-1275981254-1" crm_task="lrm_delete" crm_sys_to="lrmd" crm_sys_from="9993_crm_resource" crm_host_to="mudslide2" src="mudslide1" seq="4" >

Please create an hb_report for this incident and open a bugzilla.
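An invocation along these lines should capture the relevant period
(the time window and the destination name below are only
placeholders, adjust them to the incident):

    # collect logs, CIB and PE inputs from all reachable nodes
    # for the incident window
    hb_report -f "2010/06/08 09:00" -t "2010/06/08 10:00" \
        /tmp/hb_report-cleanup-freeze

hb_report bundles everything into a single tarball, which is what
the developers will want attached to the bugzilla.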

> and how do I recover?

It looks like the easiest way to recover is to restart openais on
both nodes. If you want to keep the services running, first set all
resources to unmanaged mode (set maintenance-mode to true).
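Something like this should do it (a sketch; the exact name of the
openais init script may differ on your distribution):

    # keep resources running while the cluster stack restarts
    crm configure property maintenance-mode=true

    # then, on each node in turn:
    /etc/init.d/openais restart

    # once both nodes have rejoined, hand control back:
    crm configure property maintenance-mode=false

Check with crm_mon that both nodes are back and the resource state
matches reality before clearing maintenance-mode.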

Thanks,

Dejan


> Cheers
> Stefan
> 
> 
> Jun  8 09:14:12 mudslide1 crm_resource: [9954]: info: Invoked: crm_resource --meta -r cl-GW_ping -g clone-max 
> Jun  8 09:14:12 mudslide1 cibadmin: [9955]: info: Invoked: cibadmin -Ql -o nodes 
> Jun  8 09:14:12 mudslide1 crm_resource: [9956]: info: Invoked: crm_resource -C -r prim-GW_ping:0 -H mudslide1 
> Jun  8 09:14:12 mudslide1 crmd: [1448]: info: do_lrm_invoke: Removing resource prim-GW_ping:0 from the LRM
> Jun  8 09:14:12 mudslide1 crmd: [1448]: info: send_direct_ack: ACK'ing resource op prim-GW_ping:0_delete_60000 from 0:0:crm-resource-9956: lrm_invoke-lrmd-1275981252-56
> Jun  8 09:14:12 mudslide1 crmd: [1448]: info: lrm_remove_deleted_op: Removing op prim-GW_ping:0_monitor_20000:287 for deleted resource prim-GW_ping:0
> Jun  8 09:14:12 mudslide1 cib: [9958]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-63.raw
> Jun  8 09:14:12 mudslide1 cib: [9958]: info: write_cib_contents: Wrote version 0.68.0 of the CIB to disk (digest: 386d9b04ef43976a2041310aed2ad2c5)
> Jun  8 09:14:12 mudslide1 cib: [9958]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.ctdEhc (digest: /var/lib/heartbeat/crm/cib.g5GfPa)
> Jun  8 09:14:12 mudslide1 crmd: [1448]: info: do_lrm_rsc_op: Performing key=30:607:7:55efa5bc-7399-4a63-94b5-51e3f6f583d4 op=prim-GW_ping:0_monitor_0 )
> Jun  8 09:14:12 mudslide1 lrmd: [1445]: notice: on_msg_perform_op: resource prim-GW_ping:0 is frozen, no ops can run.
> Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: do_lrm_rsc_op: Operation monitor on prim-GW_ping:0 failed: -1
> Jun  8 09:14:12 mudslide1 crmd: [1448]: WARN: do_log: FSA: Input I_FAIL from do_lrm_rsc_op() received in state S_NOT_DC
> Jun  8 09:14:12 mudslide1 crmd: [1448]: info: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_FAIL cause=C_FSA_INTERNAL origin=do_lrm_rsc_op ]
> Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
> Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
> Jun  8 09:14:12 mudslide1 crmd: [1448]: info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
> Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: verify_stopped: Resource prim-MySQL_db_test was active at shutdown.  You may ignore this error if it is unmanaged.
> Jun  8 09:14:12 mudslide1 crmd: [1448]: notice: ghash_print_pending_for_rsc: Recurring action prim-MySQL_db_test:312 (prim-MySQL_db_test_monitor_60000) incomplete at shutdown
> Jun  8 09:14:12 mudslide1 crmd: [1448]: notice: ghash_print_pending_for_rsc: Recurring action prim-MySQL_db_test:313 (prim-MySQL_db_test_monitor_120000) incomplete at shutdown
> Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: verify_stopped: Resource prim-MySQL_db_syslog was active at shutdown.  You may ignore this error if it is unmanaged.
> Jun  8 09:14:12 mudslide1 crmd: [1448]: notice: ghash_print_pending_for_rsc: Recurring action prim-MySQL_db_syslog:310 (prim-MySQL_db_syslog_monitor_60000) incomplete at shutdown
> Jun  8 09:14:12 mudslide1 crmd: [1448]: notice: ghash_print_pending_for_rsc: Recurring action prim-MySQL_db_syslog:311 (prim-MySQL_db_syslog_monitor_120000) incomplete at shutdown
> Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: verify_stopped: Resource prim-MySQL_db_nagios was active at shutdown.  You may ignore this error if it is unmanaged.
> Jun  8 09:14:12 mudslide1 crmd: [1448]: notice: ghash_print_pending_for_rsc: Recurring action prim-MySQL_db_nagios:316 (prim-MySQL_db_nagios_monitor_60000) incomplete at shutdown
> Jun  8 09:14:12 mudslide1 crmd: [1448]: notice: ghash_print_pending_for_rsc: Recurring action prim-MySQL_db_nagios:317 (prim-MySQL_db_nagios_monitor_120000) incomplete at shutdown
> Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: verify_stopped: Resource prim-MySQL_db_fax was active at shutdown.  You may ignore this error if it is unmanaged.
> Jun  8 09:14:12 mudslide1 crmd: [1448]: notice: ghash_print_pending_for_rsc: Recurring action prim-MySQL_db_fax:319 (prim-MySQL_db_fax_monitor_120000) incomplete at shutdown
> Jun  8 09:14:12 mudslide1 crmd: [1448]: notice: ghash_print_pending_for_rsc: Recurring action prim-MySQL_db_fax:318 (prim-MySQL_db_fax_monitor_60000) incomplete at shutdown
> Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: verify_stopped: Resource prim-DRBD_r0:0 was active at shutdown.  You may ignore this error if it is unmanaged.
> Jun  8 09:14:12 mudslide1 crmd: [1448]: notice: ghash_print_pending_for_rsc: Recurring action prim-DRBD_r0:0:303 (prim-DRBD_r0:0_monitor_29000) incomplete at shutdown
> Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: verify_stopped: Resource prim-MySQL_db_wiki was active at shutdown.  You may ignore this error if it is unmanaged.
> Jun  8 09:14:12 mudslide1 crmd: [1448]: notice: ghash_print_pending_for_rsc: Recurring action prim-MySQL_db_wiki:314 (prim-MySQL_db_wiki_monitor_60000) incomplete at shutdown
> Jun  8 09:14:12 mudslide1 crmd: [1448]: notice: ghash_print_pending_for_rsc: Recurring action prim-MySQL_db_wiki:315 (prim-MySQL_db_wiki_monitor_120000) incomplete at shutdown
> Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: verify_stopped: Resource prim-GW_ping:0 was active at shutdown.  You may ignore this error if it is unmanaged.
> Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: verify_stopped: Resource prim-Failover_IP was active at shutdown.  You may ignore this error if it is unmanaged.
> Jun  8 09:14:12 mudslide1 crmd: [1448]: notice: ghash_print_pending_for_rsc: Recurring action prim-Failover_IP:304 (prim-Failover_IP_monitor_60000) incomplete at shutdown
> Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: verify_stopped: Resource prim-FS_shared was active at shutdown.  You may ignore this error if it is unmanaged.
> Jun  8 09:14:12 mudslide1 crmd: [1448]: info: do_lrm_control: Disconnected from the LRM
> Jun  8 09:14:12 mudslide1 crmd: [1448]: info: do_ha_control: Disconnected from OpenAIS
> Jun  8 09:14:12 mudslide1 crmd: [1448]: info: do_cib_control: Disconnecting CIB
> Jun  8 09:14:12 mudslide1 crmd: [1448]: info: crmd_cib_connection_destroy: Connection to the CIB terminated...
> Jun  8 09:14:12 mudslide1 crmd: [1448]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
> Jun  8 09:14:12 mudslide1 crmd: [1448]: ERROR: do_exit: Could not recover from internal error
> Jun  8 09:14:12 mudslide1 crmd: [1448]: info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
> Jun  8 09:14:12 mudslide1 crmd: [1448]: info: do_exit: [crmd] stopped (2)
> Jun  8 09:14:12 mudslide1 cib: [1444]: WARN: send_ipc_message: IPC Channel to 1448 is not connected
> Jun  8 09:14:12 mudslide1 cib: [1444]: WARN: send_via_callback_channel: Delivery of reply to client 1448/36aac1d1-2aac-478d-accf-7f224b16f451 failed
> Jun  8 09:14:12 mudslide1 cib: [1444]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
> Jun  8 09:14:12 mudslide1 corosync[1433]:   [pcmk  ] info: pcmk_ipc_exit: Client crmd (conn=0x95a5008, async-conn=0x95a5008) left
> Jun  8 09:14:13 mudslide1 corosync[1433]:   [pcmk  ] ERROR: pcmk_wait_dispatch: Child process crmd exited (pid=1448, rc=2)
> Jun  8 09:14:13 mudslide1 corosync[1433]:   [pcmk  ] notice: pcmk_wait_dispatch: Respawning failed child process: crmd
> Jun  8 09:14:13 mudslide1 corosync[1433]:   [pcmk  ] info: spawn_child: Forked child 9985 for process crmd
> Jun  8 09:14:13 mudslide1 crmd: [9985]: info: Invoked: /usr/lib/heartbeat/crmd 
> Jun  8 09:14:13 mudslide1 crmd: [9985]: info: main: CRM Hg Version: d3fa20fc76c7947d6de66db7e52526dc6bd7d782
> Jun  8 09:14:13 mudslide1 crmd: [9985]: info: crmd_init: Starting crmd
> Jun  8 09:14:13 mudslide1 crmd: [9985]: info: G_main_add_SignalHandler: Added signal handler for signal 17
> Jun  8 09:14:13 mudslide1 crmd: [9985]: info: do_cib_control: CIB connection established
> Jun  8 09:14:13 mudslide1 crmd: [9985]: info: crm_cluster_connect: Connecting to OpenAIS
> Jun  8 09:14:13 mudslide1 crmd: [9985]: info: init_ais_connection: Creating connection to our AIS plugin
> Jun  8 09:14:13 mudslide1 crmd: [9985]: info: init_ais_connection: AIS connection established
> Jun  8 09:14:13 mudslide1 corosync[1433]:   [pcmk  ] info: pcmk_ipc: Recorded connection 0x95a5008 for crmd/9985
> Jun  8 09:14:13 mudslide1 corosync[1433]:   [pcmk  ] info: pcmk_ipc: Sending membership update 24 to crmd
> 
