[Pacemaker] error: do_exit: Could not recover from internal error
Andrew Beekhof
andrew at beekhof.net
Wed May 22 23:05:05 UTC 2013
On 22/05/2013, at 9:44 PM, Brian J. Murrell <brian at interlinx.bc.ca> wrote:
> Using pacemaker 1.1.8-7 on EL6, I got the following series of events while
> trying to shut down pacemaker and then corosync. The corosync shutdown
> (service corosync stop) ended up spinning/hanging indefinitely (~7 hours
> now). The events included a:
>
> May 21 23:47:18 node1 crmd[17598]: error: do_exit: Could not recover from internal error
>
> For completeness, I've included the logging from the whole session,
> from corosync startup until it all went haywire. The badness seems to
> start at about 23:43:18.
>
> May 21 23:42:51 node1 corosync[17541]: [MAIN ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
> May 21 23:42:51 node1 corosync[17541]: [MAIN ] Corosync built-in features: nss dbus rdma snmp
> May 21 23:42:51 node1 corosync[17541]: [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
> May 21 23:42:51 node1 corosync[17541]: [TOTEM ] Initializing transport (UDP/IP Multicast).
> May 21 23:42:51 node1 corosync[17541]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> May 21 23:42:51 node1 corosync[17541]: [TOTEM ] Initializing transport (UDP/IP Multicast).
> May 21 23:42:51 node1 corosync[17541]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> May 21 23:42:51 node1 corosync[17541]: [TOTEM ] The network interface [192.168.122.253] is now up.
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: process_ais_conf: Reading configure
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] ERROR: process_ais_conf: You have configured a cluster using the Pacemaker plugin for Corosync. The plugin is not supported in this environment and will be removed very soon.
>
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] ERROR: process_ais_conf: Please see Chapter 8 of 'Clusters from Scratch' (http://www.clusterlabs.org/doc) for details on using Pacemaker with CMAN
Can I take a brief moment to highlight those last two lines...
And also refer you to http://blog.clusterlabs.org/blog/2013/pacemaker-on-rhel6-dot-4/ which discusses the fencing issues that affect plugin-based clusters.
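For reference (and as a guess at what your corosync.conf contains), the plugin is loaded by a service block along these lines; moving to CMAN removes it entirely:

    service {
        # Loads the Pacemaker plugin.  ver: 1 means pacemakerd is started
        # separately by the pacemaker init script (MCP mode), which matches
        # the "Enabling MCP mode" line further down in your log.
        name: pacemaker
        ver:  1
    }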
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: config_find_init: Local handle: 5650605097994944513 for logging
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: config_find_next: Processing additional logging options...
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Found 'off' for option: debug
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Found 'no' for option: to_logfile
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Found 'yes' for option: to_syslog
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Defaulting to 'daemon' for option: syslog_facility
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: config_find_init: Local handle: 2730409743423111170 for quorum
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: config_find_next: No additional configuration supplied for: quorum
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: No default for option: provider
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: config_find_init: Local handle: 5880381755227111427 for service
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: config_find_next: Processing additional service options...
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Found '1' for option: ver
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: process_ais_conf: Enabling MCP mode: Use the Pacemaker init script to complete Pacemaker startup
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_logd
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_mgmtd
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: pcmk_startup: CRM: Initialized
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] Logging: Initialized pcmk_startup
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: pcmk_startup: Service: 9
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: pcmk_startup: Local hostname: node1
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: pcmk_update_nodeid: Local node id: 4252674240
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: update_member: Creating entry for node 4252674240 born on 0
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: update_member: 0x12c4640 Node 4252674240 now known as node1 (was: (null))
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: update_member: Node node1 now has 1 quorum votes (was 0)
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: update_member: Node 4252674240/node1 is now: member
> May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: Pacemaker Cluster Manager 1.1.8
> May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
> May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: corosync configuration service
> May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
> May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
> May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: corosync profile loading service
> May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
> May 21 23:42:51 node1 corosync[17541]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
> May 21 23:42:51 node1 corosync[17541]: [TOTEM ] The network interface [10.0.0.253] is now up.
> May 21 23:42:52 node1 corosync[17541]: [TOTEM ] Incrementing problem counter for seqid 1 iface 10.0.0.253 to [1 of 10]
> May 21 23:42:52 node1 corosync[17541]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 4: memb=0, new=0, lost=0
> May 21 23:42:52 node1 corosync[17541]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 4: memb=1, new=1, lost=0
> May 21 23:42:52 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: NEW: node1 4252674240
> May 21 23:42:52 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: MEMB: node1 4252674240
> May 21 23:42:52 node1 corosync[17541]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> May 21 23:42:52 node1 corosync[17541]: [CPG ] chosen downlist: sender r(0) ip(192.168.122.253) r(1) ip(10.0.0.253) ; members(old:0 left:0)
> May 21 23:42:52 node1 corosync[17541]: [MAIN ] Completed service synchronization, ready to provide service.
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 8: memb=1, new=0, lost=0
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: memb: node1 4252674240
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 8: memb=2, new=1, lost=0
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: update_member: Creating entry for node 2608507072 born on 8
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: update_member: Node 2608507072/unknown is now: member
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: NEW: .pending. 2608507072
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: MEMB: .pending. 2608507072
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: MEMB: node1 4252674240
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: send_member_notification: Sending membership update 8 to 0 children
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: update_member: 0x12c4640 Node 4252674240 ((null)) born on: 8
> May 21 23:42:54 node1 corosync[17541]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: update_member: 0x12c8b50 Node 2608507072 (node2) born on: 8
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: update_member: 0x12c8b50 Node 2608507072 now known as node2 (was: (null))
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: update_member: Node node2 now has 1 quorum votes (was 0)
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: send_member_notification: Sending membership update 8 to 0 children
> May 21 23:42:54 node1 corosync[17541]: [CPG ] chosen downlist: sender r(0) ip(192.168.122.155) r(1) ip(10.0.0.155) ; members(old:1 left:0)
> May 21 23:42:54 node1 corosync[17541]: [MAIN ] Completed service synchronization, ready to provide service.
> May 21 23:42:54 node1 corosync[17541]: [TOTEM ] ring 1 active with no faults
> May 21 23:42:56 node1 pacemakerd[17587]: notice: main: Starting Pacemaker 1.1.8-7.wc1.el6 (Build: 394e906): generated-manpages agent-manpages ascii-docs ncurses libqb-logging libqb-ipc lha-fencing corosync-plugin cman
> May 21 23:42:56 node1 pacemakerd[17587]: notice: get_local_node_name: Defaulting to uname(2).nodename for the local classic openais (with plugin) node name
> May 21 23:42:56 node1 pacemakerd[17587]: notice: update_node_processes: 0x1ec7cc0 Node 4252674240 now known as node1, was:
> May 21 23:42:56 node1 stonith-ng[17594]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
> May 21 23:42:56 node1 cib[17593]: notice: main: Using new config location: /var/lib/pacemaker/cib
> May 21 23:42:56 node1 cib[17593]: warning: retrieveCib: Cluster configuration not found: /var/lib/pacemaker/cib/cib.xml
> May 21 23:42:56 node1 cib[17593]: warning: readCibXmlFile: Primary configuration corrupt or unusable, trying backup...
> May 21 23:42:56 node1 cib[17593]: warning: readCibXmlFile: Continuing with an empty configuration.
> May 21 23:42:56 node1 attrd[17596]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
> May 21 23:42:56 node1 corosync[17541]: [pcmk ] info: pcmk_ipc: Recorded connection 0x12d6560 for stonith-ng/0
> May 21 23:42:56 node1 crmd[17598]: notice: main: CRM Git Version: 394e906
> May 21 23:42:56 node1 corosync[17541]: [pcmk ] info: pcmk_ipc: Recorded connection 0x12da8e0 for attrd/0
> May 21 23:42:56 node1 attrd[17596]: notice: main: Starting mainloop...
> May 21 23:42:56 node1 cib[17593]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
> May 21 23:42:56 node1 corosync[17541]: [pcmk ] info: pcmk_ipc: Recorded connection 0x12df4c0 for cib/0
> May 21 23:42:56 node1 corosync[17541]: [pcmk ] info: pcmk_ipc: Sending membership update 8 to cib
> May 21 23:42:56 node1 cib[17593]: notice: ais_dispatch_message: Membership 8: quorum acquired
> May 21 23:42:56 node1 cib[17593]: notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now member
> May 21 23:42:56 node1 cib[17593]: notice: crm_update_peer_state: crm_update_ais_node: Node node2[2608507072] - state is now member
> May 21 23:42:57 node1 pacemakerd[17587]: notice: update_node_processes: 0x1ec8c20 Node 2608507072 now known as node2, was:
> May 21 23:42:57 node1 stonith-ng[17594]: notice: setup_cib: Watching for stonith topology changes
> May 21 23:42:57 node1 crmd[17598]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
> May 21 23:42:57 node1 corosync[17541]: [pcmk ] info: pcmk_ipc: Recorded connection 0x12e4340 for crmd/0
> May 21 23:42:57 node1 corosync[17541]: [pcmk ] info: pcmk_ipc: Sending membership update 8 to crmd
> May 21 23:42:57 node1 crmd[17598]: notice: ais_dispatch_message: Membership 8: quorum acquired
> May 21 23:42:57 node1 crmd[17598]: notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now member
> May 21 23:42:57 node1 crmd[17598]: notice: crm_update_peer_state: crm_update_ais_node: Node node2[2608507072] - state is now member
> May 21 23:42:57 node1 crmd[17598]: error: crmd_ais_dispatch: Recieving messages from a node we think is dead: node1[-42293056]
> May 21 23:42:57 node1 crmd[17598]: notice: do_started: The local CRM is operational
> May 21 23:42:58 node1 cib[17593]: warning: cib_process_diff: Diff 0.0.0 -> 0.0.1 from node2 not applied to 0.0.1: current "num_updates" is greater than required
> May 21 23:42:58 node1 crmd[17598]: error: crmd_ais_dispatch: Recieving messages from a node we think is dead: node2[-1686460224]
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 12: memb=2, new=0, lost=0
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: memb: node2 2608507072
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: memb: node1 4252674240
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 12: memb=3, new=1, lost=0
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: update_member: Creating entry for node 2474289344 born on 12
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: update_member: Node 2474289344/unknown is now: member
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: NEW: .pending. 2474289344
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: MEMB: .pending. 2474289344
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: MEMB: node2 2608507072
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: MEMB: node1 4252674240
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: send_member_notification: Sending membership update 12 to 2 children
> May 21 23:43:01 node1 corosync[17541]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> May 21 23:43:01 node1 crmd[17598]: notice: crm_update_peer_state: crm_update_ais_node: Node (null)[2474289344] - state is now member
> May 21 23:43:01 node1 cib[17593]: notice: crm_update_peer_state: crm_update_ais_node: Node (null)[2474289344] - state is now member
> May 21 23:43:18 node1 crmd[17598]: warning: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
> May 21 23:45:18 node1 crmd[17598]: error: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped in state S_ELECTION! (120000ms)
> May 21 23:45:18 node1 crmd[17598]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: Diff: --- 0.0.3
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: Diff: +++ 0.1.1 2d896a081763a7b2ca54f82877fbbc84
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: -- <cib admin_epoch="0" epoch="0" num_updates="3" />
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: ++ <cluster_property_set id="cib-bootstrap-options" >
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: ++ <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.8-7.wc1.el6-394e906" />
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: ++ </cluster_property_set>
> May 21 23:45:18 node1 cib[17593]: notice: log_cib_diff: cib:diff: Local-only Change: 0.2.1
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: -- <cib admin_epoch="0" epoch="1" num_updates="1" />
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: ++ <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="classic openais (with plugin)" />
> May 21 23:45:18 node1 cib[17593]: notice: log_cib_diff: cib:diff: Local-only Change: 0.3.1
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: -- <cib admin_epoch="0" epoch="2" num_updates="1" />
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: ++ <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="2" />
> May 21 23:46:33 node1 pacemakerd[17587]: notice: pcmk_shutdown_worker: Shuting down Pacemaker
> May 21 23:46:33 node1 pacemakerd[17587]: notice: stop_child: Stopping crmd: Sent -15 to process 17598
> May 21 23:46:33 node1 crmd[17598]: notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms
> May 21 23:47:18 node1 crmd[17598]: error: do_log: FSA: Input I_ERROR from feature_update_callback() received in state S_INTEGRATION
> May 21 23:47:18 node1 crmd[17598]: warning: do_state_transition: State transition S_INTEGRATION -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=feature_update_callback ]
> May 21 23:47:18 node1 crmd[17598]: error: do_recover: Action A_RECOVER (0000000001000000) not supported
> May 21 23:47:18 node1 crmd[17598]: warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
> May 21 23:47:18 node1 crmd[17598]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
> May 21 23:47:18 node1 crmd[17598]: notice: terminate_cs_connection: Disconnecting from Corosync
> May 21 23:47:18 node1 corosync[17541]: [pcmk ] info: pcmk_ipc_exit: Client crmd (conn=0x12e4340, async-conn=0x12e4340) left
> May 21 23:47:18 node1 crmd[17598]: error: do_exit: Could not recover from internal error
> May 21 23:47:18 node1 pacemakerd[17587]: error: pcmk_child_exit: Child process crmd exited (pid=17598, rc=2)
> May 21 23:47:33 node1 pacemakerd[17587]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
> May 21 23:47:33 node1 pacemakerd[17587]: notice: stop_child: Stopping pengine: Sent -15 to process 17597
> May 21 23:47:48 node1 pacemakerd[17587]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
> May 21 23:47:48 node1 pacemakerd[17587]: notice: stop_child: Stopping attrd: Sent -15 to process 17596
> May 21 23:47:48 node1 attrd[17596]: notice: main: Exiting...
> May 21 23:47:48 node1 corosync[17541]: [pcmk ] info: pcmk_ipc_exit: Client attrd (conn=0x12da8e0, async-conn=0x12da8e0) left
> May 21 23:48:03 node1 pacemakerd[17587]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
> May 21 23:48:03 node1 pacemakerd[17587]: notice: stop_child: Stopping lrmd: Sent -15 to process 17595
> May 21 23:48:18 node1 pacemakerd[17587]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
> May 21 23:48:18 node1 pacemakerd[17587]: notice: stop_child: Stopping stonith-ng: Sent -15 to process 17594
> May 21 23:48:18 node1 corosync[17541]: [pcmk ] info: pcmk_ipc_exit: Client stonith-ng (conn=0x12d6560, async-conn=0x12d6560) left
> May 21 23:48:33 node1 pacemakerd[17587]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
> May 21 23:48:33 node1 pacemakerd[17587]: notice: stop_child: Stopping cib: Sent -15 to process 17593
> May 21 23:48:38 node1 cib[17593]: notice: cib_force_exit: Forcing exit!
> May 21 23:48:38 node1 corosync[17541]: [pcmk ] info: pcmk_ipc_exit: Client cib (conn=0x12df4c0, async-conn=0x12df4c0) left
> May 21 23:48:38 node1 pacemakerd[17587]: error: pcmk_child_exit: Child process cib exited (pid=17593, rc=64)
> May 21 23:48:53 node1 pacemakerd[17587]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
> May 21 23:48:53 node1 pacemakerd[17587]: notice: pcmk_shutdown_worker: Shutdown complete
>
> At this point I tried to shut down corosync as described above, but it
> just hung/spun indefinitely during shutdown.
It looks to me like pacemaker shut down...
Is the other node in this cluster running the same version of pacemaker?
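(If you're not sure, compare the output of the following on both nodes:)

    rpm -q pacemaker corosync
    crm_mon --version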
>
> Any ideas what went wrong here?
It seems to have started with
> May 21 23:42:58 node1 crmd[17598]: error: crmd_ais_dispatch: Recieving messages from a node we think is dead: node2[-1686460224]
which led to
> May 21 23:45:18 node1 crmd[17598]: error: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped in state S_ELECTION! (120000ms)
and
> May 21 23:47:18 node1 crmd[17598]: error: do_log: FSA: Input I_ERROR from feature_update_callback() received in state S_INTEGRATION
This is not a happy cluster. Are these real machines or VMs?
Firewall config ok?
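If iptables is running on either node, corosync's UDP traffic needs to be let through. A rough sketch, assuming the default mcastport of 5405 (adjust to whatever your corosync.conf actually uses):

    # corosync uses mcastport and mcastport - 1
    iptables -I INPUT -p udp -m multiport --dports 5404,5405 -j ACCEPT
    service iptables save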
Also, 1.1.8-7 was not tested with the plugin _at_all_ (and neither will future RHEL builds be).
I highly recommend moving to CMAN for RHEL6.
Check out the quickstart: http://clusterlabs.org/quickstart-redhat.html
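From memory, the short version looks roughly like this (the quickstart has the full fence_pcmk pass-through setup; the cluster name below is just a placeholder):

    yum install -y pacemaker cman ccs resource-agents
    # CMAN provides membership/quorum, Pacemaker manages the resources
    ccs -f /etc/cluster/cluster.conf --createcluster mycluster
    ccs -f /etc/cluster/cluster.conf --addnode node1
    ccs -f /etc/cluster/cluster.conf --addnode node2
    # ...then wire CMAN fencing through fence_pcmk as per the quickstart, and:
    service cman start
    service pacemaker start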