[Pacemaker] issues when installing on pxe booted environment

John White jwhite at lbl.gov
Fri Mar 22 14:31:13 EDT 2013


Hello Folks,
We're trying to get a corosync/pacemaker instance going on a 4-node cluster that boots via PXE. We have hit a number of state and file system issues, but those appear to be *mostly* taken care of so far. We're now running into an issue where cib won't stay up, with errors like the following (sorry for the lengthy dump; note the attrd and cib connection errors). Any ideas would be greatly appreciated:

Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating RNG parser context
Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: /usr/lib64/heartbeat/attrd 
Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster type is: 'corosync'
Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could not connect to the Cluster Process Group API: 2
Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed
Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection active
Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute updates
Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup
Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked: /usr/lib64/heartbeat/pengine 
Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old instances of pengine
Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pengine
Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child process attrd exited (pid=25841, rc=100)
Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child process attrd no longer wishes to be respawned
Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes: Node n0014.lustre now has process list: 00000000000000000000000000110312 (was 00000000000000000000000000111312)
Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/pengine
Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms
Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine
Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_cpg_connection: Adding fd=4 to mainloop
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: init_ais_connection_once: Connection to 'corosync': established
Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: crm_new_peer: Creating entry for node n0014.lustre/247988234
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node n0014.lustre now has id: 247988234
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 247988234 is now known as n0014.lustre
Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pcmk
Mar 22 11:25:18 n0014 crmd: [25843]: info: Invoked: /usr/lib64/heartbeat/crmd 
Mar 22 11:25:18 n0014 pacemakerd: [25834]: debug: pcmk_client_connect: Channel 0x995530 connected: 1 children
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: main: Starting stonith-ng mainloop
Mar 22 11:25:18 n0014 crmd: [25843]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
Mar 22 11:25:18 n0014 crmd: [25843]: info: main: CRM Hg Version: a02c0f19a00c1eb2527ad38f146ebc0834814558
Mar 22 11:25:18 n0014 crmd: [25843]: info: crmd_init: Starting crmd
Mar 22 11:25:18 n0014 crmd: [25843]: debug: s_crmd_fsa: Processing I_STARTUP: [ state=S_STARTING cause=C_STARTUP origin=crmd_init ]
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_LOG   
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_STARTUP
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Registering Signal Handlers
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Creating CIB and LRM objects
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_update_peer: Node n0014.lustre: id=247988234 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000110312 (new)
Mar 22 11:25:18 n0014 crmd: [25843]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_CIB_START
Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_rw
Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/cib_rw
Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to command channel failed
Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_callback
Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/cib_callback
Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to callback channel failed
Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to CIB failed: connection failed
Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signoff: Signing out of the CIB Service
Mar 22 11:25:18 n0014 cib: [25839]: ERROR: Element cib failed to validate content
Mar 22 11:25:18 n0014 cib: [25839]: ERROR: readCibXmlFile: CIB does not validate with <null>
Mar 22 11:25:18 n0014 cib: [25839]: info: startCib: CIB Initialization completed successfully
Mar 22 11:25:18 n0014 cib: [25839]: info: get_cluster_type: Cluster type is: 'corosync'
Mar 22 11:25:18 n0014 cib: [25839]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Mar 22 11:25:18 n0014 cib: [25839]: ERROR: init_cpg_connection: Could not connect to the Cluster Process Group API: 2
Mar 22 11:25:18 n0014 cib: [25839]: CRIT: cib_init: Cannot sign in to the cluster... terminating
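
To make the attrd and cib failures easier to spot in a dump this long, here is a minimal Python sketch that keeps only the ERROR and CRIT lines. The line layout in the regular expression is an assumption inferred from the lines above, and the function name `severe_lines` is hypothetical, not part of any Pacemaker tooling.

```python
import re

# Assumed syslog layout, inferred from the dump above:
# "Mon DD HH:MM:SS host daemon: [pid]: LEVEL: message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\s(?P<daemon>[\w-]+):\s"
    r"\[(?P<pid>\d+)\]:\s(?P<level>\w+):\s(?P<msg>.*)$"
)


def severe_lines(lines, levels=("ERROR", "CRIT")):
    """Return (daemon, level, message) for each line logged at one of `levels`."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("level") in levels:
            hits.append((m.group("daemon"), m.group("level"), m.group("msg")))
    return hits


# Three lines copied from the dump above, used as sample input.
sample = [
    "Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: "
    "Could not connect to the Cluster Process Group API: 2",
    "Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up",
    "Mar 22 11:25:18 n0014 cib: [25839]: CRIT: cib_init: "
    "Cannot sign in to the cluster... terminating",
]

for daemon, level, msg in severe_lines(sample):
    print(f"{daemon:<10} {level:<5} {msg}")
```

Run against the full dump, this should leave the two `init_cpg_connection` failures (attrd and cib), the CIB validation errors, and the pacemakerd report of attrd exiting.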


----------------
John White
HPC Systems Engineer
(510) 486-7307
One Cyclotron Rd, MS: 50C-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720




