[Pacemaker] issues when installing on pxe booted environment

Andrew Beekhof andrew at beekhof.net
Wed Mar 27 23:46:19 UTC 2013


What about /dev/shm?
Libqb creates its shared-memory files in that location by default, so it needs to exist and be writable as well.
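
A rough sketch of a quick check, in the same spirit as the /var/run/crm test quoted below. The function name check_shm_dir is made up for illustration (it is not part of libqb or Pacemaker), and it assumes a POSIX shell with mktemp available:

```shell
# Hypothetical helper: report whether the current user can create
# files in a directory (defaults to /dev/shm).
check_shm_dir() {
    dir="${1:-/dev/shm}"

    # The directory has to exist at all; on a stateless image it may not.
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi

    # Try to create a scratch file, then clean it up again.
    probe=$(mktemp "$dir/probe.XXXXXX" 2>/dev/null) || {
        echo "not writable: $dir"
        return 1
    }
    rm -f "$probe"
    echo "ok: $dir"
}
```

Run it as the hacluster user, e.g. "check_shm_dir /dev/shm"; anything other than "ok: /dev/shm" would be worth chasing before looking further at the daemons.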

On Thu, Mar 28, 2013 at 8:50 AM, John White <jwhite at lbl.gov> wrote:
> Yup:
> -bash-4.1$ cd /var/run/crm/
> -bash-4.1$ ls
> lost+found  pcmk  pengine  st_callback  st_command
> -bash-4.1$ touch blah
> -bash-4.1$ ls -l
> total 16
> -rw-r--r-- 1 hacluster haclient     0 Mar 27 14:50 blah
> drwx------ 2 root      root     16384 Mar 14 15:00 lost+found
> srwxrwxrwx 1 root      root         0 Mar 22 11:25 pcmk
> srwxrwxrwx 1 hacluster root         0 Mar 22 11:25 pengine
> srwxrwxrwx 1 root      root         0 Mar 22 11:25 st_callback
> srwxrwxrwx 1 root      root         0 Mar 22 11:25 st_command
> -bash-4.1$ ls -l /var/run/| grep crm
> drwxr-xr-x 3 hacluster haclient 4096 Mar 27 14:50 crm
> -bash-4.1$ whoami
> hacluster
> -bash-4.1$
> ----------------
> John White
> HPC Systems Engineer
> (510) 486-7307
> One Cyclotron Rd, MS: 50C-3209C
> Lawrence Berkeley National Lab
> Berkeley, CA 94720
>
> On Mar 25, 2013, at 4:21 PM, Andreas Kurz <andreas at hastexo.com> wrote:
>
>> On 2013-03-22 19:31, John White wrote:
>>> Hello Folks,
>>>      We're trying to get a corosync/pacemaker instance going on a 4-node cluster that boots via PXE.  There have been a number of state/file system issues, but those appear to be *mostly* taken care of so far.  We're now running into an issue where cib won't stay up, with errors like the following (sorry for the lengthy dump; note the attrd and cib connection errors).  Any ideas would be greatly appreciated:
>>>
>>> Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating RNG parser context
>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: /usr/lib64/heartbeat/attrd
>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster type is: 'corosync'
>>> Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could not connect to the Cluster Process Group API: 2
>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed
>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection active
>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute updates
>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup
>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked: /usr/lib64/heartbeat/pengine
>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old instances of pengine
>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pengine
>>
>> Is that "/var/run/crm" directory available, owned by
>> hacluster.haclient, and writable by at least the hacluster user?
>>
>> Regards,
>> Andreas
>>
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child process attrd exited (pid=25841, rc=100)
>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child process attrd no longer wishes to be respawned
>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes: Node n0014.lustre now has process list: 00000000000000000000000000110312 (was 00000000000000000000000000111312)
>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/pengine
>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms
>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine
>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_cpg_connection: Adding fd=4 to mainloop
>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: init_ais_connection_once: Connection to 'corosync': established
>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: crm_new_peer: Creating entry for node n0014.lustre/247988234
>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node n0014.lustre now has id: 247988234
>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 247988234 is now known as n0014.lustre
>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pcmk
>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: Invoked: /usr/lib64/heartbeat/crmd
>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: debug: pcmk_client_connect: Channel 0x995530 connected: 1 children
>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: main: Starting stonith-ng mainloop
>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: main: CRM Hg Version: a02c0f19a00c1eb2527ad38f146ebc0834814558
>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: crmd_init: Starting crmd
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: s_crmd_fsa: Processing I_STARTUP: [ state=S_STARTING cause=C_STARTUP origin=crmd_init ]
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_LOG
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_STARTUP
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Registering Signal Handlers
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Creating CIB and LRM objects
>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_update_peer: Node n0014.lustre: id=247988234 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000110312 (new)
>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: G_main_add_SignalHandler: Added signal handler for signal 17
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_CIB_START
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_rw
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/cib_rw
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to command channel failed
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_callback
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/cib_callback
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to callback channel failed
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to CIB failed: connection failed
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signoff: Signing out of the CIB Service
>>> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: Element cib failed to validate content
>>> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: readCibXmlFile: CIB does not validate with <null>
>>> Mar 22 11:25:18 n0014 cib: [25839]: info: startCib: CIB Initialization completed successfully
>>> Mar 22 11:25:18 n0014 cib: [25839]: info: get_cluster_type: Cluster type is: 'corosync'
>>> Mar 22 11:25:18 n0014 cib: [25839]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
>>> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: init_cpg_connection: Could not connect to the Cluster Process Group API: 2
>>> Mar 22 11:25:18 n0014 cib: [25839]: CRIT: cib_init: Cannot sign in to the cluster... terminating
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>



