[Pacemaker] issues when installing on pxe booted environment
John White
jwhite at lbl.gov
Thu Apr 11 20:39:25 UTC 2013
Ah, /dev/shm was owned by root:root and writable only by root. Opening up its permissions seems to have kicked something the right way. Thanks, folks.
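
For anyone hitting the same symptom, the check can be sketched roughly as below. This is a minimal sketch, not necessarily what was run on these nodes: it assumes GNU `stat`, and the function name `check_shm_mode` is made up for illustration. Mode 1777 (world-writable with the sticky bit) is the conventional mode for /dev/shm on Linux.

```shell
#!/bin/sh
# Hypothetical helper: report whether a directory carries mode 1777,
# the conventional mode for /dev/shm (world-writable plus sticky bit).
check_shm_mode() {
    dir="${1:-/dev/shm}"
    # GNU stat: %a prints the access bits in octal, e.g. 1777 or 755.
    mode=$(stat -c '%a' "$dir") || return 2
    if [ "$mode" = "1777" ]; then
        echo "$dir: mode $mode (ok)"
        return 0
    fi
    echo "$dir: mode $mode (expected 1777)"
    return 1
}

# Only report here; the actual fix has to be applied as root.
check_shm_mode /dev/shm || echo "as root, try: chmod 1777 /dev/shm"
```

On a PXE-booted image the chmod presumably has to live wherever /dev/shm is set up at boot, otherwise it will be lost on the next reboot; where that is depends on the site's image.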
----------------
John White
HPC Systems Engineer
(510) 486-7307
One Cyclotron Rd, MS: 50C-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720
On Apr 11, 2013, at 1:37 PM, John White <jwhite at lbl.gov> wrote:
> Yep, we've definitely got /dev/shm (this was done to fix an earlier problem).
>
> On Mar 27, 2013, at 4:46 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
>
>> What about /dev/shm?
>> Libqb tries to create some shared memory in that location by default.
>>
>> On Thu, Mar 28, 2013 at 8:50 AM, John White <jwhite at lbl.gov> wrote:
>>> Yup:
>>> -bash-4.1$ cd /var/run/crm/
>>> -bash-4.1$ ls
>>> lost+found pcmk pengine st_callback st_command
>>> -bash-4.1$ touch blah
>>> -bash-4.1$ ls -l
>>> total 16
>>> -rw-r--r-- 1 hacluster haclient 0 Mar 27 14:50 blah
>>> drwx------ 2 root root 16384 Mar 14 15:00 lost+found
>>> srwxrwxrwx 1 root root 0 Mar 22 11:25 pcmk
>>> srwxrwxrwx 1 hacluster root 0 Mar 22 11:25 pengine
>>> srwxrwxrwx 1 root root 0 Mar 22 11:25 st_callback
>>> srwxrwxrwx 1 root root 0 Mar 22 11:25 st_command
>>> -bash-4.1$ ls -l /var/run/| grep crm
>>> drwxr-xr-x 3 hacluster haclient 4096 Mar 27 14:50 crm
>>> -bash-4.1$ whoami
>>> hacluster
>>> -bash-4.1$
>>>
>>> On Mar 25, 2013, at 4:21 PM, Andreas Kurz <andreas at hastexo.com> wrote:
>>>
>>>> On 2013-03-22 19:31, John White wrote:
>>>>> Hello Folks,
>>>>> We're trying to get a corosync/pacemaker instance going on a four-node cluster that boots via PXE. There have been a number of state and file system issues, but those appear to be *mostly* taken care of so far. We're now running into an issue where cib won't stay up, with errors like the following (sorry for the lengthy dump; note the attrd and cib connection errors). Any ideas would be greatly appreciated:
>>>>>
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating RNG parser context
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: /usr/lib64/heartbeat/attrd
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster type is: 'corosync'
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could not connect to the Cluster Process Group API: 2
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection active
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute updates
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked: /usr/lib64/heartbeat/pengine
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old instances of pengine
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pengine
>>>>
>>>> Is that "/var/run/crm" directory available, owned by
>>>> hacluster.haclient, and writable by at least the hacluster user?
>>>>
>>>> Regards,
>>>> Andreas
>>>>
>>>> --
>>>> Need help with Pacemaker?
>>>> http://www.hastexo.com/now
>>>>
>>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child process attrd exited (pid=25841, rc=100)
>>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child process attrd no longer wishes to be respawned
>>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes: Node n0014.lustre now has process list: 00000000000000000000000000110312 (was 00000000000000000000000000111312)
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/pengine
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_cpg_connection: Adding fd=4 to mainloop
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: init_ais_connection_once: Connection to 'corosync': established
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: crm_new_peer: Creating entry for node n0014.lustre/247988234
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node n0014.lustre now has id: 247988234
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 247988234 is now known as n0014.lustre
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pcmk
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: Invoked: /usr/lib64/heartbeat/crmd
>>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: debug: pcmk_client_connect: Channel 0x995530 connected: 1 children
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: main: Starting stonith-ng mainloop
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: main: CRM Hg Version: a02c0f19a00c1eb2527ad38f146ebc0834814558
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: crmd_init: Starting crmd
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: s_crmd_fsa: Processing I_STARTUP: [ state=S_STARTING cause=C_STARTUP origin=crmd_init ]
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_LOG
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_STARTUP
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Registering Signal Handlers
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Creating CIB and LRM objects
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_update_peer: Node n0014.lustre: id=247988234 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000110312 (new)
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: G_main_add_SignalHandler: Added signal handler for signal 17
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_CIB_START
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_rw
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/cib_rw
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to command channel failed
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_callback
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/cib_callback
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to callback channel failed
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to CIB failed: connection failed
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signoff: Signing out of the CIB Service
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: Element cib failed to validate content
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: readCibXmlFile: CIB does not validate with <null>
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: info: startCib: CIB Initialization completed successfully
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: info: get_cluster_type: Cluster type is: 'corosync'
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: init_cpg_connection: Could not connect to the Cluster Process Group API: 2
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: CRIT: cib_init: Cannot sign in to the cluster... terminating
>>>>>
>>>>> _______________________________________________
>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>
>>>>> Project Home: http://www.clusterlabs.org
>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>> Bugs: http://bugs.clusterlabs.org
>>>>>