[Pacemaker] issues when installing on pxe booted environment

Fri Mar 29 00:37:37 UTC 2013

On Thu, Mar 28, 2013 at 10:43 PM, Rainer Brestan <rainer.brestan at gmx.net> wrote:
> Hi John,
> to get Corosync/Pacemaker running during anaconda installation, I have
> created a configuration RPM package which does a few actions before starting
> Corosync and Pacemaker.
>
> An excerpt of the post-install script of this RPM:
> # mount /dev/shm if it does not already exist, otherwise openais cannot work
> if [ ! -d /dev/shm ]; then
>     mkdir /dev/shm
>     mount /dev/shm
> fi

Perhaps mention this to the corosync guys; it should probably go into
their init script.
I'd put it in pacemaker, but that's likely too late.
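
For illustration only, here is a minimal sketch of what such a check might
look like in an init script. The function names is_mounted and ensure_shm are
made up for this example and are not part of corosync or pacemaker. It assumes
a tmpfs mount is what is wanted and that /proc/mounts is readable. Unlike a
plain directory test, it also covers the case where /dev/shm exists but
nothing is mounted on it.

```shell
#!/bin/sh
# Hypothetical helper: succeed if DIR appears as a mount point in MOUNTS_FILE
# (defaults to /proc/mounts). Not part of any corosync/pacemaker script.
is_mounted() {
    dir=$1
    mounts_file=${2:-/proc/mounts}
    # Field 2 of each line in /proc/mounts is the mount point.
    awk -v d="$dir" '$2 == d { found = 1 } END { exit !found }' "$mounts_file"
}

# Hypothetical helper: create DIR and mount a tmpfs on it, but only when
# nothing is mounted there yet. Needs root when it actually has to mount.
ensure_shm() {
    dir=${1:-/dev/shm}
    if is_mounted "$dir"; then
        return 0
    fi
    mkdir -p "$dir" && mount -t tmpfs tmpfs "$dir"
}

# Intended use, early in the init script and before the daemon starts:
#   ensure_shm /dev/shm || exit 1
```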

> # resource agents might run as different user
> chmod -R go+rwx /var/lib/heartbeat/cores

I'm about to change the permissions to 775 for this.  Would that be sufficient?

    build_path(CRM_CORE_DIR, 0755);
    mcp_chown(CRM_CORE_DIR, pcmk_uid, pcmk_gid);
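
To see what either variant leaves behind on a node, something like the
following could be run after startup. It is only a sketch: describe_dir and
group_writable are hypothetical helper names, and it assumes GNU stat (the -c
option).

```shell
#!/bin/sh
# Hypothetical helper: print "MODE OWNER:GROUP" for a path (GNU stat assumed).
describe_dir() {
    stat -c '%a %U:%G' "$1"
}

# Hypothetical helper: succeed if the group write bit is set on the path.
# The group digit is the second-to-last digit of the octal mode; the write
# bit is set when that digit is 2, 3, 6 or 7.
group_writable() {
    mode=$(stat -c '%a' "$1") || return 1
    case $mode in
        *[2367]?) return 0 ;;
        *)        return 1 ;;
    esac
}

# Intended use on a cluster node:
#   describe_dir /var/lib/heartbeat/cores
#   group_writable /var/lib/heartbeat/cores && echo "group can write"
```

With 775 the group check succeeds, but users outside the owner and group
still cannot write, which is the difference from the recursive go+rwx above.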

>
> Rainer
>
> Sent: Thursday, 28 March 2013, 00:46
> From: "Andrew Beekhof" <andrew at beekhof.net>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Subject: Re: [Pacemaker] issues when installing on pxe booted environment
> What about /dev/shm?
> libqb tries to create some shared memory in that location by default.
>
> On Thu, Mar 28, 2013 at 8:50 AM, John White <jwhite at lbl.gov> wrote:
>> Yup:
>> -bash-4.1$ cd /var/run/crm/
>> -bash-4.1$ ls
>> lost+found pcmk pengine st_callback st_command
>> -bash-4.1$ touch blah
>> -bash-4.1$ ls -l
>> total 16
>> -rw-r--r-- 1 hacluster haclient 0 Mar 27 14:50 blah
>> drwx------ 2 root root 16384 Mar 14 15:00 lost+found
>> srwxrwxrwx 1 root root 0 Mar 22 11:25 pcmk
>> srwxrwxrwx 1 hacluster root 0 Mar 22 11:25 pengine
>> srwxrwxrwx 1 root root 0 Mar 22 11:25 st_callback
>> srwxrwxrwx 1 root root 0 Mar 22 11:25 st_command
>> -bash-4.1$ ls -l /var/run/| grep crm
>> drwxr-xr-x 3 hacluster haclient 4096 Mar 27 14:50 crm
>> -bash-4.1$ whoami
>> hacluster
>> -bash-4.1$
>> ----------------
>> John White
>> HPC Systems Engineer
>> (510) 486-7307
>> One Cyclotron Rd, MS: 50C-3209C
>> Lawrence Berkeley National Lab
>> Berkeley, CA 94720
>>
>> On Mar 25, 2013, at 4:21 PM, Andreas Kurz <andreas at hastexo.com> wrote:
>>
>>> On 2013-03-22 19:31, John White wrote:
>>>> Hello Folks,
>>>> We're trying to get a corosync/pacemaker instance going on a 4 node
>>>> cluster that boots via pxe. There have been a number of state/file system
>>>> issues, but those appear to be *mostly* taken care of thus far. We're
>>>> running into an issue now where cib just isn't staying up with errors akin
>>>> to the following (sorry for the lengthy dump, note the attrd and cib
>>>> connection errors). Any ideas would be greatly appreciated:
>>>>
>>>> Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng:
>>>> Creating RNG parser context
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked:
>>>> /usr/lib64/heartbeat/attrd
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed
>>>> active directory to /var/lib/heartbeat/cores/hacluster
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster
>>>> type is: 'corosync'
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect:
>>>> Connecting to cluster infrastructure: corosync
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could
>>>> not connect to the Cluster Process Group API: 2
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection
>>>> active
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute
>>>> updates
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked:
>>>> /usr/lib64/heartbeat/pengine
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker:
>>>> Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old
>>>> instances of pengine
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug:
>>>> init_client_ipc_comms_nodispatch: Attempting to talk on:
>>>> /var/run/crm/pengine
>>>
>>> That "/var/run/crm" directory is available and owned by
>>> hacluster.haclient ... and writable by at least the hacluster user?
>>>
>>> Regards,
>>> Andreas
>>>
>>> --
>>> Need help with Pacemaker?
>>> http://www.hastexo.com/now
>>>
>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child
>>>> process attrd exited (pid=25841, rc=100)
>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit:
>>>> Child process attrd no longer wishes to be respawned
>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes:
>>>> Node n0014.lustre now has process list: 00000000000000000000000000110312
>>>> (was 00000000000000000000000000111312)
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug:
>>>> init_client_ipc_comms_nodispatch: Could not init comms on:
>>>> /var/run/crm/pengine
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine
>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_cpg_connection:
>>>> Adding fd=4 to mainloop
>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info:
>>>> init_ais_connection_once: Connection to 'corosync': established
>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: crm_new_peer: Creating
>>>> entry for node n0014.lustre/247988234
>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node
>>>> n0014.lustre now has id: 247988234
>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node
>>>> 247988234 is now known as n0014.lustre
>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug:
>>>> init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pcmk
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: Invoked:
>>>> /usr/lib64/heartbeat/crmd
>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: debug: pcmk_client_connect:
>>>> Channel 0x995530 connected: 1 children
>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: main: Starting
>>>> stonith-ng mainloop
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: crm_log_init_worker: Changed
>>>> active directory to /var/lib/heartbeat/cores/hacluster
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: main: CRM Hg Version:
>>>> a02c0f19a00c1eb2527ad38f146ebc0834814558
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: crmd_init: Starting crmd
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: s_crmd_fsa: Processing
>>>> I_STARTUP: [ state=S_STARTING cause=C_STARTUP origin=crmd_init ]
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action:
>>>> actions:trace: #011// A_LOG
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action:
>>>> actions:trace: #011// A_STARTUP
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Registering
>>>> Signal Handlers
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Creating CIB and
>>>> LRM objects
>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_update_peer: Node
>>>> n0014.lustre: id=247988234 state=unknown addr=(null) votes=0 born=0 seen=0
>>>> proc=00000000000000000000000000110312 (new)
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: G_main_add_SignalHandler:
>>>> Added signal handler for signal 17
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action:
>>>> actions:trace: #011// A_CIB_START
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug:
>>>> init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_rw
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug:
>>>> init_client_ipc_comms_nodispatch: Could not init comms on:
>>>> /var/run/crm/cib_rw
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw:
>>>> Connection to command channel failed
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug:
>>>> init_client_ipc_comms_nodispatch: Attempting to talk on:
>>>> /var/run/crm/cib_callback
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug:
>>>> init_client_ipc_comms_nodispatch: Could not init comms on:
>>>> /var/run/crm/cib_callback
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw:
>>>> Connection to callback channel failed
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw:
>>>> Connection to CIB failed: connection failed
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signoff: Signing
>>>> out of the CIB Service
>>>> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: Element cib failed to
>>>> validate content
>>>> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: readCibXmlFile: CIB does not
>>>> validate with <null>
>>>> Mar 22 11:25:18 n0014 cib: [25839]: info: startCib: CIB Initialization
>>>> completed successfully
>>>> Mar 22 11:25:18 n0014 cib: [25839]: info: get_cluster_type: Cluster type
>>>> is: 'corosync'
>>>> Mar 22 11:25:18 n0014 cib: [25839]: notice: crm_cluster_connect:
>>>> Connecting to cluster infrastructure: corosync
>>>> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: init_cpg_connection: Could
>>>> not connect to the Cluster Process Group API: 2
>>>> Mar 22 11:25:18 n0014 cib: [25839]: CRIT: cib_init: Cannot sign in to
>>>> the cluster... terminating
>>>>
>>>>
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
>>>>