[Pacemaker] solaris problem

Grüninger, Andreas (LGL Extern) Andreas.Grueninger at lgl.bwl.de
Mon Mar 25 12:35:37 EDT 2013


Andrei

There is no need to make this change.

I described in 
http://grueni.github.com/libqb/ 
how I compiled libqb and the other programs.

LOCALSTATEDIR should be defined with ./configure.
Please look at "Compile Corosync" in my description.
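
As a rough sketch of what that means: --localstatedir is the standard autoconf option behind LOCALSTATEDIR. The values below are assumptions (the /opt/ha prefix mirrors APPPATH in the start script further down, and /var matches the /var/run paths in the logs); they are not the exact options from the description linked above, and configure_args is a hypothetical helper name.

```shell
#!/usr/bin/bash
# Assumed values; adjust to your installation.
PREFIX=/opt/ha          # assumption: mirrors APPPATH=/opt/ha/sbin/ in the start script
LOCALSTATEDIR=/var      # assumption: run-time files then end up under /var/run

# Print the configure options one per line so they can be reviewed first.
configure_args() {
    printf '%s\n' "--prefix=${PREFIX}" "--localstatedir=${LOCALSTATEDIR}"
}

# To actually configure, run in the source tree:
#   ./configure $(configure_args)
```

Printing the options instead of running configure directly keeps the sketch harmless outside a source tree.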

I guess your start scripts should be changed.

We use the following start script, which is called by the SMF instance:
######################
#!/usr/bin/bash
# Start/stop HACluster service
#
. /lib/svc/share/smf_include.sh

## Tracing with debug version
# PCMK_trace_files=1
# PCMK_trace_functions=1
# PCMK_trace_formats=1
# PCMK_trace_tags=1

export PCMK_ipc_type=socket
CLUSTER_USER=hacluster
COROSYNC=corosync
PACEMAKERD=pacemakerd
PACEMAKER_PROCESSES=pacemaker
APPPATH=/opt/ha/sbin/
SLEEPINTERVALL=10
SLEEPCOUNT=5
SLEPT=0


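# Kill every process whose command line matches $1.
killapp() {
   pid=`pgrep -f "$1"`
   if [ -n "$pid" ]; then
      # $pid is left unquoted on purpose: pgrep may return several PIDs
      kill -9 $pid
   fi
   return 0
}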

start0() {
        stop0
        su ${CLUSTER_USER} -c ${APPPATH}${COROSYNC}
        sleep $sleep0
        su ${CLUSTER_USER} -c ${APPPATH}${PACEMAKERD} &
        return 0
}

stop0() {
# first try, graceful shutdown
        pid=`pgrep -U ${CLUSTER_USER} -f ${PACEMAKERD}`
        if [ "x$pid" != "x" ]; then
           ${APPPATH}${PACEMAKERD} --shutdown 
           sleep $SLEEPINTERVALL
        fi
# second try, kill the rest
        killapp ${APPPATH}${COROSYNC}
        killapp ${PACEMAKER_PROCESSES}
        return 0
}

let sleep0=$SLEEPINTERVALL/2
case "$1" in
'start')
        start0
        ;;
'restart')
        stop0
        start0
        ;;
'stop')
        stop0
        ;;
*)
        echo "Usage: $0 { start | stop | restart }"
        exit 1
        ;;
esac
exit 0
###############################
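
The script above defines SLEEPCOUNT and SLEPT but never uses them. One possible use, shown here only as a sketch, is to poll for pacemakerd to exit instead of sleeping for a fixed interval. The helper name wait_until_gone is hypothetical and not part of the script we run.

```shell
#!/usr/bin/bash
# Hypothetical helper, not part of the original start script.
# Usage: wait_until_gone ATTEMPTS INTERVAL COMMAND [ARGS...]
# Runs COMMAND repeatedly; COMMAND must succeed while the process is
# still alive (for example a pgrep call). Returns 0 as soon as COMMAND
# fails, or 1 if it still succeeds after ATTEMPTS sleeps of INTERVAL seconds.
wait_until_gone() {
    attempts=$1
    interval=$2
    shift 2
    tries=0
    while "$@" >/dev/null 2>&1; do
        if [ "$tries" -ge "$attempts" ]; then
            return 1    # still running after the last attempt
        fi
        sleep "$interval"
        tries=$((tries + 1))
    done
    return 0            # nothing left to wait for
}
```

In stop0 this could replace the fixed sleep after the --shutdown call, for example:
wait_until_gone $SLEEPCOUNT $SLEEPINTERVALL pgrep -U ${CLUSTER_USER} -f ${PACEMAKERD}
The forced kill would then only matter when the graceful shutdown takes too long.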

Andreas


-----Original Message-----
From: Andrei Belov [mailto:defanator at gmail.com] 
Sent: Monday, 25 March 2013 15:08
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] solaris problem


Ok, I fixed this issue with the following patch against libqb 0.14.4:

--- lib/unix.c.orig     2013-03-25 12:30:50.445762231 +0000
+++ lib/unix.c  2013-03-25 12:49:59.322276376 +0000
@@ -83,7 +83,7 @@
 #if defined(QB_LINUX) || defined(QB_CYGWIN)
                snprintf(path, PATH_MAX, "/dev/shm/%s", file);
 #else
-               snprintf(path, PATH_MAX, LOCALSTATEDIR "/run/%s", file);
+               snprintf(path, PATH_MAX, "%s/%s", SOCKETDIR, file);
                is_absolute = path;
 #endif
        }
@@ -91,7 +91,7 @@
        if (fd < 0 && !is_absolute) {
                qb_util_perror(LOG_ERR, "couldn't open file %s", path);
 
-               snprintf(path, PATH_MAX, LOCALSTATEDIR "/run/%s", file);
+               snprintf(path, PATH_MAX, "%s/%s", SOCKETDIR, file);
                fd = open_mmap_file(path, file_flags);
                if (fd < 0) {
                        res = -errno;


libqb was configured with --with-socket-dir=/var/run/qb, /var/run/qb owned by hacluster:haclient - this configuration works fine with both corosync 2.3.0 and pacemaker 1.1.8.
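
On systems where /var/run is cleared at boot (worth checking on Solaris, where it is usually a memory-backed mount), the directory has to be recreated with the right ownership before corosync and pacemakerd start. A minimal sketch of that step follows; prepare_socket_dir is a hypothetical helper name, and only the path and the hacluster:haclient ownership come from the setup described above.

```shell
#!/usr/bin/bash
# Hypothetical helper, not taken from the thread.
# Creates the socket directory if needed and, when an owner is given,
# hands it to that user:group (chown normally requires root).
prepare_socket_dir() {
    dir=$1
    owner=$2        # e.g. hacluster:haclient; may be empty to skip chown
    mkdir -p "$dir" || return 1
    if [ -n "$owner" ]; then
        chown "$owner" "$dir" || return 1
    fi
    return 0
}
```

A start script could call it as root before launching the daemons:
prepare_socket_dir /var/run/qb hacluster:haclient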

Though I'm not sure that libqb is the right place to touch - maybe it'd be better to add some enhancements to pacemaker's lib/common/mainloop.c,
mainloop_add_ipc_server() ?


Cheers.


On Mar 25, 2013, at 16:01 , Andrei Belov <defanator at gmail.com> wrote:

> 
> I've rebuilt libqb using a separate SOCKETDIR (/var/run/qb), and set hacluster:haclient ownership on this directory.
> 
> After that, pacemakerd started successfully with all its children:
> 
> [root at ha1 /var/run/qb]# pacemakerd -fV
> Could not establish pacemakerd connection: Connection refused (146)
>    info: crm_ipc_connect:      Could not establish pacemakerd connection: Connection refused (146)
>    info: get_cluster_type:     Detected an active 'corosync' cluster
>    info: read_config:  Reading configure for stack: corosync
>  notice: crm_add_logfile:      Additional logging available in /var/log/cluster/corosync.log
>  notice: main:         Starting Pacemaker 1.1.8 (Build: 1f8858c):  ncurses libqb-logging libqb-ipc upstart systemd  corosync-native
>    info: main:         Maximum core file size is: 18446744073709551613
>    info: qb_ipcs_us_publish:   server name: pacemakerd
>  notice: update_node_processes:        48de70 Node 182452614 now known as ha1, was: 
>    info: start_child:  Forked child 60719 for process cib
>    info: start_child:  Forked child 60720 for process stonith-ng
>    info: start_child:  Forked child 60721 for process lrmd
>    info: start_child:  Forked child 60722 for process attrd
>    info: start_child:  Forked child 60723 for process pengine
>    info: start_child:  Forked child 60724 for process crmd
>    info: main:         Starting mainloop
> 
> [root at ha1 /var/run/qb]# ls -l
> total 0
> srwxrwxrwx 1 hacluster root 0 Mar 25 11:50 attrd
> srwxrwxrwx 1 root      root 0 Mar 25 11:43 cfg
> srwxrwxrwx 1 hacluster root 0 Mar 25 11:50 cib_ro
> srwxrwxrwx 1 hacluster root 0 Mar 25 11:50 cib_rw
> srwxrwxrwx 1 hacluster root 0 Mar 25 11:50 cib_shm
> srwxrwxrwx 1 root      root 0 Mar 25 11:43 cmap
> srwxrwxrwx 1 root      root 0 Mar 25 11:43 cpg
> srwxrwxrwx 1 root      root 0 Mar 25 11:50 lrmd
> srwxrwxrwx 1 root      root 0 Mar 25 11:50 pacemakerd
> srwxrwxrwx 1 hacluster root 0 Mar 25 11:50 pengine
> srwxrwxrwx 1 root      root 0 Mar 25 11:43 quorum
> srwxrwxrwx 1 root      root 0 Mar 25 11:50 stonith-ng
> 
> However, libqb still can not create some files in /var/run due to insufficient permissions:
> 
> Mar 25 11:50:45 [60719]        cib:     info: init_cs_connection_once:  Connection to 'corosync': established
> Mar 25 11:50:45 [60719]        cib:     info: crm_get_peer:     Node 182452614 is now known as ha1
> Mar 25 11:50:45 [60719]        cib:     info: crm_get_peer:     Node 182452614 has uuid 182452614
> Mar 25 11:50:45 [60719]        cib:     info: qb_ipcs_us_publish:       server name: cib_ro
> Mar 25 11:50:45 [60719]        cib:     info: qb_ipcs_us_publish:       server name: cib_rw
> Mar 25 11:50:45 [60719]        cib:     info: qb_ipcs_us_publish:       server name: cib_shm
> Mar 25 11:50:45 [60719]        cib:     info: cib_init:         Starting cib mainloop
> Mar 25 11:50:45 [60719]        cib:     info: pcmk_cpg_membership:      Joined[0.0] cib.182452614 
> Mar 25 11:50:45 [60719]        cib:     info: pcmk_cpg_membership:      Member[0.0] cib.182452614 
> Mar 25 11:50:45 [60719]        cib:     info: pcmk_cpg_membership:      Member[0.1] cib.182452614 
> Mar 25 11:50:46 [60719]        cib:    error: qb_sys_mmap_file_open:    couldn't open file /var/run/qb-cib_rw-control-60719-60720-15: Permission denied (13)
> Mar 25 11:50:46 [60719]        cib:    error: qb_ipcs_us_connect:       couldn't create file for mmap (60719-60720-15): Permission denied (13)
> Mar 25 11:50:46 [60719]        cib:    error: handle_new_connection:    Invalid IPC credentials (60719-60720-15).
> Mar 25 11:50:46 [60720] stonith-ng:     info: crm_ipc_connect:  Could not establish cib_rw connection: Permission denied (13)
> Mar 25 11:50:46 [60719]        cib:    error: qb_sys_mmap_file_open:    couldn't open file /var/run/qb-cib_shm-control-60719-60724-16: Permission denied (13)
> Mar 25 11:50:46 [60719]        cib:    error: qb_ipcs_us_connect:       couldn't create file for mmap (60719-60724-16): Permission denied (13)
> Mar 25 11:50:46 [60719]        cib:    error: handle_new_connection:    Invalid IPC credentials (60719-60724-16).
> Mar 25 11:50:46 [60724]       crmd:     info: crm_ipc_connect:  Could not establish cib_shm connection: Permission denied (13)
> Mar 25 11:50:46 [60724]       crmd:     info: do_cib_control:   Could not connect to the CIB service: Transport endpoint is not connected
> Mar 25 11:50:46 [60724]       crmd:  warning: do_cib_control:   Couldn't complete CIB registration 1 times... pause and retry
> 
> 
> If someone has a working setup on Linux with corosync 2.x, libqb and pacemaker 1.1.x, I'd very much appreciate some information about the locations libqb uses for its special socket files.
> 
> Thanks in advance!
> 
> (Can we say now that this problem is libqb-related, not pacemaker?)
> 
> 
> 
> On Mar 25, 2013, at 15:30 , Andrei Belov <defanator at gmail.com> wrote:
> 
>> Andreas,
>> 
>> just tried "PCMK_ipc_type=socket pacemaker -fV" - a bunch of additional "event_send" errors appeared:
>> 
>> Mar 25 11:15:55 [33641] ha1 corosync error   [MAIN  ] event_send retuned -32, expected 256!
>> Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, expected 217!
>> Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, expected 219!
>> Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, expected 223!
>> Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, expected 223!
>> Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, expected 223!
>> Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, expected 223!
>> Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, expected 223!
>> Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, expected 223!
>> Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, expected 256!
>> Mar 25 11:15:55 [53980]    pengine:    error: qb_ipcs_us_publish:       Could not bind AF_UNIX (/var/run/pengine): Permission denied (13)
>> Mar 25 11:15:55 [53980]    pengine:    error: mainloop_add_ipc_server:  Could not start pengine IPC server: Unknown error (-13)
>> Mar 25 11:15:55 [53980]    pengine:    error: main:     Couldn't start IPC server
>> Mar 25 11:15:55 [53975] pacemakerd:    error: pcmk_child_exit:  Child process pengine exited (pid=53980, rc=1)
>> Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, expected 256!
>> Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, expected 223!
>> Mar 25 11:15:55 [53979]      attrd:    error: qb_ipcs_us_publish:       Could not bind AF_UNIX (/var/run/attrd): Permission denied (13)
>> Mar 25 11:15:55 [53979]      attrd:    error: mainloop_add_ipc_server:  Could not start attrd IPC server: Unknown error (-13)
>> Mar 25 11:15:55 [53979]      attrd:    error: main:     Could not start IPC server
>> Mar 25 11:15:55 [53979]      attrd:    error: main:     Aborting startup
>> Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, expected 223!
>> Mar 25 11:15:55 [53975] pacemakerd:    error: pcmk_child_exit:  Child process attrd exited (pid=53979, rc=100)
>> Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, expected 223!
>> Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, expected 223!
>> Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, expected 223!
>> Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, expected 223!
>> Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, expected 256!
>> Mar 25 11:15:55 [53976]        cib:    error: qb_ipcs_us_publish:       Could not bind AF_UNIX (/var/run/cib_ro): Permission denied (13)
>> Mar 25 11:15:55 [53976]        cib:    error: mainloop_add_ipc_server:  Could not start cib_ro IPC server: Unknown error (-13)
>> Mar 25 11:15:55 [53976]        cib:    error: qb_ipcs_us_publish:       Could not bind AF_UNIX (/var/run/cib_rw): Permission denied (13)
>> Mar 25 11:15:55 [53976]        cib:    error: mainloop_add_ipc_server:  Could not start cib_rw IPC server: Unknown error (-13)
>> Mar 25 11:15:55 [53976]        cib:    error: qb_ipcs_us_publish:       Could not bind AF_UNIX (/var/run/cib_shm): Permission denied (13)
>> Mar 25 11:15:55 [53976]        cib:    error: mainloop_add_ipc_server:  Could not start cib_shm IPC server: Unknown error (-13)
>> Mar 25 11:15:55 [53976]        cib:    error: cib_init:         Couldnt start all IPC channels, exiting.
>> Mar 25 11:15:55 [53975] pacemakerd:    error: pcmk_child_exit:  Child process cib exited (pid=53976, rc=255)
>> Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, expected 223!
>> Mar 25 11:16:04 [53977] stonith-ng:    error: setup_cib:        Could not connect to the CIB service: -134 fffffd7fc421a0b0
>> Mar 25 11:16:04 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, expected 217!
>> Mar 25 11:16:04 [53975] pacemakerd:   notice: pcmk_shutdown_worker:     Attempting to inhibit respawning after fatal error
>> 
>> 
>> # fgrep 32 /usr/include/sys/errno.h 
>> #define EPIPE   32      /* Broken pipe                          */
>> 
>> 
>> 
>> On Mar 25, 2013, at 13:55 , "Grüninger, Andreas (LGL Extern)" <Andreas.Grueninger at lgl.bwl.de> wrote:
>> 
>>> With Solaris/OpenIndiana you should use this setting:
>>> export PCMK_ipc_type=socket
>>> 
>>> Andreas
>>> 
>>> -----Original Message-----
>>> From: Andrei Belov [mailto:defanator at gmail.com]
>>> Sent: Monday, 25 March 2013 10:43
>>> To: pacemaker at oss.clusterlabs.org
>>> Subject: [Pacemaker] solaris problem
>>> 
>>> Hi folks,
>>> 
>>> I'm trying to build a test HA cluster on Solaris 5.11 using libqb 0.14.4, corosync 2.3.0 and pacemaker 1.1.8, and I'm facing a strange problem while starting pacemaker.
>>> 
>>> Log shows the following errors:
>>> 
>>> Mar 25 09:21:26 [33720]       lrmd:    error: mainloop_add_ipc_server:  Could not start lrmd IPC server: Unknown error (-48)
>>> Mar 25 09:21:26 [33720]       lrmd:    error: try_server_create:        New IPC server could not be created because another lrmd process exists, sending shutdown command to old lrmd process.
>>> Mar 25 09:21:26 [33720]       lrmd:    error: mainloop_add_ipc_server:  Could not start lrmd IPC server: Unknown error (-48)
>>> Mar 25 09:21:26 [33720]       lrmd:    error: try_server_create:        New IPC server could not be created because another lrmd process exists, sending shutdown command to old lrmd process.
>>> Mar 25 09:21:26 [33720]       lrmd:    error: mainloop_add_ipc_server:  Could not start lrmd IPC server: Unknown error (-48)
>>> Mar 25 09:21:26 [33720]       lrmd:    error: try_server_create:        New IPC server could not be created because another lrmd process exists, sending shutdown command to old lrmd process.
>>> Mar 25 09:21:26 [33720]       lrmd:    error: mainloop_add_ipc_server:  Could not start lrmd IPC server: Unknown error (-48)
>>> Mar 25 09:21:26 [33720]       lrmd:    error: try_server_create:        New IPC server could not be created because another lrmd process exists, sending shutdown command to old lrmd process.
>>> Mar 25 09:21:26 [33720]       lrmd:    error: mainloop_add_ipc_server:  Could not start lrmd IPC server: Unknown error (-48)
>>> Mar 25 09:21:26 [33720]       lrmd:    error: try_server_create:        New IPC server could not be created because another lrmd process exists, sending shutdown command to old lrmd process.
>>> Mar 25 09:21:26 [33720]       lrmd:    error: mainloop_add_ipc_server:  Could not start lrmd IPC server: Unknown error (-48)
>>> Mar 25 09:21:26 [33720]       lrmd:    error: try_server_create:        New IPC server could not be created because another lrmd process exists, sending shutdown command to old lrmd process.
>>> Mar 25 09:21:26 [33720]       lrmd:    error: mainloop_add_ipc_server:  Could not start lrmd IPC server: Unknown error (-48)
>>> Mar 25 09:21:26 [33720]       lrmd:    error: try_server_create:        New IPC server could not be created because another lrmd process exists, sending shutdown command to old lrmd process.
>>> Mar 25 09:21:26 [33720]       lrmd:    error: mainloop_add_ipc_server:  Could not start lrmd IPC server: Unknown error (-48)
>>> Mar 25 09:21:26 [33720]       lrmd:    error: try_server_create:        New IPC server could not be created because another lrmd process exists, sending shutdown command to old lrmd process.
>>> Mar 25 09:21:26 [33720]       lrmd:    error: mainloop_add_ipc_server:  Could not start lrmd IPC server: Unknown error (-48)
>>> Mar 25 09:21:26 [33720]       lrmd:    error: try_server_create:        New IPC server could not be created because another lrmd process exists, sending shutdown command to old lrmd process.
>>> Mar 25 09:21:26 [33720]       lrmd:    error: mainloop_add_ipc_server:  Could not start lrmd IPC server: Unknown error (-48)
>>> Mar 25 09:21:26 [33720]       lrmd:    error: try_server_create:        New IPC server could not be created because another lrmd process exists, sending shutdown command to old lrmd process.
>>> Mar 25 09:21:26 [33720]       lrmd:    error: main:     Failed to allocate lrmd server.  shutting down
>>> Mar 25 09:21:26 [33722]    pengine:    error: mainloop_add_ipc_server:  Could not start pengine IPC server: Unknown error (-48)
>>> Mar 25 09:21:26 [33722]    pengine:    error: main:     Couldn't start IPC server
>>> Mar 25 09:21:26 [33717] pacemakerd:    error: pcmk_child_exit:  Child process lrmd exited (pid=33720, rc=255)
>>> Mar 25 09:21:26 [33721]      attrd:    error: qb_ipcs_us_publish:       Could not bind AF_UNIX (/var/run/attrd): Permission denied (13)
>>> Mar 25 09:21:26 [33721]      attrd:    error: mainloop_add_ipc_server:  Could not start attrd IPC server: Unknown error (-13)
>>> Mar 25 09:21:26 [33721]      attrd:    error: main:     Could not start IPC server
>>> Mar 25 09:21:26 [33721]      attrd:    error: main:     Aborting startup
>>> Mar 25 09:21:26 [33717] pacemakerd:    error: pcmk_child_exit:  Child process pengine exited (pid=33722, rc=1)
>>> Mar 25 09:21:26 [33717] pacemakerd:    error: pcmk_child_exit:  Child process attrd exited (pid=33721, rc=100)
>>> Mar 25 09:21:26 [33718]        cib:    error: qb_ipcs_us_publish:       Could not bind AF_UNIX (/var/run/cib_ro): Permission denied (13)
>>> Mar 25 09:21:26 [33718]        cib:    error: mainloop_add_ipc_server:  Could not start cib_ro IPC server: Unknown error (-13)
>>> Mar 25 09:21:26 [33718]        cib:    error: qb_ipcs_us_publish:       Could not bind AF_UNIX (/var/run/cib_rw): Permission denied (13)
>>> Mar 25 09:21:26 [33718]        cib:    error: mainloop_add_ipc_server:  Could not start cib_rw IPC server: Unknown error (-13)
>>> Mar 25 09:21:26 [33718]        cib:    error: mainloop_add_ipc_server:  Could not start cib_shm IPC server: Unknown error (-48)
>>> Mar 25 09:21:26 [33718]        cib:    error: cib_init:         Couldnt start all IPC channels, exiting.
>>> Mar 25 09:21:26 [33717] pacemakerd:    error: pcmk_child_exit:  Child process cib exited (pid=33718, rc=255)
>>> Mar 25 09:21:35 [33719] stonith-ng:    error: setup_cib:        Could not connect to the CIB service: -134 fffffd7fc421a0b0
>>> Mar 25 09:21:35 [33717] pacemakerd:   notice: pcmk_shutdown_worker:     Attempting to inhibit respawning after fatal error
>>> 
>>> Full log (in case of any things I've probably missed) is attached.
>>> 
>>> I'd like to know the reason for "unknown error (-48)". On this system, 48 in errno.h is "ENOTSUP", but I haven't found the exact place in the code where this may happen (so I'm not sure about that).
>>> 
>>> Just for the record: I'm able to run corosync on two nodes and see them connected without any visible problems, so I suppose there may be something wrong with either pacemaker or libqb.
>>> 
>>> Any help will be greatly appreciated!
>>> 
>>> Thanks,
>>> Andrei.
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org 
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>> 
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>> 
> 

