[Pacemaker] corosync/openais fails to start
Steven Dake
sdake at redhat.com
Thu May 27 17:43:07 UTC 2010
On 05/27/2010 10:20 AM, Gianluca Cecchi wrote:
> On Thu, May 27, 2010 at 5:50 PM, Steven Dake <sdake at redhat.com> wrote:
>
> On 05/27/2010 08:40 AM, Diego Remolina wrote:
>
> Is there any workaround for this? Perhaps a slightly older
> version of
> the rpms? If so where do I find those?
>
>
> Corosync 1.2.1 apparently doesn't have this issue. With corosync
> 1.2.1, please don't use the "debug: on" keyword in your config options.
> I am not sure where Andrew has corosync 1.2.1 rpms available.
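> For example, keep the logging section of /etc/corosync/corosync.conf
> at something like this (a minimal sketch; to_syslog/timestamp to taste):
>
> logging {
>         to_syslog: yes
>         debug: off
>         timestamp: on
> }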
>
> The corosync project itself doesn't release rpms. See our policy on
> this topic:
>
> http://www.corosync.org/doku.php?id=faq:release_binaries
>
> Regards
> -steve
>
>
>
> In my case, using pacemaker/corosync from the clusterlabs repo on RHEL 5.5
> 32-bit, I had:
> - both nodes ha1 and ha2 with
> [root@ha1 ~]# rpm -qa corosync\* pacemaker\*
> pacemaker-1.0.8-6.el5
> corosynclib-1.2.1-1.el5
> corosync-1.2.1-1.el5
> pacemaker-libs-1.0.8-6.el5
>
> - stop of corosync on node ha1
> - update (applying the packages proposed by the clusterlabs repo; for
> pacemaker the same version... dunno if the same bits...)
> This takes corosync to 1.2.2
> - start of corosync on ha1 and successful join with the node still
> running corosync 1.2.1
> May 27 18:59:23 ha1 corosync[5136]: [MAIN ] Corosync Cluster Engine
> exiting with status -1 at main.c:160.
> May 27 19:06:19 ha1 yum: Updated: corosynclib-1.2.2-1.1.el5.i386
> May 27 19:06:19 ha1 yum: Updated: pacemaker-libs-1.0.8-6.1.el5.i386
> May 27 19:06:19 ha1 yum: Updated: corosync-1.2.2-1.1.el5.i386
> May 27 19:06:20 ha1 yum: Updated: pacemaker-1.0.8-6.1.el5.i386
> May 27 19:06:20 ha1 yum: Updated: corosynclib-devel-1.2.2-1.1.el5.i386
> May 27 19:06:22 ha1 yum: Updated: pacemaker-libs-devel-1.0.8-6.1.el5.i386
> May 27 19:06:59 ha1 corosync[7442]: [MAIN ] Corosync Cluster Engine
> ('1.2.2'): started and ready to provide service.
> May 27 19:06:59 ha1 corosync[7442]: [MAIN ] Corosync built-in
> features: nss rdma
> May 27 19:06:59 ha1 corosync[7442]: [MAIN ] Successfully read main
> configuration file '/etc/corosync/corosync.conf'.
> May 27 19:06:59 ha1 corosync[7442]: [TOTEM ] Initializing transport
> (UDP/IP).
> May 27 19:06:59 ha1 corosync[7442]: [TOTEM ] Initializing
> transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
>
> this implies also start of resources on it (nfsclient and apache in my case)
>
> - move of resources from ha2 to the updated node ha1 (nfs-group in my
> case), then unmove so they can be taken back again (commands below)
> Resource Group: nfs-group
> lv_drbd0 (ocf::heartbeat:LVM): Started ha1
> ClusterIP (ocf::heartbeat:IPaddr2): Started ha1
> NfsFS (ocf::heartbeat:Filesystem): Started ha1
> nfssrv (ocf::heartbeat:nfsserver): Started ha1
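>
> (the actual move/unmove was just, with the crm shell from pacemaker 1.0:
>
> crm resource move nfs-group ha1
> crm resource unmove nfs-group
>
> where unmove clears the location constraint again, so the resources are
> free to run on either node)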
>
> - stop of corosync 1.2.1 on ha2
> - update of pacemaker and corosync on ha2
> - startup of corosync on ha2 and correct join to the cluster, with
> start of its resources (nfsclient and apache in my case)
> May 27 19:14:42 ha2 corosync[30954]: [pcmk ] notice: pcmk_shutdown:
> cib confirmed stopped
> May 27 19:14:42 ha2 corosync[30954]: [pcmk ] notice: stop_child: Sent
> -15 to stonithd: [30961]
> May 27 19:14:42 ha2 stonithd: [30961]: notice:
> /usr/lib/heartbeat/stonithd normally quit.
> May 27 19:14:42 ha2 corosync[30954]: [pcmk ] info: pcmk_ipc_exit:
> Client stonithd (conn=0x82aee48, async-conn=0x82aee48) left
> May 27 19:14:43 ha2 corosync[30954]: [pcmk ] notice: pcmk_shutdown:
> stonithd confirmed stopped
> May 27 19:14:43 ha2 corosync[30954]: [pcmk ] info: update_member:
> Node ha2 now has process list: 00000000000000000000000000000002 (2)
> May 27 19:14:43 ha2 corosync[30954]: [pcmk ] notice: pcmk_shutdown:
> Shutdown complete
> May 27 19:14:43 ha2 corosync[30954]: [SERV ] Service engine unloaded:
> Pacemaker Cluster Manager 1.0.8
> May 27 19:14:43 ha2 corosync[30954]: [SERV ] Service engine unloaded:
> corosync extended virtual synchrony service
> May 27 19:14:43 ha2 corosync[30954]: [SERV ] Service engine unloaded:
> corosync configuration service
> May 27 19:14:43 ha2 corosync[30954]: [SERV ] Service engine unloaded:
> corosync cluster closed process group service v1.01
> May 27 19:14:43 ha2 corosync[30954]: [SERV ] Service engine unloaded:
> corosync cluster config database access v1.01
> May 27 19:14:43 ha2 corosync[30954]: [SERV ] Service engine unloaded:
> corosync profile loading service
> May 27 19:14:43 ha2 corosync[30954]: [SERV ] Service engine unloaded:
> corosync cluster quorum service v0.1
> May 27 19:14:43 ha2 corosync[30954]: [MAIN ] Corosync Cluster Engine
> exiting with status -1 at main.c:160.
> May 27 19:15:51 ha2 yum: Updated: corosynclib-1.2.2-1.1.el5.i386
> May 27 19:15:51 ha2 yum: Updated: pacemaker-libs-1.0.8-6.1.el5.i386
> May 27 19:15:52 ha2 yum: Updated: corosync-1.2.2-1.1.el5.i386
> May 27 19:15:52 ha2 yum: Updated: pacemaker-1.0.8-6.1.el5.i386
> May 27 19:17:00 ha2 corosync[3430]: [MAIN ] Corosync Cluster Engine
> ('1.2.2'): started and ready to provide service.
> May 27 19:17:00 ha2 corosync[3430]: [MAIN ] Corosync built-in
> features: nss rdma
> May 27 19:17:00 ha2 corosync[3430]: [MAIN ] Successfully read main
> configuration file '/etc/corosync/corosync.conf'.
> May 27 19:17:00 ha2 corosync[3430]: [TOTEM ] Initializing transport
> (UDP/IP).
> May 27 19:17:00 ha2 corosync[3430]: [TOTEM ] Initializing
> transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
>
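> To recap, the per-node sequence was essentially this (a sketch; service
> and package names as shipped by the clusterlabs repo):
>
> service corosync stop                 # on the node being upgraded
> yum update corosync\* pacemaker\*
> service corosync start
> crm_mon -1                            # check the join and the resources
>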
> So in my case the software upgrade was successful, with no downtime.
>
> Gianluca
>
>
>
It appears this problem only affects some deployments. The three crash
reports I have seen showed a crash at the same spot during startup.
I'll update when a solution is available.
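If anyone hitting this can capture a backtrace, that would help pin it
down. Roughly, assuming the corosync debuginfo package is installed:

ulimit -c unlimited
corosync -f          # run in the foreground until it aborts
gdb /usr/sbin/corosync core.<pid>
(gdb) thread apply all bt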
regards
-steve
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf