[Pacemaker] Node doesn't rejoin automatically after reboot - POSSIBLE CAUSE
Bob Haxo
bhaxo at sgi.com
Thu Jan 13 20:31:42 UTC 2011
Hi Tom (and Andrew),
I figured out an easy fix for the problem that I encountered. However,
there would seem to be a problem lurking in the code.
Here is what I found. On one of the servers that was online and hosting
resources:
r2lead1:~ # netstat -a | grep crm
Proto RefCnt Flags Type State I-Node Path
unix 2 [ ACC ] STREAM LISTENING 18659 /var/run/crm/st_command
unix 2 [ ACC ] STREAM LISTENING 18826 /var/run/crm/cib_rw
unix 2 [ ACC ] STREAM LISTENING 19373 /var/run/crm/crmd
unix 2 [ ACC ] STREAM LISTENING 18675 /var/run/crm/attrd
unix 2 [ ACC ] STREAM LISTENING 18694 /var/run/crm/pengine
unix 2 [ ACC ] STREAM LISTENING 18824 /var/run/crm/cib_callback
unix 2 [ ACC ] STREAM LISTENING 18825 /var/run/crm/cib_ro
unix 2 [ ACC ] STREAM LISTENING 18662 /var/run/crm/st_callback
unix 3 [ ] STREAM CONNECTED 20659 /var/run/crm/cib_callback
unix 3 [ ] STREAM CONNECTED 20656 /var/run/crm/cib_rw
unix 3 [ ] STREAM CONNECTED 19952 /var/run/crm/attrd
unix 3 [ ] STREAM CONNECTED 19944 /var/run/crm/st_callback
unix 3 [ ] STREAM CONNECTED 19941 /var/run/crm/st_command
unix 3 [ ] STREAM CONNECTED 19359 /var/run/crm/cib_callback
unix 3 [ ] STREAM CONNECTED 19356 /var/run/crm/cib_rw
unix 3 [ ] STREAM CONNECTED 19353 /var/run/crm/cib_callback
unix 3 [ ] STREAM CONNECTED 19350 /var/run/crm/cib_rw
On the node that was failing to join the HA cluster, this command
returned nothing.
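As a side note (my suggestion, not a step from the original diagnosis):
if netstat shows nothing at all, it may also be worth checking whether
the Pacemaker daemons are running and whether their IPC sockets exist,
for example:

  ps -e | egrep 'corosync|crmd|cib|pengine|attrd|stonith|lrmd'
  ls -l /var/run/crm/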
However, on one of the functioning servers the command returned the
stream information shown above, plus an additional ** 941 ** instances
of the following (with different I-Node numbers):
unix 3 [ ] STREAM CONNECTED 1238243 /var/run/crm/pengine
unix 3 [ ] STREAM CONNECTED 1237524 /var/run/crm/pengine
unix 3 [ ] STREAM CONNECTED 1236698 /var/run/crm/pengine
unix 3 [ ] STREAM CONNECTED 1235930 /var/run/crm/pengine
unix 3 [ ] STREAM CONNECTED 1235094 /var/run/crm/pengine
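Another quick check (again a suggestion of mine, not something I ran at
the time): the leaked streams are easy to count on each node, and on a
healthy node the count should be zero or very small:

  netstat -a | grep '/var/run/crm/pengine' | grep -c CONNECTED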
Here is how I corrected the situation: I ran "service openais stop" on
the system with the 941 pengine streams, then "service openais restart"
on the server that was failing to join the HA cluster.
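Spelled out as a sequence (assuming the stock SLES openais init script;
the comments are mine):

  # On the functioning server that had accumulated the ~941 leaked
  # pengine streams:
  service openais stop

  # On the server that was failing to join the HA cluster:
  service openais restart

  # Once the previously failing server has rejoined and can host
  # resources, bring the stopped server back:
  service openais start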
Results:
The previously failing server joined the HA cluster and now supports
migration of resources to it. Running "service openais start" on the
server that had had the 941 pengine streams brought that node back
online as well.
Regards,
Bob Haxo
On Thu, 2011-01-13 at 11:15 -0800, Bob Haxo wrote:
> So, Tom ... how do you get the failed node online?
>
> I've re-installed with the same image that is running on three other
> nodes, but it still fails. This node was quite happy for the past 3
> months. As I've been testing installs, this and other nodes have been
> installed a significant number of times without this sort of failure.
> I'd whack the whole HA cluster ... except that I don't want to run into
> this failure again without a better solution than "reinstall the
> system" ;-)
>
> I'm looking at the information returned with corosync debug enabled.
> After startup, everything looks fine to me until hitting this apparent
> local IPC delivery failure:
>
> Jan 13 10:09:10 corosync [TOTEM ] Delivering 2 to 3
> Jan 13 10:09:10 corosync [TOTEM ] Delivering MCAST message with seq 3 to pending delivery queue
> Jan 13 10:09:10 corosync [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> Jan 13 10:09:10 corosync [pcmk ] Msg[6486] (dest=local:crmd, from=r1lead1:crmd.11229, remote=true, size=181): <create_request_adv origin="post_cache_update" t="crmd" version="3.0.2" subt="request" ref
> Jan 13 10:09:10 corosync [TOTEM ] mcasted message added to pending queue
>
> Guess that I'll have to renew my acquaintance with ipc.
>
> Bob Haxo
>
>
>
> On Thu, 2011-01-13 at 19:17 +0100, Tom Tux wrote:
> > I don't know. I still have this issue (and it seems that I'm not the
> > only one...). I'll have a look to see whether there are Pacemaker
> > updates available through the zypper update channel (SLES11 SP1).
> >
> > Regards,
> > Tom
> >
> >
> > 2011/1/13 Bob Haxo <bhaxo at sgi.com>:
> > > Tom, others,
> > >
> > > Please, what was the solution to this issue?
> > >
> > > Thanks,
> > > Bob Haxo
> > >
> > > On Mon, 2010-09-06 at 09:50 +0200, Tom Tux wrote:
> > >
> > > Yes, corosync is running after the reboot. It comes up through the
> > > regular init procedure (runlevel 3 in my case).
> > >
> > > 2010/9/6 Andrew Beekhof <andrew at beekhof.net>:
> > >> On Mon, Sep 6, 2010 at 7:57 AM, Tom Tux <tomtux80 at gmail.com> wrote:
> > >>> No, I don't have such failure messages. In my case, the "Connection to
> > >>> our AIS plugin" was established.
> > >>>
> > >>> The /dev/shm is also not full.
> > >>
> > >> Is corosync running?
> > >>
> > >>> Kind regards,
> > >>> Tom
> > >>>
> > >>> 2010/9/3 Michael Smith <msmith at cbnco.com>:
> > >>>> Tom Tux wrote:
> > >>>>
> > >>>>> If I take one cluster node (node01) out of the cluster for maintenance
> > >>>>> (/etc/init.d/openais stop) and reboot this node, it will not rejoin
> > >>>>> the cluster automatically. After the reboot, I have the
> > >>>>> following error and warning messages in the log:
> > >>>>>
> > >>>>> Sep 3 07:34:15 node01 mgmtd: [9202]: info: login to cib failed: live
> > >>>>
> > >>>> Do you have messages like this, too?
> > >>>>
> > >>>> Aug 30 15:48:10 xen-test1 corosync[5851]: [IPC ] Invalid IPC
> > >>>> credentials.
> > >>>> Aug 30 15:48:10 xen-test1 cib: [5858]: info: init_ais_connection:
> > >>>> Connection to our AIS plugin (9) failed: unknown (100)
> > >>>>
> > >>>> Aug 30 15:48:10 xen-test1 cib: [5858]: CRIT: cib_init: Cannot sign in to
> > >>>> the cluster... terminating
> > >>>>
> > >>>>
> > >>>>
> > >>>> http://news.gmane.org/find-root.php?message_id=%3c4C7C0EC7.2050708%40cbnco.com%3e
> > >>>>
> > >>>> Mike
> > >>>>