[Pacemaker] Node doesn't rejoin automatically after reboot - POSSIBLE CAUSE
Bob Haxo
bhaxo at sgi.com
Fri Jan 14 15:59:54 UTC 2011
> Were there (m)any logs containing the text "crm_abort" ...
Sorry Andrew,
Since I'm testing installations, all of the nodes in the cluster have
been installed several times since I solved this issue, and the original
log files are gone.
I did not see "crm_abort" logged; otherwise I would have captured the
messages in my notes.
To be certain, I searched my notes, and I searched the scrollback of all
of the windows in which I had been tailing the messages files, without
finding a single instance of the string "crm_abort". Some logging also
goes to the headnode of these HA clusters, but there is no "crm_abort"
there either.
Are there (by default) any logs other than in /var/log?
Bob Haxo
On Fri, 2011-01-14 at 13:50 +0100, Andrew Beekhof wrote:
> On Thu, Jan 13, 2011 at 9:31 PM, Bob Haxo <bhaxo at sgi.com> wrote:
> > Hi Tom (and Andrew),
> >
> > I figured out an easy fix for the problem that I encountered. However,
> > there would seem to be a problem lurking in the code.
>
> Were there (m)any logs containing the text "crm_abort" from the PE in
> your history (on the bad node)?
> That's the only way I can imagine so many copies of that file being open.
>
> >
> > Here is what I found. On one of the servers that was online and hosting
> > resources:
> >
> > r2lead1:~ # netstat -a | grep crm
> > Proto RefCnt Flags       Type       State         I-Node  Path
> > unix  2      [ ACC ]     STREAM     LISTENING     18659   /var/run/crm/st_command
> > unix  2      [ ACC ]     STREAM     LISTENING     18826   /var/run/crm/cib_rw
> > unix  2      [ ACC ]     STREAM     LISTENING     19373   /var/run/crm/crmd
> > unix  2      [ ACC ]     STREAM     LISTENING     18675   /var/run/crm/attrd
> > unix  2      [ ACC ]     STREAM     LISTENING     18694   /var/run/crm/pengine
> > unix  2      [ ACC ]     STREAM     LISTENING     18824   /var/run/crm/cib_callback
> > unix  2      [ ACC ]     STREAM     LISTENING     18825   /var/run/crm/cib_ro
> > unix  2      [ ACC ]     STREAM     LISTENING     18662   /var/run/crm/st_callback
> > unix  3      [ ]         STREAM     CONNECTED     20659   /var/run/crm/cib_callback
> > unix  3      [ ]         STREAM     CONNECTED     20656   /var/run/crm/cib_rw
> > unix  3      [ ]         STREAM     CONNECTED     19952   /var/run/crm/attrd
> > unix  3      [ ]         STREAM     CONNECTED     19944   /var/run/crm/st_callback
> > unix  3      [ ]         STREAM     CONNECTED     19941   /var/run/crm/st_command
> > unix  3      [ ]         STREAM     CONNECTED     19359   /var/run/crm/cib_callback
> > unix  3      [ ]         STREAM     CONNECTED     19356   /var/run/crm/cib_rw
> > unix  3      [ ]         STREAM     CONNECTED     19353   /var/run/crm/cib_callback
> > unix  3      [ ]         STREAM     CONNECTED     19350   /var/run/crm/cib_rw
> >
> > On the node that was failing to join the HA cluster, this command
> > returned nothing.
> >
> > However, on one of the functioning servers the above stream information
> > was returned, but included an additional ** 941 ** instances of the
> > following (with different I-Node numbers):
> >
> > unix  3      [ ]         STREAM     CONNECTED     1238243  /var/run/crm/pengine
> > unix  3      [ ]         STREAM     CONNECTED     1237524  /var/run/crm/pengine
> > unix  3      [ ]         STREAM     CONNECTED     1236698  /var/run/crm/pengine
> > unix  3      [ ]         STREAM     CONNECTED     1235930  /var/run/crm/pengine
> > unix  3      [ ]         STREAM     CONNECTED     1235094  /var/run/crm/pengine
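[Editor's note: a quick way to spot this condition is to count the CONNECTED pengine entries in the netstat output. The pipeline below is a minimal sketch; the sample data stands in for real `netstat -a` output on an affected node.]

```shell
# Count leaked pengine socket entries. The sample below substitutes for
# `netstat -a` output; on a live node you would pipe netstat directly.
sample='unix  3      [ ]         STREAM     CONNECTED     1238243  /var/run/crm/pengine
unix  3      [ ]         STREAM     CONNECTED     1237524  /var/run/crm/pengine
unix  2      [ ACC ]     STREAM     LISTENING     18694    /var/run/crm/pengine'
# Only CONNECTED (leaked client) entries count; the LISTENING socket is normal.
printf '%s\n' "$sample" | grep -c 'CONNECTED.*/var/run/crm/pengine'
```

A healthy node shows one LISTENING pengine socket and few or no CONNECTED ones; hundreds of CONNECTED entries indicate the leak described above.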
> >
> > Here is how I corrected the situation:
> >
> > "service openais stop" on the system with the 941 pengine streams, then
> > "service openais restart" on the server that was failing to join the HA
> > cluster.
> >
> > Results:
> >
> > The previously failing server joined the HA cluster and supports
> > migration of resources to that server.
> >
> > Running "service openais start" on the server that had had the 941
> > pengine streams brought that node back online as well.
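[Editor's note: the recovery sequence above can be summarized as follows. This is a dry-run sketch: the `run` helper just echoes each command so the script is safe to execute anywhere; on the real nodes the commands would be run directly, as root, on the node named in each comment.]

```shell
# Dry-run sketch of the recovery sequence; `run` only echoes.
run() { echo "+ $*"; }
run service openais stop      # on the node holding the 941 pengine streams
run service openais restart   # on the node that was failing to join
run service openais start     # finally, back on the first node
```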
> >
> > Regards,
> > Bob Haxo
> >
> > On Thu, 2011-01-13 at 11:15 -0800, Bob Haxo wrote:
> >> So, Tom ...how do you get the failed node online?
> >>
> >> I've re-installed with the same image that is running on three other
> >> nodes, but the node still fails to join. This node was quite happy for
> >> the past 3 months. As I'm testing installs, this and other nodes have
> >> been installed a significant number of times without this sort of
> >> failure. I'd whack the whole HA cluster ... except that I don't want to
> >> run into this failure again without a better solution than "reinstall
> >> the system" ;-)
> >>
> >> I'm looking at the information returned with corosync debug enabled.
> >> After startup, everything looks fine to me until hitting this apparent
> >> local IPC delivery failure:
> >>
> >> Jan 13 10:09:10 corosync [TOTEM ] Delivering 2 to 3
> >> Jan 13 10:09:10 corosync [TOTEM ] Delivering MCAST message with seq 3 to pending delivery queue
> >> Jan 13 10:09:10 corosync [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> >> Jan 13 10:09:10 corosync [pcmk ] Msg[6486] (dest=local:crmd, from=r1lead1:crmd.11229, remote=true, size=181): <create_request_adv origin="post_cache_update" t="crmd" version="3.0.2" subt="request" ref
> >> Jan 13 10:09:10 corosync [TOTEM ] mcasted message added to pending queue
> >>
> >> Guess that I'll have to renew my acquaintance with ipc.
> >>
> >> Bob Haxo
> >>
> >>
> >>
> >> On Thu, 2011-01-13 at 19:17 +0100, Tom Tux wrote:
> >> > I don't know. I still have this issue (and it seems that I'm not the
> >> > only one...). I'll check whether there are pacemaker updates available
> >> > through the zypper update channel (SLES11-SP1).
> >> >
> >> > Regards,
> >> > Tom
> >> >
> >> >
> >> > 2011/1/13 Bob Haxo <bhaxo at sgi.com>:
> >> > > Tom, others,
> >> > >
> >> > > Please, what was the solution to this issue?
> >> > >
> >> > > Thanks,
> >> > > Bob Haxo
> >> > >
> >> > > On Mon, 2010-09-06 at 09:50 +0200, Tom Tux wrote:
> >> > >
> >> > > Yes, corosync is running after the reboot. It comes up with the
> >> > > regular init-procedure (runlevel 3 in my case).
> >> > >
> >> > > 2010/9/6 Andrew Beekhof <andrew at beekhof.net>:
> >> > >> On Mon, Sep 6, 2010 at 7:57 AM, Tom Tux <tomtux80 at gmail.com> wrote:
> >> > >>> No, I don't have such failed-messages. In my case, the "Connection to
> >> > >>> our AIS plugin" was established.
> >> > >>>
> >> > >>> The /dev/shm is also not full.
> >> > >>
> >> > >> Is corosync running?
> >> > >>
> >> > >>> Kind regards,
> >> > >>> Tom
> >> > >>>
> >> > >>> 2010/9/3 Michael Smith <msmith at cbnco.com>:
> >> > >>>> Tom Tux wrote:
> >> > >>>>
> >> > >>>>> If I remove one cluster node (node01) from the cluster for
> >> > >>>>> maintenance purposes (/etc/init.d/openais stop) and reboot it,
> >> > >>>>> the node will not rejoin the cluster automatically. After the
> >> > >>>>> reboot, I have the following error and warning messages in the
> >> > >>>>> log:
> >> > >>>>>
> >> > >>>>> Sep 3 07:34:15 node01 mgmtd: [9202]: info: login to cib failed: live
> >> > >>>>
> >> > >>>> Do you have messages like this, too?
> >> > >>>>
> >> > >>>> Aug 30 15:48:10 xen-test1 corosync[5851]: [IPC ] Invalid IPC
> >> > >>>> credentials.
> >> > >>>> Aug 30 15:48:10 xen-test1 cib: [5858]: info: init_ais_connection:
> >> > >>>> Connection to our AIS plugin (9) failed: unknown (100)
> >> > >>>>
> >> > >>>> Aug 30 15:48:10 xen-test1 cib: [5858]: CRIT: cib_init: Cannot sign in to
> >> > >>>> the cluster... terminating
> >> > >>>>
> >> > >>>>
> >> > >>>>
> >> > >>>> http://news.gmane.org/find-root.php?message_id=%3c4C7C0EC7.2050708%40cbnco.com%3e
> >> > >>>>
> >> > >>>> Mike
> >> > >>>>
> >> > >>>> _______________________________________________
> >> > >>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> >> > >>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >> > >>>>
> >> > >>>> Project Home: http://www.clusterlabs.org
> >> > >>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >> > >>>> Bugs:
> >> > >>>>
> >> > >>>> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
> >> > >>>>
> >> > >>>
> >> > >>
> >> > >
> >
> >
>