[Pacemaker] Pacemaker/corosync freeze
Attila Megyeri
amegyeri at minerva-soft.com
Fri Mar 14 08:28:18 UTC 2014
Hello David,
> -----Original Message-----
> From: David Vossel [mailto:dvossel at redhat.com]
> Sent: Thursday, March 13, 2014 9:22 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>
> ----- Original Message -----
> > From: "Jan Friesse" <jfriesse at redhat.com>
> > To: "The Pacemaker cluster resource manager"
> > <pacemaker at oss.clusterlabs.org>
> > Sent: Thursday, March 13, 2014 4:03:28 AM
> > Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >
> > ...
> >
> > >>>>
> > >>>> Also can you please try to set debug: on in corosync.conf and
> > >>>> paste full corosync.log then?
> > >>>
> > >>> I set debug to on, and did a few restarts but could not reproduce
> > >>> the issue yet - will post the logs as soon as I manage to reproduce.
> > >>>
> > >>
> > >> Perfect.
> > >>
> > >> Another option you can try to set is netmtu (1200 is usually safe).
> > >
> > > Finally I was able to reproduce the issue.
> > > I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately
> > > (not when node was up again).
> > >
> > > The corosync log with debug on is available at:
> > > http://pastebin.com/kTpDqqtm
> > >
> > >
> > > To be honest, I had to wait much longer for this reproduction than
> > > before, even though there was no change in the corosync
> > > configuration - just potentially some system updates. But anyway,
> > > the issue is unfortunately still there.
> > > Previously, when this issue occurred, CPU was at 100% on all nodes -
> > > this time only on ctmgr, which was the DC...
> > >
> > > I hope you can find some useful details in the log.
> > >
> >
> > Attila,
> > what seems to be interesting is
> >
> > Configuration ERRORs found during PE processing. Please run
> > "crm_verify -L" to identify issues.
> >
> > I'm unsure how much of a problem this is, but I'm really not a
> > Pacemaker expert.
> >
> > Anyway, I have a theory about what may be happening, and it looks
> > related to IPC (and probably not to the network). But to make sure we
> > are not trying to fix an already fixed bug, can you please build:
> > - New libqb (0.17.0). There are plenty of fixes in IPC
> > - Corosync 2.3.3 (which already has plenty of IPC fixes)
>
> Yes, there was a libqb/corosync interoperation problem that showed these
> same symptoms last year. Updating to the latest corosync and libqb will likely
> resolve this.
I have upgraded all nodes to these versions and we are testing. So far no issues.
Thank you very much for your help.
Regards,
Attila
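
For reference, the two corosync.conf settings suggested earlier in the thread (debug: on and netmtu) would look roughly like the fragment below. This is only a minimal sketch, assuming an otherwise working corosync 2.x configuration: the value 1200 is the one suggested above, and everything else that a real totem or logging section needs (interfaces, log targets and so on) is omitted here rather than guessed.

```
# corosync.conf (fragment only; other required settings omitted)
totem {
    version: 2
    # Lower MTU as suggested in this thread; the corosync default is 1500
    netmtu: 1200
}

logging {
    # Verbose logging, as requested for reproducing the freeze
    debug: on
}
```

Corosync has to be restarted on each node for these settings to take effect, and debug is best switched off again once the logs have been collected.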
>
> > - And maybe also newer pacemaker
> >
> > I know you were not very happy using hand-compiled sources, but please
> > give them at least a try.
> >
> > Thanks,
> > Honza
> >
> > > Thanks,
> > > Attila
> > >
> > >
> > >
> > >>
> > >> Regards,
> > >> Honza
> > >>
> > >>>
> > >>> There are also a few things that might or might not be related:
> > >>>
> > >>> 1) Whenever I want to edit the configuration with "crm configure
> > >>> edit",
> >
> > ...
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
>