[Pacemaker] Pacemaker/corosync freeze

Attila Megyeri amegyeri at minerva-soft.com
Thu Mar 13 08:44:50 EDT 2014


Hello,

> -----Original Message-----
> From: Jan Friesse [mailto:jfriesse at redhat.com]
> Sent: Thursday, March 13, 2014 10:03 AM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
> ...
> 
> >>>>
> >>>> Also can you please try to set debug: on in corosync.conf and paste
> >>>> full corosync.log then?
> >>>
> >>> I set debug to on, and did a few restarts but could not reproduce
> >>> the issue yet - will post the logs as soon as I manage to reproduce.
> >>>
> >>
> >> Perfect.
> >>
> >> Another option you can try to set is netmtu (1200 is usually safe).
> >
> > Finally I was able to reproduce the issue.
> > I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately
> > (not when node was up again).
> >
> > The corosync log with debug on is available at:
> > http://pastebin.com/kTpDqqtm
> >
> >
> > To be honest, I had to wait much longer for this reproduction than
> > before, even though there was no change in the corosync configuration -
> > just potentially some system updates. But anyway, the issue is
> > unfortunately still there.
> > Previously, when this issue came, cpu was at 100% on all nodes - this
> > time only on ctmgr, which was the DC...
> >
> > I hope you can find some useful details in the log.
> >
> 
> Attila,
> what seems to be interesting is
> 
> Configuration ERRORs found during PE processing.  Please run "crm_verify -L"
> to identify issues.
> 
> I'm unsure how much of a problem this is, but I'm really not a pacemaker
> expert.

Perhaps Andrew could comment on that. Any idea?


> 
> Anyway, I have a theory about what may be happening, and it looks related
> to IPC (and probably not related to the network). But to make sure we do
> not try to fix an already fixed bug, can you please build:
> - New libqb (0.17.0). There are plenty of fixes in IPC
> - Corosync 2.3.3 (it already has plenty of IPC fixes)
> - And maybe also newer pacemaker
> 

I already use Corosync 2.3.3, built from source, and libqb-dev 0.16 from the Ubuntu package.
I am currently building libqb 0.17.0, will update you on the results.

In the meantime we had another freeze, which did not seem to be related to any restarts, but brought all corosync processes to 100% CPU.
Please check out the corosync.log, perhaps it has a different cause: http://pastebin.com/WMwzv0Rr


In the meantime I will install the new libqb and send logs if we have further issues.
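
For anyone following along, the two corosync.conf settings suggested earlier in this thread (debug: on, and netmtu) would look roughly like the fragment below. This is only a sketch: every other option in the totem and logging sections is omitted, and 1200 is simply the value suggested above as usually safe, not something shown in this thread to fix the freeze.

```
totem {
    # ... existing totem options stay as they are ...
    # lower MTU, as suggested above (1200 is usually safe)
    netmtu: 1200
}

logging {
    # ... existing logging options stay as they are ...
    # enable debug output so the full corosync.log can be posted
    debug: on
}
```

Corosync has to be restarted on each node for these settings to take effect.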

Thank you very much for your help!

Regards,
Attila



> I know you were not very happy using hand-compiled sources, but please
> give them at least a try.
> 
> Thanks,
>   Honza
> 
> > Thanks,
> > Attila
> >
> >
> >
> >>
> >> Regards,
> >>   Honza
> >>
> >>>
> >>> There are also a few things that might or might not be related:
> >>>
> >>> 1) Whenever I want to edit the configuration with "crm configure
> >>> edit",
> 
> ...
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



