[Pacemaker] Pacemaker/corosync freeze

David Vossel dvossel at redhat.com
Thu Mar 13 21:22:29 CET 2014





----- Original Message -----
> From: "Jan Friesse" <jfriesse at redhat.com>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Thursday, March 13, 2014 4:03:28 AM
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
> ...
> 
> >>>>
> >>>> Also can you please try to set debug: on in corosync.conf and paste
> >>>> full corosync.log then?
> >>>
> >>> I set debug to on, and did a few restarts but could not reproduce the
> >>> issue yet - will post the logs as soon as I manage to reproduce.
> >>>
> >>
> >> Perfect.
> >>
> >> Another option you can try to set is netmtu (1200 is usually safe).
> > 
> > Finally I was able to reproduce the issue.
> > I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately (not
> > when node was up again).
> > 
> > The corosync log with debug on is available at:
> > http://pastebin.com/kTpDqqtm
> > 
> > 
> > To be honest, I had to wait much longer for this reproduction than before,
> > even though there was no change in the corosync configuration - just
> > potentially some system updates. But anyway, the issue is unfortunately
> > still there.
> > Previously, when this issue occurred, CPU was at 100% on all nodes - this
> > time only on ctmgr, which was the DC...
> > 
> > I hope you can find some useful details in the log.
> > 
> 
> Attila,
> what seems to be interesting is
> 
> Configuration ERRORs found during PE processing.  Please run "crm_verify -L" to identify issues.
> 
> I'm unsure how much of a problem this is, but I'm really not a pacemaker expert.
> 
> Anyway, I have a theory about what may be happening, and it looks related
> to IPC (and probably not to the network). But to make sure we do not try
> to fix an already fixed bug, can you please build:
> - New libqb (0.17.0). There are plenty of fixes in IPC
> - Corosync 2.3.3 (it already has plenty of IPC fixes)

Yes, there was a libqb/corosync interoperation problem last year that showed these same symptoms. Updating to the latest corosync and libqb will likely resolve this.
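
As a quick sanity check before and after upgrading, something along these lines could compare installed versions against the ones Honza suggested. This is only a rough sketch: the function names are made up for illustration, the example versions are placeholders rather than anything taken from Attila's nodes, and it assumes plain dotted numeric version strings.

```python
# Minimal sketch: check whether installed versions meet the versions
# suggested in this thread. Function names are hypothetical.

SUGGESTED = {"libqb": "0.17.0", "corosync": "2.3.3"}  # from Honza's mail


def parse_version(text):
    """Turn '2.3.3' into (2, 3, 3); assumes purely numeric dotted versions."""
    return tuple(int(part) for part in text.strip().split("."))


def needs_upgrade(installed):
    """Return the packages whose installed version is below the suggestion."""
    return sorted(
        name
        for name, minimum in SUGGESTED.items()
        if parse_version(installed[name]) < parse_version(minimum)
    )


if __name__ == "__main__":
    # Placeholder versions for illustration only; substitute whatever
    # your package manager reports on each node.
    example = {"libqb": "0.16.0", "corosync": "2.3.3"}
    print(needs_upgrade(example))
```

Versions are compared as integer tuples rather than as strings, so that e.g. 2.10.0 sorts after 2.3.3.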

> - And maybe also newer pacemaker
> 
> I know you were not very happy using hand-compiled sources, but please
> give them at least a try.
> 
> Thanks,
>   Honza
> 
> > Thanks,
> > Attila
> > 
> > 
> > 
> >>
> >> Regards,
> >>   Honza
> >>
> >>>
> >>> There are also a few things that might or might not be related:
> >>>
> >>> 1) Whenever I want to edit the configuration with "crm configure edit",
> 
> ...
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 


