[Pacemaker] Pacemaker/corosync freeze

Attila Megyeri amegyeri at minerva-soft.com
Thu Mar 13 09:50:57 EDT 2014


Hi Honza,

What I also found in the log related to the freeze at 12:22:26:


Corosync main process was not scheduled for XXXX... Could this be the general cause of the issue?



Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:58597->[10.9.1.3]:161
Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:47943->[10.9.1.3]:161
Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:47943->[10.9.1.3]:161
Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:59647->[10.9.1.3]:161


Mar 13 12:22:26 ctmgr corosync[3024]:   [MAIN  ] Corosync main process was not scheduled for 6327.5918 ms (threshold is 4000.0000 ms). Consider token timeout increase.


Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] The token was lost in the OPERATIONAL state.
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] A processor failed, forming new configuration.
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering GATHER state from 2(The token was lost in the OPERATIONAL state.).
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Creating commit token because I am the rep.
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Saving state aru 6a8c high seq received 6a8c
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Storing new sequence id for ring 7dc
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering COMMIT state.
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] got commit token
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering RECOVERY state.
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [0] member 10.9.1.3:
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [1] member 10.9.1.41:
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [2] member 10.9.1.42:
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [3] member 10.9.1.71:
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [4] member 10.9.1.72:
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [5] member 10.9.2.11:
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [6] member 10.9.2.12:

....
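
To check other logs for the same warning, here is a minimal sketch that scans syslog lines for the "[MAIN  ]" scheduling message and reports how far each pause exceeded the threshold. The regular expression is based only on the message format shown above; the function name find_scheduling_pauses and the 15-character timestamp slice are my own assumptions for illustration, not anything provided by corosync.

```python
import re

# Pattern derived from the "[MAIN  ]" warning quoted above (assumed stable format).
PAUSE_RE = re.compile(
    r"Corosync main process was not scheduled for "
    r"(?P<delay>[\d.]+) ms \(threshold is (?P<threshold>[\d.]+) ms\)"
)


def find_scheduling_pauses(lines):
    """Yield (timestamp, delay_ms, threshold_ms) for each matching log line."""
    for line in lines:
        match = PAUSE_RE.search(line)
        if match:
            # Assumes the classic syslog prefix, e.g. "Mar 13 12:22:26".
            timestamp = line[:15]
            yield (
                timestamp,
                float(match.group("delay")),
                float(match.group("threshold")),
            )


sample = [
    "Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: "
    "[10.9.1.5]:58597->[10.9.1.3]:161",
    "Mar 13 12:22:26 ctmgr corosync[3024]:   [MAIN  ] Corosync main process "
    "was not scheduled for 6327.5918 ms (threshold is 4000.0000 ms). "
    "Consider token timeout increase.",
]

pauses = list(find_scheduling_pauses(sample))
for timestamp, delay, threshold in pauses:
    print(f"{timestamp}: paused {delay:.1f} ms, "
          f"{delay - threshold:.1f} ms over threshold")
```

In practice the sample list would be replaced by the lines of the real log file, which makes it easy to see whether such pauses line up with the times of the freezes.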


Regards,
Attila

> -----Original Message-----
> From: Attila Megyeri [mailto:amegyeri at minerva-soft.com]
> Sent: Thursday, March 13, 2014 2:27 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
> 
> > -----Original Message-----
> > From: Attila Megyeri [mailto:amegyeri at minerva-soft.com]
> > Sent: Thursday, March 13, 2014 1:45 PM
> > To: The Pacemaker cluster resource manager; Andrew Beekhof
> > Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >
> > Hello,
> >
> > > -----Original Message-----
> > > From: Jan Friesse [mailto:jfriesse at redhat.com]
> > > Sent: Thursday, March 13, 2014 10:03 AM
> > > To: The Pacemaker cluster resource manager
> > > Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> > >
> > > ...
> > >
> > > >>>>
> > > >>>> Also can you please try to set debug: on in corosync.conf and
> > > >>>> paste full corosync.log then?
> > > >>>
> > > >>> I set debug to on, and did a few restarts but could not
> > > >>> reproduce the issue yet - will post the logs as soon as I manage
> > > >>> to reproduce.
> > > >>>
> > > >>
> > > >> Perfect.
> > > >>
> > > >> Another option you can try to set is netmtu (1200 is usually safe).
> > > >
> > > > Finally I was able to reproduce the issue.
> > > > I restarted node ctsip2 at 21:10:14, and CPU went to 100% immediately
> > > > (not when the node was up again).
> > > >
> > > > The corosync log with debug on is available at:
> > > > http://pastebin.com/kTpDqqtm
> > > >
> > > >
> > > > To be honest, I had to wait much longer for this reproduction than
> > > > before, even though there was no change in the corosync
> > > > configuration - just potentially some system updates. But anyway,
> > > > the issue is unfortunately still there.
> > > > Previously, when this issue occurred, CPU was at 100% on all nodes -
> > > > this time only on ctmgr, which was the DC...
> > > >
> > > > I hope you can find some useful details in the log.
> > > >
> > >
> > > Attila,
> > > what seems to be interesting is
> > >
> > > Configuration ERRORs found during PE processing.  Please run
> > > "crm_verify -L" to identify issues.
> > >
> > > I'm unsure how much of a problem this is, but I'm really not a
> > > Pacemaker expert.
> >
> > Perhaps Andrew could comment on that. Any idea?
> >
> >
> > >
> > > Anyway, I have a theory about what may be happening, and it looks
> > > related to IPC (and probably not to the network). But to make sure
> > > we are not trying to fix an already fixed bug, can you please build:
> > > - New libqb (0.17.0). There are plenty of fixes in IPC
> > > - Corosync 2.3.3 (which already has plenty of IPC fixes)
> > > - And maybe also a newer pacemaker
> > >
> >
> > I already use Corosync 2.3.3, built from source, and libqb-dev 0.16
> > from Ubuntu package.
> > I am currently building libqb 0.17.0, will update you on the results.
> >
> > In the meantime we had another freeze, which did not seem to be
> > related to any restarts, but brought all corosync processes to 100% CPU.
> > Please check out the corosync.log, perhaps it is a different cause:
> > http://pastebin.com/WMwzv0Rr
> >
> >
> > In the meantime I will install the new libqb and send logs if we have
> > further issues.
> >
> > Thank you very much for your help!
> >
> > Regards,
> > Attila
> >
> 
> One more question:
> 
> If I install libqb 0.17.0 from source, do I need to rebuild corosync as well,
> or will a corosync built against libqb 0.16.0 be fine?
> 
> BTW, in the meantime I installed the new libqb on 3 of the 7 hosts, so I can
> see if it makes a difference. If I see crashes on the outdated ones, but not on
> the new ones, we are fine. :)
> 
> Thanks,
> 
> Attila
> 
> >
> >
> > > I know you were not very happy using hand-compiled sources, but
> > > please give them at least a try.
> > >
> > > Thanks,
> > >   Honza
> > >
> > > > Thanks,
> > > > Attila
> > > >
> > > >
> > > >
> > > >>
> > > >> Regards,
> > > >>   Honza
> > > >>
> > > >>>
> > > >>> There are also a few things that might or might not be related:
> > > >>>
> > > >>> 1) Whenever I want to edit the configuration with "crm configure
> > > >>> edit",
> > >
> > > ...
> > >
> > > _______________________________________________
> > > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > >
> > > Project Home: http://www.clusterlabs.org
> > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > Bugs: http://bugs.clusterlabs.org
> >



