[Pacemaker] Pacemaker/corosync freeze

Fri Mar 14 08:37:20 UTC 2014

Attila Megyeri wrote:
> Hi Honza,
> 
> What I also found in the log related to the freeze at 12:22:26:
> 
> 
> Corosync main process was not scheduled for XXXX... Could this be the general cause of the issue?
> 
> 
> 
> Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:58597->[10.9.1.3]:161
> Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:47943->[10.9.1.3]:161
> Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:47943->[10.9.1.3]:161
> Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:59647->[10.9.1.3]:161
> 
> 
> Mar 13 12:22:26 ctmgr corosync[3024]:   [MAIN  ] Corosync main process was not scheduled for 6327.5918 ms (threshold is 4000.0000 ms). Consider token timeout increase.
> 
> 
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] The token was lost in the OPERATIONAL state.
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] A processor failed, forming new configuration.
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering GATHER state from 2(The token was lost in the OPERATIONAL state.).
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Creating commit token because I am the rep.
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Saving state aru 6a8c high seq received 6a8c
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Storing new sequence id for ring 7dc
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering COMMIT state.
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] got commit token
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering RECOVERY state.
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [0] member 10.9.1.3:
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [1] member 10.9.1.41:
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [2] member 10.9.1.42:
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [3] member 10.9.1.71:
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [4] member 10.9.1.72:
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [5] member 10.9.2.11:
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [6] member 10.9.2.12:
> 
> ....
> 
> 
> Regards,
> Attila
> 
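
To see how often the "not scheduled" warning quoted above occurs, and how far each pause exceeds the threshold, the log can be scanned for it. The following is only a minimal sketch: the function name `find_scheduling_pauses` and the `SAMPLE` variable are made up for illustration, and the pattern is derived solely from the single log line quoted above, so other corosync versions may word the message differently.

```python
import re

# Pattern derived from the [MAIN] warning quoted in this thread (assumption:
# other corosync versions may format the message differently).
PAUSE_RE = re.compile(
    r"Corosync main process was not scheduled for "
    r"(?P<paused>\d+(?:\.\d+)?) ms "
    r"\(threshold is (?P<threshold>\d+(?:\.\d+)?) ms\)"
)


def find_scheduling_pauses(lines):
    """Yield (paused_ms, threshold_ms, line) for each scheduling warning."""
    for line in lines:
        match = PAUSE_RE.search(line)
        if match:
            yield (
                float(match.group("paused")),
                float(match.group("threshold")),
                line.rstrip("\n"),
            )


# The log line from the message above, used as sample input.
SAMPLE = (
    "Mar 13 12:22:26 ctmgr corosync[3024]:   [MAIN  ] Corosync main process "
    "was not scheduled for 6327.5918 ms (threshold is 4000.0000 ms). "
    "Consider token timeout increase."
)

pauses = list(find_scheduling_pauses([SAMPLE]))
for paused_ms, threshold_ms, _ in pauses:
    print(f"paused {paused_ms:.1f} ms, threshold {threshold_ms:.1f} ms")
```

The same function accepts any iterable of lines, so an open log file object can be passed directly; where corosync writes its log depends on the local configuration.
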
>> -----Original Message-----
>> From: Attila Megyeri [mailto:amegyeri at minerva-soft.com]
>> Sent: Thursday, March 13, 2014 2:27 PM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>>
>>> -----Original Message-----
>>> From: Attila Megyeri [mailto:amegyeri at minerva-soft.com]
>>> Sent: Thursday, March 13, 2014 1:45 PM
>>> To: The Pacemaker cluster resource manager; Andrew Beekhof
>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>
>>> Hello,
>>>
>>>> -----Original Message-----
>>>> From: Jan Friesse [mailto:jfriesse at redhat.com]
>>>> Sent: Thursday, March 13, 2014 10:03 AM
>>>> To: The Pacemaker cluster resource manager
>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>
>>>> ...
>>>>
>>>>>>>>
>>>>>>>> Also can you please try to set debug: on in corosync.conf and
>>>>>>>> paste full corosync.log then?
>>>>>>>
>>>>>>> I set debug to on and did a few restarts, but could not reproduce
>>>>>>> the issue yet. I will post the logs as soon as I manage to
>>>>>>> reproduce it.
>>>>>>>
>>>>>>
>>>>>> Perfect.
>>>>>>
>>>>>> Another option you can try to set is netmtu (1200 is usually safe).
>>>>>
>>>>> Finally I was able to reproduce the issue.
>>>>> I restarted node ctsip2 at 21:10:14, and CPU went to 100% immediately
>>>>> (not when the node was up again).
>>>>>
>>>>> The corosync log with debug on is available at:
>>>>> http://pastebin.com/kTpDqqtm
>>>>>
>>>>>
>>>>> To be honest, I had to wait much longer for this reproduction than
>>>>> before, even though there was no change in the corosync
>>>>> configuration, just potentially some system updates. But anyway, the
>>>>> issue is unfortunately still there.
>>>>> Previously, when this issue occurred, CPU was at 100% on all nodes;
>>>>> this time only on ctmgr, which was the DC...
>>>>>
>>>>> I hope you can find some useful details in the log.
>>>>>
>>>>
>>>> Attila,
>>>> what seems to be interesting is
>>>>
>>>> Configuration ERRORs found during PE processing.  Please run
>>>> "crm_verify -L" to identify issues.
>>>>
>>>> I'm unsure how much of a problem this is, but I'm really not a
>>>> Pacemaker expert.
>>>
>>> Perhaps Andrew could comment on that. Any idea?
>>>
>>>
>>>>
>>>> Anyway, I have a theory about what may be happening, and it looks
>>>> related to IPC (and probably not to the network). But to make sure we
>>>> are not trying to fix an already fixed bug, can you please build:
>>>> - New libqb (0.17.0). There are plenty of fixes in IPC
>>>> - Corosync 2.3.3 (which already has plenty of IPC fixes)
>>>> - And maybe also a newer Pacemaker
>>>>
>>>
>>> I already use Corosync 2.3.3, built from source, and libqb-dev 0.16
>>> from the Ubuntu package.
>>> I am currently building libqb 0.17.0 and will update you on the results.
>>>
>>> In the meantime we had another freeze, which did not seem to be
>>> related to any restarts, but brought all corosync processes to 100%.
>>> Please check out the corosync.log; perhaps it has a different cause:
>>> http://pastebin.com/WMwzv0Rr
>>>
>>>
>>> In the meantime I will install the new libqb and send logs if we have
>>> further issues.
>>>
>>> Thank you very much for your help!
>>>
>>> Regards,
>>> Attila
>>>
>>
>> One more question:
>>
>> If I install libqb 0.17.0 from source, do I need to rebuild corosync as well,
>> or will it be fine if it was built with libqb 0.16.0?
>>

Theoretically everything should work (both libqb and corosync keep
binary compatibility). In practice it's always better to recompile.
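
If corosync is not rebuilt, it may be worth checking which libqb the existing binary actually resolves at run time, for example by inspecting `ldd` output. The following is a rough sketch only: the helper name `resolved_libqb` is made up, and the `SAMPLE_LDD` text (paths and addresses) is hypothetical, not taken from any of the hosts discussed here.

```python
import re

# Matches an ldd line such as "libqb.so.0 => /some/path/libqb.so.0 (0x...)".
LDD_LINE_RE = re.compile(
    r"^\s*(?P<soname>libqb\.so\S*)\s+=>\s+(?P<path>not found|\S+)"
)


def resolved_libqb(ldd_output):
    """Return (soname, resolved_path) for libqb, or None if it is not listed."""
    for line in ldd_output.splitlines():
        match = LDD_LINE_RE.match(line)
        if match:
            return match.group("soname"), match.group("path")
    return None


# Hypothetical ldd output, for illustration only.
SAMPLE_LDD = (
    "\tlinux-vdso.so.1 (0x00007ffd0a5f2000)\n"
    "\tlibqb.so.0 => /usr/local/lib/libqb.so.0 (0x00007f3c1a200000)\n"
    "\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3c19e00000)\n"
)

print(resolved_libqb(SAMPLE_LDD))
```

On a real host the input could come from `subprocess.run(["ldd", path_to_corosync], capture_output=True, text=True).stdout`, where `path_to_corosync` depends on the install prefix used for the source build. A path pointing at the old package location would suggest the new library is not being picked up.
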

>> BTW, in the meantime I installed the new libqb on 3 of the 7 hosts, so I can
>> see if it makes a difference. If I see crashes on the outdated ones, but not on
>> the new ones, we are fine. :)
>>
>> Thanks,
>>
>> Attila
>>
>>>
>>>
>>>> I know you were not very happy using hand-compiled sources, but
>>>> please give them at least a try.
>>>>
>>>> Thanks,
>>>>   Honza
>>>>
>>>>> Thanks,
>>>>> Attila
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>>   Honza
>>>>>>
>>>>>>>
>>>>>>> There are also a few things that might or might not be related:
>>>>>>>
>>>>>>> 1) Whenever I want to edit the configuration with "crm configure
>>>>>>> edit",
>>>>
>>>> ...
>>>>
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org