[Pacemaker] Pacemaker/corosync freeze
Jan Friesse
jfriesse at redhat.com
Wed Mar 12 15:30:53 UTC 2014
Attila Megyeri wrote:
>> -----Original Message-----
>> From: Jan Friesse [mailto:jfriesse at redhat.com]
>> Sent: Wednesday, March 12, 2014 2:27 PM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>> Attila Megyeri wrote:
>>> Hello Jan,
>>>
>>> Thank you very much for your help so far.
>>>
>>>> -----Original Message-----
>>>> From: Jan Friesse [mailto:jfriesse at redhat.com]
>>>> Sent: Wednesday, March 12, 2014 9:51 AM
>>>> To: The Pacemaker cluster resource manager
>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>
>>>> Attila Megyeri wrote:
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
>>>>>> Sent: Tuesday, March 11, 2014 10:27 PM
>>>>>> To: The Pacemaker cluster resource manager
>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>>>
>>>>>>
>>>>>> On 12 Mar 2014, at 1:54 am, Attila Megyeri
>>>>>> <amegyeri at minerva-soft.com>
>>>>>> wrote:
>>>>>>
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
>>>>>>>> Sent: Tuesday, March 11, 2014 12:48 AM
>>>>>>>> To: The Pacemaker cluster resource manager
>>>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>>>>>
>>>>>>>>
>>>>>>>> On 7 Mar 2014, at 5:54 pm, Attila Megyeri
>>>>>>>> <amegyeri at minerva-soft.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the quick response!
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
>>>>>>>>>> Sent: Friday, March 07, 2014 3:48 AM
>>>>>>>>>> To: The Pacemaker cluster resource manager
>>>>>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri
>>>>>>>>>> <amegyeri at minerva-soft.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> We have a strange issue with Corosync/Pacemaker.
>>>>>>>>>>> From time to time, something unexpected happens and suddenly the crm_mon output remains static.
>>>>>>>>>>> When I check the CPU usage, I see that one of the cores is at 100%, but I cannot actually match it to either the corosync or one of the pacemaker processes.
>>>>>>>>>>>
>>>>>>>>>>> In such a case, this high CPU usage is happening on all 7 nodes.
>>>>>>>>>>> I have to manually go to each node, stop pacemaker, restart corosync, then start pacemaker. Stopping pacemaker and corosync does not work in most of the cases; usually a kill -9 is needed.
>>>>>>>>>>>
>>>>>>>>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
>>>>>>>>>>>
>>>>>>>>>>> Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
>>>>>>>>>>>
>>>>>>>>>>> Logs are usually flooded with CPG related messages, such as:
>>>>>>>>>>>
>>>>>>>>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>>>>>>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>>>>>>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>>>>>>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>>>>>>>>>
>>>>>>>>>>> OR
>>>>>>>>>>>
>>>>>>>>>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
>>>>>>>>>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
>>>>>>>>>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
>>>>>>>>>>
>>>>>>>>>> That is usually a symptom of corosync getting into a horribly confused state.
>>>>>>>>>> Version? Distro? Have you checked for an update?
>>>>>>>>>> Odd that the user of all that CPU isn't showing up though.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> As I wrote, I use Ubuntu trusty; the exact package versions are:
>>>>>>>>>
>>>>>>>>> corosync 2.3.0-1ubuntu5
>>>>>>>>> pacemaker 1.1.10+git20130802-1ubuntu2
>>>>>>>>
>>>>>>>> Ah sorry, I seem to have missed that part.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> There are no updates available. The only option is to install from sources, but that would be very difficult to maintain, and I'm not sure it would get rid of this issue.
>>>>>>>>>
>>>>>>>>> What do you recommend?
>>>>>>>>
>>>>>>>> The same thing as Lars, or switch to a distro that stays current
>>>>>>>> with upstream (git shows 5 newer releases for that branch since
>>>>>>>> it was released 3 years ago).
>>>>>>>> If you do build from source, it's probably best to go with v1.4.6.
>>>>>>>
>>>>>>> Hm, I am a bit confused here. We are using 2.3.0,
>>>>>>
>>>>>> I swapped the 2 for a 1 somehow. A bit distracted, sorry.
>>>>>
>>>>> I upgraded all nodes to 2.3.3 and at first it seemed a bit better, but still the same issue - after some time the CPU gets to 100%, and the corosync log is flooded with messages like:
>>>>>
>>>>> Mar 12 07:36:55 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
>>>>> Mar 12 07:36:55 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)
>>>>> Mar 12 07:36:56 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
>>>>> Mar 12 07:36:56 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)
>>>>> Mar 12 07:36:57 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
>>>>> Mar 12 07:36:57 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)
>>>>> Mar 12 07:36:57 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
>>>>>
>>>>>
>>>>
>>>> Attila,
>>>>
>>>>> Shall I try to downgrade to 1.4.6? What is the difference in that build? Or where should I start troubleshooting?
>>>>
>>>> First of all, the 1.x branch (flatiron) is still maintained, so even though it looks like an old version, it is quite current. It contains more or less only bugfixes.
>>>>
>>>
>>> OK - the next thing I will try is to downgrade to 1.4.6, if the troubleshooting does not bring us closer.
>>> Actually, we have a couple of clusters running 1.4.2, but the stack is "openais", not corosync. Currently we use "corosync".
>>>
>>>
>>>> 2.x branch (needle) contains not only bugfixes but also new features.
>>>>
>>>> Keep in mind that with 1.x you need to use cman as quorum provider
>>>> (2.x contains quorum in base).
>>>>
>>>> There are no big differences in build.
>>>>
>>>> But back to your original question. Of course troubleshooting is
>>>> always better.
>>>>
>>>> The "Try again" error (6) happens when corosync is in the sync state. This occurs when a NEW node is discovered or there is a network split/merge, and it usually takes only a few milliseconds. The problem you are hitting is usually caused by some network issue.
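
For context: every CPG client (pacemaker's crm_cs_flush() included) has to
retry on that error. A minimal illustrative sketch against the corosync CPG
API - not the actual pacemaker code - looks roughly like this:

#include <sys/uio.h>             /* struct iovec */
#include <unistd.h>              /* usleep() */
#include <corosync/corotypes.h>  /* cs_error_t, CS_OK, CS_ERR_TRY_AGAIN */
#include <corosync/cpg.h>        /* cpg_handle_t, cpg_mcast_joined() */

/* Send one CPG message, retrying while corosync reports TRY_AGAIN.
 * TRY_AGAIN is normal for a few milliseconds around a membership
 * change; getting it continuously for minutes, as in the logs above,
 * means the sync phase never completed. */
static int send_with_retry(cpg_handle_t handle, void *data, size_t len)
{
    struct iovec iov = { .iov_base = data, .iov_len = len };
    cs_error_t rc;

    do {
        rc = cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
        if (rc == CS_ERR_TRY_AGAIN)
            usleep(10000);       /* back off 10 ms and retry */
    } while (rc == CS_ERR_TRY_AGAIN);

    return (rc == CS_OK) ? 0 : -1;
}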
>>>
>>> I can confirm this. The 100% CPU issue happens when I restart one of the nodes. It seems to happen when a given node comes back up and a new membership is about to be formed.
>>>
>>>>
>>>> So first of all take a look to corosync.log
>>>> (/var/log/cluster/corosync.log). Do you see some warning/error there?
>>>
>>> Not really. I reproduced a case so you can see for yourself.
>>> Initially I had a stable cluster.
>>> At 10:42:39 I did a reboot on the "ctsip1" node. All was fine until the node came back up (around 10:43:00). At this point, the CPU usage went to 100% and corosync stopped working properly.
>>>
>>> here is the relevant corosync.log: http://pastebin.com/HJENEdZj
>>>
>>
>> Does that log file continue somehow? I mean, the interesting part is:
>> Mar 12 10:43:00 [973] ctdb2 corosync notice [TOTEM ] A new membership (10.9.1.3:1592) was formed. Members joined: 168362281
>
> That was all I sent. From 43:00 the corosync process appears to be unresponsive. One of the cores was at 100% CPU, but from htop, top, or similar apps it is impossible to guess which app it is. Of course, killing corosync with -9 lowers the CPU usage.
>
That sounds very bad.
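
If it happens again, a per-thread view may show which thread is spinning even
when the per-process numbers look inconclusive; just a suggestion, but
something like:

    top -H -p $(pidof corosync)

(-H switches top to per-thread display) would at least confirm whether it is
corosync itself eating the core.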
>>
>> That means a new membership was formed and sync is now running, but there is no counterpart message, which would look like:
>>
>> Mar 12 10:42:40 [973] ctdb2 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
>
>
>>
>> Can you please try to remove RRP and use pure SRP (a single ring)?
>
> If I understand correctly, you are asking me to remove the redundant link (the second interface block from the .conf) completely, right?
>
Yes, exactly. I don't think it will help but it's at least good to try.
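
For example (an illustrative sketch only - the addresses below are
placeholders, your real values are in your pastebin config):

totem {
        version: 2
        transport: udpu

        # single ring: no rrp_mode and no second interface block
        interface {
                ringnumber: 0
                bindnetaddr: 10.9.1.0
                mcastport: 5405
        }
}

nodelist {
        node {
                ring0_addr: 10.9.1.3
        }
        node {
                ring0_addr: 10.9.1.4
        }
}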
>
>>
>> Also can you please try to set debug: on in corosync.conf and paste full
>> corosync.log then?
>
> I set debug to on, and did a few restarts but could not reproduce the issue yet - will post the logs as soon as I manage to reproduce.
>
Perfect.
Another option you can try to set is netmtu (1200 is usually safe).
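
That is, in corosync.conf (a sketch; netmtu goes in the totem section, debug
in the logging section mentioned above):

totem {
        netmtu: 1200
}

logging {
        debug: on
}

1200 stays safely below the usual 1500-byte Ethernet MTU even after the totem
and UDP/IP headers are added, which helps if something on the path mishandles
full-size or fragmented frames.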
Regards,
Honza
>
> There are also a few things that might or might not be related:
>
> 1) Whenever I want to edit the configuration with "crm configure edit", upon save I get a similar error:
> " ERROR: 47: duplicate element cib-bootstrap-options
> Do you want to edit again? "
> But there is no such duplicate element as far as I can tell. This might be a crmsh issue, and not related to corosync at all; just mentioning it.
>
> 2)
> "Mar 11 21:31:11 [4797] ctdb2 pengine: error: process_pe_message: Calculated Transition 27: /var/lib/pacemaker/pengine/pe-error-7.bz2
> Mar 11 21:31:11 [4797] ctdb2 pengine: notice: process_pe_message: Configuration ERRORs found during PE processing. Please run "crm_verify -L" to identify issues."
>
> But crm_verify -L shows no problems at all...
>
>
>
>
>>
>> Regards,
>> Honza
>>
>>>
>>>
>>>>
>>>> What transport are you using? Multicast (udp) or unicast (udpu)?
>>>>
>>>> Can you please paste your corosync.conf?
>>>
>>> We use udpu, since the servers are in different subnets and multicast did not work as expected. (In our other systems we use multicast.)
>>>
>>> The corosync.conf is at: http://pastebin.com/dMivQJn5
>>>
>>>
>>> Thank you in advance,
>>>
>>> Regards,
>>> Attila
>>>
>>>
>>>>
>>>> Regards,
>>>> Honza
>>>>
>>>>>
>>>>> Thank you in advance.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>> which was released approx. a year ago (you mention 3 years), and you recommend 1.4.6, which is a rather old version.
>>>>>>> Could you please clarify a bit? :) Lars recommends the 2.3.3 git tree.
>>>>>>>
>>>>>>> I might end up trying both, but I just want to make sure I am not misunderstanding something badly.
>>>>>>>
>>>>>>> Thank you!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> HTOP shows something like this (sorted by TIME+ descending):
>>>>>>>>>>>
>>>>>>>>>>> 1 [||||||||||||||||||||||||||||||||||||||||100.0%] Tasks: 59, 4 thr; 2 running
>>>>>>>>>>> 2 [| 0.7%] Load average: 1.00 0.99 1.02
>>>>>>>>>>> Mem[|||||||||||||||||||||||||||||||| 165/994MB] Uptime: 1 day, 10:22:03
>>>>>>>>>>> Swp[ 0/509MB]
>>>>>>>>>>>
>>>>>>>>>>> PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
>>>>>>>>>>> 921 root 20 0 188M 49220 33856 R 0.0 4.8 3h33:58 /usr/sbin/corosync
>>>>>>>>>>> 1277 snmp 20 0 45708 4248 1472 S 0.0 0.4 1:33.07 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
>>>>>>>>>>> 1311 hacluster 20 0 109M 16160 9640 S 0.0 1.6 1:12.71 /usr/lib/pacemaker/cib
>>>>>>>>>>> 1312 root 20 0 104M 7484 3780 S 0.0 0.7 0:38.06 /usr/lib/pacemaker/stonithd
>>>>>>>>>>> 1611 root -2 0 4408 2356 2000 S 0.0 0.2 0:24.15 /usr/sbin/watchdog
>>>>>>>>>>> 1316 hacluster 20 0 122M 9756 5924 S 0.0 1.0 0:22.62 /usr/lib/pacemaker/crmd
>>>>>>>>>>> 1313 root 20 0 81784 3800 2876 S 0.0 0.4 0:18.64 /usr/lib/pacemaker/lrmd
>>>>>>>>>>> 1314 hacluster 20 0 96616 4132 2604 S 0.0 0.4 0:16.01 /usr/lib/pacemaker/attrd
>>>>>>>>>>> 1309 root 20 0 104M 4804 2580 S 0.0 0.5 0:15.56 pacemakerd
>>>>>>>>>>> 1250 root 20 0 33000 1192 928 S 0.0 0.1 0:13.59 ha_logd: read process
>>>>>>>>>>> 1315 hacluster 20 0 73892 2652 1952 S 0.0 0.3 0:13.25 /usr/lib/pacemaker/pengine
>>>>>>>>>>> 1252 root 20 0 33000 712 456 S 0.0 0.1 0:13.03 ha_logd: write process
>>>>>>>>>>> 1835 ntp 20 0 27216 1980 1408 S 0.0 0.2 0:11.80 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:112
>>>>>>>>>>> 899 root 20 0 19168 700 488 S 0.0 0.1 0:09.75 /usr/sbin/irqbalance
>>>>>>>>>>> 1642 root 20 0 30696 1556 912 S 0.0 0.2 0:06.49 /usr/bin/monit -c /etc/monit/monitrc
>>>>>>>>>>> 4374 kamailio 20 0 291M 7272 2188 S 0.0 0.7 0:02.77 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>>>>> 3079 root 0 -20 16864 4592 3508 S 0.0 0.5 0:01.51 /usr/bin/atop -a -w /var/log/atop/atop_20140306 6
>>>>>>>>>>> 445 syslog 20 0 249M 6276 976 S 0.0 0.6 0:01.16 rsyslogd
>>>>>>>>>>> 4373 kamailio 20 0 291M 7492 2396 S 0.0 0.7 0:01.03 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>>>>> 1 root 20 0 33376 2632 1404 S 0.0 0.3 0:00.63 /sbin/init
>>>>>>>>>>> 453 syslog 20 0 249M 6276 976 S 0.0 0.6 0:00.63 rsyslogd
>>>>>>>>>>> 451 syslog 20 0 249M 6276 976 S 0.0 0.6 0:00.53 rsyslogd
>>>>>>>>>>> 4379 kamailio 20 0 291M 6224 1132 S 0.0 0.6 0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>>>>> 4380 kamailio 20 0 291M 8516 3084 S 0.0 0.8 0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>>>>> 4381 kamailio 20 0 291M 8252 2828 S 0.0 0.8 0:00.37 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>>>>> 23315 root 20 0 24872 2476 1412 R 0.7 0.2 0:00.37 htop
>>>>>>>>>>> 4367 kamailio 20 0 291M 10000 4864 S 0.0 1.0 0:00.36 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> My questions:
>>>>>>>>>>> - Is this a corosync or pacemaker issue?
>>>>>>>>>>> - What are the CPG messages? Is it possible that we have a firewall issue?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Any hints would be great!
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Attila