[Pacemaker] Pacemaker/corosync freeze
Jan Friesse
jfriesse at redhat.com
Wed Mar 12 13:27:03 UTC 2014
Attila Megyeri wrote:
> Hello Jan,
>
> Thank you very much for your help so far.
>
>> -----Original Message-----
>> From: Jan Friesse [mailto:jfriesse at redhat.com]
>> Sent: Wednesday, March 12, 2014 9:51 AM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>> Attila Megyeri wrote:
>>>
>>>> -----Original Message-----
>>>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
>>>> Sent: Tuesday, March 11, 2014 10:27 PM
>>>> To: The Pacemaker cluster resource manager
>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>
>>>>
>>>> On 12 Mar 2014, at 1:54 am, Attila Megyeri
>>>> <amegyeri at minerva-soft.com>
>>>> wrote:
>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
>>>>>> Sent: Tuesday, March 11, 2014 12:48 AM
>>>>>> To: The Pacemaker cluster resource manager
>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>>>
>>>>>>
>>>>>> On 7 Mar 2014, at 5:54 pm, Attila Megyeri
>>>>>> <amegyeri at minerva-soft.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for the quick response!
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
>>>>>>>> Sent: Friday, March 07, 2014 3:48 AM
>>>>>>>> To: The Pacemaker cluster resource manager
>>>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>>>>>
>>>>>>>>
>>>>>>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri
>>>>>>>> <amegyeri at minerva-soft.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> We have a strange issue with Corosync/Pacemaker.
>>>>>>>>> From time to time, something unexpected happens and suddenly the
>>>>>>>>> crm_mon output remains static.
>>>>>>>>> When I check the cpu usage, I see that one of the cores uses
>>>>>>>>> 100% cpu, but
>>>>>>>> cannot actually match it to either the corosync or one of the
>>>>>>>> pacemaker processes.
>>>>>>>>>
>>>>>>>>> In such a case, this high CPU usage is happening on all 7 nodes.
>>>>>>>>> I have to manually go to each node, stop pacemaker, restart
>>>>>>>>> corosync, then start pacemaker. Stopping pacemaker and corosync
>>>>>>>>> does not work in most of the cases; usually a kill -9 is needed.
>>>>>>>>>
>>>>>>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
>>>>>>>>>
>>>>>>>>> Using udpu as transport, two rings on Gigabit Ethernet, rrp_mode
>>>>>>>>> passive.
>>>>>>>>>
>>>>>>>>> Logs are usually flooded with CPG related messages, such as:
>>>>>>>>>
>>>>>>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>>>>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>>>>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>>>>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>>>>>>>
>>>>>>>>> OR
>>>>>>>>>
>>>>>>>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (6)
>>>>>>>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (6)
>>>>>>>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (6)
>>>>>>>>
>>>>>>>> That is usually a symptom of corosync getting into a horribly
>>>>>>>> confused
>>>>>> state.
>>>>>>>> Version? Distro? Have you checked for an update?
>>>>>>>> Odd that the user of all that CPU isn't showing up though.
>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>> As I wrote I use Ubuntu trusty, the exact package versions are:
>>>>>>>
>>>>>>> corosync 2.3.0-1ubuntu5
>>>>>>> pacemaker 1.1.10+git20130802-1ubuntu2
>>>>>>
>>>>>> Ah sorry, I seem to have missed that part.
>>>>>>
>>>>>>>
>>>>>>> There are no updates available. The only option is to install from
>>>>>>> sources,
>>>>>> but that would be very difficult to maintain and I'm not sure I
>>>>>> would get rid of this issue.
>>>>>>>
>>>>>>> What do you recommend?
>>>>>>
>>>>>> The same thing as Lars, or switch to a distro that stays current
>>>>>> with upstream (git shows 5 newer releases for that branch since it
>>>>>> was released 3 years ago).
>>>>>> If you do build from source, it's probably best to go with v1.4.6
>>>>>
>>>>> Hm, I am a bit confused here. We are using 2.3.0,
>>>>
>>>> I swapped the 2 for a 1 somehow. A bit distracted, sorry.
>>>
>>> I upgraded all nodes to 2.3.3 and at first it seemed a bit better, but
>>> it is still the same issue - after some time the CPU goes to 100%, and
>>> the corosync log is flooded with messages like:
>>>
>>> Mar 12 07:36:55 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
>>> Mar 12 07:36:55 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)
>>> Mar 12 07:36:56 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
>>> Mar 12 07:36:56 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)
>>> Mar 12 07:36:57 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
>>> Mar 12 07:36:57 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)
>>> Mar 12 07:36:57 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
>>>
>>>
>>
>> Attila,
>>
>>> Shall I try to downgrade to 1.4.6? What is the difference in that build?
>>> Or where should I start troubleshooting?
>>
>> First of all, the 1.x branch (flatiron) is maintained, so even though it
>> looks like an old version, it's quite current. It contains more or less
>> only bugfixes.
>>
>
> OK - The next thing I will try will be to downgrade to 1.4.6 if the troubleshooting does not bring us closer.
> Actually we have a couple of clusters running 1.4.2, but stack is "openais" not corosync. Currently we use "corosync".
>
>
>> 2.x branch (needle) contains not only bugfixes but also new features.
>>
>> Keep in mind that with 1.x you need to use cman as quorum provider (2.x
>> contains quorum in base).
>>
>> There are no big differences in build.
>>
>> But back to your original question. Of course troubleshooting is always
>> better.
>>
>> The "Try again" error (6) happens when corosync is in the sync state. This
>> occurs when a NEW node is discovered or there is a network split/merge,
>> and it usually takes only a few milliseconds. The problem you are hitting
>> is usually caused by some network issue.
>
> I can confirm this. The 100% cpu issue happens when I restart one of the nodes. It seems that it is happening when a given node comes back up and a new membership is about to be formed.
>
>
>>
>> So first of all, take a look at corosync.log (/var/log/cluster/corosync.log).
>> Do you see any warnings/errors there?
>
> Not really. I reproduced a case so you can see for yourself.
> Initially I had a stable cluster.
> At 10:42:39 I did a reboot on the "ctsip1" node. All was fine until the node came back up (around 10:43:00). At this point, the cpu usage went to 100% and corosync stopped working properly.
>
> here is the relevant corosync.log: http://pastebin.com/HJENEdZj
>
Does that log file continue somehow? I mean, the interesting part is:
Mar 12 10:43:00 [973] ctdb2 corosync notice [TOTEM ] A new membership
(10.9.1.3:1592) was formed. Members joined: 168362281
That means a new membership was formed and sync is now running, but there is
no counterpart message (which would look like:
Mar 12 10:42:40 [973] ctdb2 corosync notice [MAIN ] Completed service
synchronization, ready to provide service. ).
Can you please try removing RRP and using pure SRP (a single ring)?
Also, can you please set debug: on in corosync.conf and then paste the full
corosync.log?
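For reference, the debug switch lives in the logging section of corosync.conf; a typical fragment looks like this (see corosync.conf(5) for the full option list; the logfile path is the one already used above):

```
logging {
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
        debug: on
        timestamp: on
}
```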
Regards,
Honza
>
>
>>
>> What transport are you using? Multicast (udp) or unicast (udpu)?
>>
>> Can you please paste your corosync.conf?
>
> We use udpu, since the servers are in different subnets and multicast did not work as expected. (In our other systems we use multicast).
>
> The corosync.conf is at: http://pastebin.com/dMivQJn5
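The actual file is in the pastebin above; for archive readers without access to it, a minimal udpu layout for corosync 2.x looks roughly like the following (the addresses are placeholders, not taken from the real configuration):

```
totem {
        version: 2
        transport: udpu
        rrp_mode: passive
        interface {
                ringnumber: 0
                bindnetaddr: 10.9.1.0    # placeholder network
        }
}

nodelist {
        node {
                ring0_addr: 10.9.1.3     # placeholder; one node block per member
        }
}

quorum {
        provider: corosync_votequorum
}
```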
>
>
> Thank you in advance,
>
> Regards,
> Attila
>
>
>>
>> Regards,
>> Honza
>>
>>>
>>> Thank you in advance.
>>>
>>>>
>>>>> which was released approx. a year ago (you mention 3 years) and you
>>>> recommend 1.4.6, which is a rather old version.
>>>>> Could you please clarify a bit? :)
>>>>> Lars recommends 2.3.3 git tree.
>>>>>
>>>>> I might end up trying both, but just want to make sure I am not
>>>> misunderstanding something badly.
>>>>>
>>>>> Thank you!
>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>> HTOP show something like this (sorted by TIME+ descending):
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 1 [||||||||||||||||||||||||||||||||||||||||100.0%] Tasks: 59, 4 thr; 2 running
>>>>>>>>> 2 [| 0.7%] Load average: 1.00 0.99 1.02
>>>>>>>>> Mem[|||||||||||||||||||||||||||||||| 165/994MB] Uptime: 1 day, 10:22:03
>>>>>>>>> Swp[ 0/509MB]
>>>>>>>>>
>>>>>>>>> PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
>>>>>>>>> 921 root 20 0 188M 49220 33856 R 0.0 4.8 3h33:58 /usr/sbin/corosync
>>>>>>>>> 1277 snmp 20 0 45708 4248 1472 S 0.0 0.4 1:33.07 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
>>>>>>>>> 1311 hacluster 20 0 109M 16160 9640 S 0.0 1.6 1:12.71 /usr/lib/pacemaker/cib
>>>>>>>>> 1312 root 20 0 104M 7484 3780 S 0.0 0.7 0:38.06 /usr/lib/pacemaker/stonithd
>>>>>>>>> 1611 root -2 0 4408 2356 2000 S 0.0 0.2 0:24.15 /usr/sbin/watchdog
>>>>>>>>> 1316 hacluster 20 0 122M 9756 5924 S 0.0 1.0 0:22.62 /usr/lib/pacemaker/crmd
>>>>>>>>> 1313 root 20 0 81784 3800 2876 S 0.0 0.4 0:18.64 /usr/lib/pacemaker/lrmd
>>>>>>>>> 1314 hacluster 20 0 96616 4132 2604 S 0.0 0.4 0:16.01 /usr/lib/pacemaker/attrd
>>>>>>>>> 1309 root 20 0 104M 4804 2580 S 0.0 0.5 0:15.56 pacemakerd
>>>>>>>>> 1250 root 20 0 33000 1192 928 S 0.0 0.1 0:13.59 ha_logd: read process
>>>>>>>>> 1315 hacluster 20 0 73892 2652 1952 S 0.0 0.3 0:13.25 /usr/lib/pacemaker/pengine
>>>>>>>>> 1252 root 20 0 33000 712 456 S 0.0 0.1 0:13.03 ha_logd: write process
>>>>>>>>> 1835 ntp 20 0 27216 1980 1408 S 0.0 0.2 0:11.80 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:112
>>>>>>>>> 899 root 20 0 19168 700 488 S 0.0 0.1 0:09.75 /usr/sbin/irqbalance
>>>>>>>>> 1642 root 20 0 30696 1556 912 S 0.0 0.2 0:06.49 /usr/bin/monit -c /etc/monit/monitrc
>>>>>>>>> 4374 kamailio 20 0 291M 7272 2188 S 0.0 0.7 0:02.77 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>>> 3079 root 0 -20 16864 4592 3508 S 0.0 0.5 0:01.51 /usr/bin/atop -a -w /var/log/atop/atop_20140306 6
>>>>>>>>> 445 syslog 20 0 249M 6276 976 S 0.0 0.6 0:01.16 rsyslogd
>>>>>>>>> 4373 kamailio 20 0 291M 7492 2396 S 0.0 0.7 0:01.03 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>>> 1 root 20 0 33376 2632 1404 S 0.0 0.3 0:00.63 /sbin/init
>>>>>>>>> 453 syslog 20 0 249M 6276 976 S 0.0 0.6 0:00.63 rsyslogd
>>>>>>>>> 451 syslog 20 0 249M 6276 976 S 0.0 0.6 0:00.53 rsyslogd
>>>>>>>>> 4379 kamailio 20 0 291M 6224 1132 S 0.0 0.6 0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>>> 4380 kamailio 20 0 291M 8516 3084 S 0.0 0.8 0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>>> 4381 kamailio 20 0 291M 8252 2828 S 0.0 0.8 0:00.37 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>>> 23315 root 20 0 24872 2476 1412 R 0.7 0.2 0:00.37 htop
>>>>>>>>> 4367 kamailio 20 0 291M 10000 4864 S 0.0 1.0 0:00.36 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> My questions:
>>>>>>>>> - Is this a corosync or pacemaker issue?
>>>>>>>>> - What are the CPG messages? Is it possible that we have a firewall
>>>>>>>>> issue?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Any hints would be great!
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Attila
>>>>>>>>> _______________________________________________
>>>>>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>>>
>>>>>>>>> Project Home: http://www.clusterlabs.org Getting started:
>>>>>>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>>> Bugs: http://bugs.clusterlabs.org
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
>>
>>
>
>