[Pacemaker] Pacemaker/corosync freeze
Jan Friesse
jfriesse at redhat.com
Wed Mar 12 08:51:18 UTC 2014
Attila Megyeri wrote:
>
>> -----Original Message-----
>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
>> Sent: Tuesday, March 11, 2014 10:27 PM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>>
>> On 12 Mar 2014, at 1:54 am, Attila Megyeri <amegyeri at minerva-soft.com>
>> wrote:
>>
>>>>
>>>> -----Original Message-----
>>>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
>>>> Sent: Tuesday, March 11, 2014 12:48 AM
>>>> To: The Pacemaker cluster resource manager
>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>
>>>>
>>>> On 7 Mar 2014, at 5:54 pm, Attila Megyeri <amegyeri at minerva-soft.com>
>>>> wrote:
>>>>
>>>>> Thanks for the quick response!
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
>>>>>> Sent: Friday, March 07, 2014 3:48 AM
>>>>>> To: The Pacemaker cluster resource manager
>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>>>
>>>>>>
>>>>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri
>>>>>> <amegyeri at minerva-soft.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> We have a strange issue with Corosync/Pacemaker.
>>>>>>> From time to time, something unexpected happens and suddenly the
>>>>>> crm_mon output remains static.
>>>>>>> When I check the cpu usage, I see that one of the cores uses 100%
>>>>>>> cpu, but
>>>>>> cannot actually match it to either the corosync or one of the
>>>>>> pacemaker processes.
>>>>>>>
>>>>>>> In such a case, this high CPU usage is happening on all 7 nodes.
>>>>>>> I have to manually go to each node, stop pacemaker, restart
>>>>>>> corosync, then
>>>>>> start pacemaker. Stopping pacemaker and corosync does not work in
>>>>>> most of the cases; usually a kill -9 is needed.
>>>>>>>
>>>>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
>>>>>>>
>>>>>>> Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
>>>>>>>
>>>>>>> Logs are usually flooded with CPG related messages, such as:
>>>>>>>
>>>>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>>>>>
>>>>>>> OR
>>>>>>>
>>>>>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
>>>>>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
>>>>>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
>>>>>>
>>>>>> That is usually a symptom of corosync getting into a horribly
>>>>>> confused
>>>> state.
>>>>>> Version? Distro? Have you checked for an update?
>>>>>> Odd that the user of all that CPU isn't showing up though.
>>>>>>
>>>>>>>
>>>>>
>>>>> As I wrote I use Ubuntu trusty, the exact package versions are:
>>>>>
>>>>> corosync 2.3.0-1ubuntu5
>>>>> pacemaker 1.1.10+git20130802-1ubuntu2
>>>>
>>>> Ah sorry, I seem to have missed that part.
>>>>
>>>>>
>>>>> There are no updates available. The only option is to install from
>>>>> sources,
>>>> but that would be very difficult to maintain and I'm not sure I would
>>>> get rid of this issue.
>>>>>
>>>>> What do you recommend?
>>>>
>>>> The same thing as Lars, or switch to a distro that stays current with
>>>> upstream (git shows 5 newer releases for that branch since it was
>>>> released 3 years ago).
>>>> If you do build from source, it's probably best to go with v1.4.6
>>>
>>> Hm, I am a bit confused here. We are using 2.3.0,
>>
>> I swapped the 2 for a 1 somehow. A bit distracted, sorry.
>
> I upgraded all nodes to 2.3.3 and first it seemed a bit better, but still the same issue - after some time CPU gets to 100%, and the corosync log is flooded with messages like:
>
> Mar 12 07:36:55 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
> Mar 12 07:36:55 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)
> Mar 12 07:36:56 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
> Mar 12 07:36:56 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)
> Mar 12 07:36:57 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
> Mar 12 07:36:57 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)
> Mar 12 07:36:57 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
>
>
Attila,
> Shall I try to downgrade to 1.4.6? What is the difference in that build? Or where should I start troubleshooting?
First of all, the 1.x branch (flatiron) is still maintained, so even
though it looks like an old version, it is quite current. It contains more
or less only bugfixes. The 2.x branch (needle) contains not only bugfixes
but also new features. Keep in mind that with 1.x you need to use cman as
the quorum provider (2.x contains quorum support in the base package).
There are no big differences in the build process.
But back to your original question: of course, troubleshooting is always
better.
The "Try again" error (6) happens when corosync is in the sync state. This
occurs when a NEW node is discovered or there is a network split/merge,
and it usually lasts only a few milliseconds. The problem you are hitting
is usually caused by some network issue.
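(As an aside on what those flooded log lines mean: crm_cs_flush keeps a
queue of outgoing CPG messages and requeues anything corosync refuses with
"Try again", so the same "N remaining" count repeats as long as corosync
stays stuck in sync. A rough, purely illustrative Python sketch of that
pattern, not the actual pacemaker source:)

```python
# Illustrative sketch of the crm_cs_flush retry pattern (not pacemaker code).
from collections import deque

CS_ERR_TRY_AGAIN = 6  # the "(6)" in "Try again (6)" in the logs

def flush(queue, send):
    """Drain the queue until a send reports 'Try again'."""
    sent = 0
    while queue:
        if send(queue[0]) == CS_ERR_TRY_AGAIN:
            # corosync is still in its sync state: keep the message queued
            # and report it as "remaining", like the log lines do.
            break
        queue.popleft()
        sent += 1
    return sent, len(queue)

queue = deque(["update-1", "update-2", "update-3"])
# While corosync keeps answering "Try again", nothing is ever sent:
print(flush(queue, lambda msg: CS_ERR_TRY_AGAIN))  # (0, 3)
```

A healthy cluster leaves the sync state in milliseconds and the queue
drains on the next flush; seeing the same "remaining" count for hours
means corosync never finished syncing.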
So first of all, take a look at corosync.log
(/var/log/cluster/corosync.log). Do you see any warnings or errors there?
What transport are you using? Multicast (udp) or unicast (udpu)?
Can you please paste your corosync.conf?
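(For comparison, a corosync 2.x udpu configuration with two rings in
rrp_mode passive, the setup described earlier in this thread, typically
has roughly this shape. The network addresses below are placeholders, not
values from any real cluster:)

```
totem {
    version: 2
    transport: udpu
    rrp_mode: passive

    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0    # placeholder: network of ring 0
    }
    interface {
        ringnumber: 1
        bindnetaddr: 10.0.0.0       # placeholder: network of ring 1
    }
}

nodelist {
    node {
        ring0_addr: 192.168.1.11    # placeholder; one node entry shown
        ring1_addr: 10.0.0.11
    }
}

quorum {
    provider: corosync_votequorum
}
```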
Regards,
Honza
>
> Thank you in advance.
>
>
>
>
>
>
>>
>>> which was released approx. a year ago (you mention 3 years) and you
>> recommend 1.4.6, which is a rather old version.
>>> Could you please clarify a bit? :)
>>> Lars recommends 2.3.3 git tree.
>>>
>>> I might end up trying both, but just want to make sure I am not
>> misunderstanding something badly.
>>>
>>> Thank you!
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>>
>>>>>
>>>>>
>>>>>>>
>>>>>>> HTOP show something like this (sorted by TIME+ descending):
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 1 [||||||||||||||||||||||||||||||||||||||||100.0%] Tasks: 59, 4 thr; 2 running
>>>>>>> 2 [| 0.7%] Load average: 1.00 0.99 1.02
>>>>>>> Mem[|||||||||||||||||||||||||||||||| 165/994MB] Uptime: 1 day, 10:22:03
>>>>>>> Swp[ 0/509MB]
>>>>>>>
>>>>>>> PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
>>>>>>> 921 root 20 0 188M 49220 33856 R 0.0 4.8 3h33:58 /usr/sbin/corosync
>>>>>>> 1277 snmp 20 0 45708 4248 1472 S 0.0 0.4 1:33.07 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
>>>>>>> 1311 hacluster 20 0 109M 16160 9640 S 0.0 1.6 1:12.71 /usr/lib/pacemaker/cib
>>>>>>> 1312 root 20 0 104M 7484 3780 S 0.0 0.7 0:38.06 /usr/lib/pacemaker/stonithd
>>>>>>> 1611 root -2 0 4408 2356 2000 S 0.0 0.2 0:24.15 /usr/sbin/watchdog
>>>>>>> 1316 hacluster 20 0 122M 9756 5924 S 0.0 1.0 0:22.62 /usr/lib/pacemaker/crmd
>>>>>>> 1313 root 20 0 81784 3800 2876 S 0.0 0.4 0:18.64 /usr/lib/pacemaker/lrmd
>>>>>>> 1314 hacluster 20 0 96616 4132 2604 S 0.0 0.4 0:16.01 /usr/lib/pacemaker/attrd
>>>>>>> 1309 root 20 0 104M 4804 2580 S 0.0 0.5 0:15.56 pacemakerd
>>>>>>> 1250 root 20 0 33000 1192 928 S 0.0 0.1 0:13.59 ha_logd: read process
>>>>>>> 1315 hacluster 20 0 73892 2652 1952 S 0.0 0.3 0:13.25 /usr/lib/pacemaker/pengine
>>>>>>> 1252 root 20 0 33000 712 456 S 0.0 0.1 0:13.03 ha_logd: write process
>>>>>>> 1835 ntp 20 0 27216 1980 1408 S 0.0 0.2 0:11.80 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:112
>>>>>>> 899 root 20 0 19168 700 488 S 0.0 0.1 0:09.75 /usr/sbin/irqbalance
>>>>>>> 1642 root 20 0 30696 1556 912 S 0.0 0.2 0:06.49 /usr/bin/monit -c /etc/monit/monitrc
>>>>>>> 4374 kamailio 20 0 291M 7272 2188 S 0.0 0.7 0:02.77 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>> 3079 root 0 -20 16864 4592 3508 S 0.0 0.5 0:01.51 /usr/bin/atop -a -w /var/log/atop/atop_20140306 6
>>>>>>> 445 syslog 20 0 249M 6276 976 S 0.0 0.6 0:01.16 rsyslogd
>>>>>>> 4373 kamailio 20 0 291M 7492 2396 S 0.0 0.7 0:01.03 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>> 1 root 20 0 33376 2632 1404 S 0.0 0.3 0:00.63 /sbin/init
>>>>>>> 453 syslog 20 0 249M 6276 976 S 0.0 0.6 0:00.63 rsyslogd
>>>>>>> 451 syslog 20 0 249M 6276 976 S 0.0 0.6 0:00.53 rsyslogd
>>>>>>> 4379 kamailio 20 0 291M 6224 1132 S 0.0 0.6 0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>> 4380 kamailio 20 0 291M 8516 3084 S 0.0 0.8 0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>> 4381 kamailio 20 0 291M 8252 2828 S 0.0 0.8 0:00.37 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>> 23315 root 20 0 24872 2476 1412 R 0.7 0.2 0:00.37 htop
>>>>>>> 4367 kamailio 20 0 291M 10000 4864 S 0.0 1.0 0:00.36 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>
>>>>>>>
>>>>>>> My questions:
>>>>>>> - Is this a corosync or pacemaker issue?
>>>>>>> - What are the CPG messages? Is it possible that we have a firewall
>>>> issue?
>>>>>>>
>>>>>>>
>>>>>>> Any hints would be great!
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Attila
>>>>>>> _______________________________________________
>>>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>
>>>>>>> Project Home: http://www.clusterlabs.org Getting started:
>>>>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>> Bugs: http://bugs.clusterlabs.org
>>>>>
>>>>>
>>>
>>>
>
>
>