[Pacemaker] Pacemaker/corosync freeze
Andrew Beekhof
andrew at beekhof.net
Tue Mar 11 21:27:08 UTC 2014
On 12 Mar 2014, at 1:54 am, Attila Megyeri <amegyeri at minerva-soft.com> wrote:
>>
>> -----Original Message-----
>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
>> Sent: Tuesday, March 11, 2014 12:48 AM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>>
>> On 7 Mar 2014, at 5:54 pm, Attila Megyeri <amegyeri at minerva-soft.com>
>> wrote:
>>
>>> Thanks for the quick response!
>>>
>>>> -----Original Message-----
>>>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
>>>> Sent: Friday, March 07, 2014 3:48 AM
>>>> To: The Pacemaker cluster resource manager
>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>
>>>>
>>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri <amegyeri at minerva-soft.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We have a strange issue with Corosync/Pacemaker.
>>>>> From time to time, something unexpected happens and suddenly the
>>>>> crm_mon output remains static.
>>>>> When I check the CPU usage, I see that one of the cores is at 100%,
>>>>> but I cannot actually match it to either corosync or one of the
>>>>> pacemaker processes.
>>>>>
>>>>> In such a case, this high CPU usage is happening on all 7 nodes.
>>>>> I have to manually go to each node, stop pacemaker, restart corosync,
>>>>> then start pacemaker. Stopping pacemaker and corosync does not work in
>>>>> most cases; usually a kill -9 is needed.
>>>>>
>>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
>>>>>
>>>>> Using udpu as transport, two rings on Gigabit ETH, rro_mode passive.
>>>>>
>>>>> Logs are usually flooded with CPG related messages, such as:
>>>>>
>>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>>>
>>>>> OR
>>>>>
>>>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
>>>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
>>>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
>>>>
>>>> That is usually a symptom of corosync getting into a horribly confused
>>>> state.
>>>> Version? Distro? Have you checked for an update?
>>>> Odd that the user of all that CPU isn't showing up though.
>>>>
>>>>>
>>>
>>> As I wrote, I use Ubuntu trusty; the exact package versions are:
>>>
>>> corosync 2.3.0-1ubuntu5
>>> pacemaker 1.1.10+git20130802-1ubuntu2
>>
>> Ah sorry, I seem to have missed that part.
>>
>>>
>>> There are no updates available. The only option is to install from
>>> sources, but that would be very difficult to maintain and I'm not sure
>>> I would get rid of this issue.
>>>
>>> What do you recommend?
>>
>> The same thing as Lars, or switch to a distro that stays current with upstream
>> (git shows 5 newer releases for that branch since it was released 3 years
>> ago).
>> If you do build from source, it's probably best to go with v1.4.6
>
> Hm, I am a bit confused here. We are using 2.3.0,
I swapped the 2 for a 1 somehow. A bit distracted, sorry.
> which was released approx. a year ago (you mention 3 years) and you recommend 1.4.6, which is a rather old version.
> Could you please clarify a bit? :)
> Lars recommends 2.3.3 git tree.
>
> I might end up trying both, but just want to make sure I am not misunderstanding something badly.
>
> Thank you!
>
>>
>>>
>>>
>>>>>
>>>>> htop shows something like this (sorted by TIME+ descending):
>>>>>
>>>>>
>>>>>
>>>>> 1 [||||||||||||||||||||||||||||||||||||||||100.0%] Tasks: 59, 4 thr; 2 running
>>>>> 2 [| 0.7%] Load average: 1.00 0.99 1.02
>>>>> Mem[|||||||||||||||||||||||||||||||| 165/994MB] Uptime: 1 day, 10:22:03
>>>>> Swp[ 0/509MB]
>>>>>
>>>>>   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%    TIME+ Command
>>>>>   921 root       20   0  188M 49220 33856 R  0.0  4.8  3h33:58 /usr/sbin/corosync
>>>>>  1277 snmp       20   0 45708  4248  1472 S  0.0  0.4  1:33.07 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
>>>>>  1311 hacluster  20   0  109M 16160  9640 S  0.0  1.6  1:12.71 /usr/lib/pacemaker/cib
>>>>>  1312 root       20   0  104M  7484  3780 S  0.0  0.7  0:38.06 /usr/lib/pacemaker/stonithd
>>>>>  1611 root       -2   0  4408  2356  2000 S  0.0  0.2  0:24.15 /usr/sbin/watchdog
>>>>>  1316 hacluster  20   0  122M  9756  5924 S  0.0  1.0  0:22.62 /usr/lib/pacemaker/crmd
>>>>>  1313 root       20   0 81784  3800  2876 S  0.0  0.4  0:18.64 /usr/lib/pacemaker/lrmd
>>>>>  1314 hacluster  20   0 96616  4132  2604 S  0.0  0.4  0:16.01 /usr/lib/pacemaker/attrd
>>>>>  1309 root       20   0  104M  4804  2580 S  0.0  0.5  0:15.56 pacemakerd
>>>>>  1250 root       20   0 33000  1192   928 S  0.0  0.1  0:13.59 ha_logd: read process
>>>>>  1315 hacluster  20   0 73892  2652  1952 S  0.0  0.3  0:13.25 /usr/lib/pacemaker/pengine
>>>>>  1252 root       20   0 33000   712   456 S  0.0  0.1  0:13.03 ha_logd: write process
>>>>>  1835 ntp        20   0 27216  1980  1408 S  0.0  0.2  0:11.80 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:112
>>>>>   899 root       20   0 19168   700   488 S  0.0  0.1  0:09.75 /usr/sbin/irqbalance
>>>>>  1642 root       20   0 30696  1556   912 S  0.0  0.2  0:06.49 /usr/bin/monit -c /etc/monit/monitrc
>>>>>  4374 kamailio   20   0  291M  7272  2188 S  0.0  0.7  0:02.77 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>  3079 root        0 -20 16864  4592  3508 S  0.0  0.5  0:01.51 /usr/bin/atop -a -w /var/log/atop/atop_20140306 6
>>>>>   445 syslog     20   0  249M  6276   976 S  0.0  0.6  0:01.16 rsyslogd
>>>>>  4373 kamailio   20   0  291M  7492  2396 S  0.0  0.7  0:01.03 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>     1 root       20   0 33376  2632  1404 S  0.0  0.3  0:00.63 /sbin/init
>>>>>   453 syslog     20   0  249M  6276   976 S  0.0  0.6  0:00.63 rsyslogd
>>>>>   451 syslog     20   0  249M  6276   976 S  0.0  0.6  0:00.53 rsyslogd
>>>>>  4379 kamailio   20   0  291M  6224  1132 S  0.0  0.6  0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>  4380 kamailio   20   0  291M  8516  3084 S  0.0  0.8  0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>  4381 kamailio   20   0  291M  8252  2828 S  0.0  0.8  0:00.37 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>> 23315 root       20   0 24872  2476  1412 R  0.7  0.2  0:00.37 htop
>>>>>  4367 kamailio   20   0  291M 10000  4864 S  0.0  1.0  0:00.36 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>
>>>>>
>>>>> My questions:
>>>>> - Is this a corosync or pacemaker issue?
>>>>> - What are the CPG messages? Is it possible that we have a firewall
>>>>> issue?
>>>>>
>>>>>
>>>>> Any hints would be great!
>>>>>
>>>>> Thanks,
>>>>> Attila
>>>>> _______________________________________________
>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>
>>>>> Project Home: http://www.clusterlabs.org Getting started:
>>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>> Bugs: http://bugs.clusterlabs.org
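
For anyone wanting to quantify the crm_cs_flush flood quoted earlier in this thread, here is a minimal Python sketch that counts, per host and daemon, the log lines where nothing was sent while messages were still queued. The regular expression is inferred only from the sample lines in this thread, not from Pacemaker's source, and the function name count_stalled_flushes is made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the log excerpts quoted in this thread (an assumption,
# not a documented Pacemaker log format). The trailing "Try again (6)" part is
# deliberately not matched, since some quoted lines are truncated there.
FLUSH_RE = re.compile(
    r"\[(?P<pid>\d+)\]\s+(?P<host>\S+)\s+(?P<daemon>\S+):\s+info:\s+"
    r"crm_cs_flush:\s+Sent\s+(?P<sent>\d+)\s+CPG\s+messages\s+"
    r"\((?P<remaining>\d+)\s+remaining,\s+last=(?P<last>\d+)\)"
)


def count_stalled_flushes(lines):
    """Count lines where 0 messages were sent but some are still queued.

    Hypothetical helper: returns a Counter keyed by (host, daemon).
    """
    stalled = Counter()
    for line in lines:
        m = FLUSH_RE.search(line)
        if m and int(m["sent"]) == 0 and int(m["remaining"]) > 0:
            stalled[(m["host"], m["daemon"])] += 1
    return stalled


# Sample lines copied from the excerpts above (unwrapped onto one line each).
sample = [
    "Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG "
    "messages (1 remaining, last=8): Try again (6)",
    "Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG "
    "messages (1 remaining, last=8): Try again (6)",
    "Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG "
    "messages (1 remaining, last=10933): Try again (",
    "Mar 06 17:46:25 [1341] ctdb1 cib: info: some unrelated message",
]

result = count_stalled_flushes(sample)
for (host, daemon), n in sorted(result.items()):
    print(f"{host} {daemon}: {n} stalled flush attempts")
```

To run it against a real log, pass an open file object instead of the sample list; the log path depends on the local logging configuration. A steadily growing count with an unchanged "last=" value would match the stuck state described above.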