[Pacemaker] Pacemaker/corosync freeze

Wed Mar 12 08:51:18 UTC 2014

Attila Megyeri wrote:
> 
>> -----Original Message-----
>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
>> Sent: Tuesday, March 11, 2014 10:27 PM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>>
>> On 12 Mar 2014, at 1:54 am, Attila Megyeri <amegyeri at minerva-soft.com>
>> wrote:
>>
>>>>
>>>> -----Original Message-----
>>>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
>>>> Sent: Tuesday, March 11, 2014 12:48 AM
>>>> To: The Pacemaker cluster resource manager
>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>
>>>>
>>>> On 7 Mar 2014, at 5:54 pm, Attila Megyeri <amegyeri at minerva-soft.com>
>>>> wrote:
>>>>
>>>>> Thanks for the quick response!
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
>>>>>> Sent: Friday, March 07, 2014 3:48 AM
>>>>>> To: The Pacemaker cluster resource manager
>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>>>
>>>>>>
>>>>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri
>>>>>> <amegyeri at minerva-soft.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> We have a strange issue with Corosync/Pacemaker.
>>>>>>> From time to time, something unexpected happens and suddenly the
>>>>>>> crm_mon output remains static.
>>>>>>> When I check the CPU usage, I see that one of the cores uses 100%
>>>>>>> CPU, but I cannot actually match it to either corosync or one of
>>>>>>> the pacemaker processes.
>>>>>>>
>>>>>>> In such a case, this high CPU usage is happening on all 7 nodes.
>>>>>>> I have to manually go to each node, stop pacemaker, restart
>>>>>>> corosync, then start pacemaker. Stopping pacemaker and corosync
>>>>>>> does not work in most cases; usually a kill -9 is needed.
>>>>>>>
>>>>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
>>>>>>>
>>>>>>> Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
>>>>>>>
>>>>>>> Logs are usually flooded with CPG related messages, such as:
>>>>>>>
>>>>>>> Mar 06 18:10:49 [1316] ctsip1       crmd:     info: crm_cs_flush:       Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
>>>>>>> Mar 06 18:10:49 [1316] ctsip1       crmd:     info: crm_cs_flush:       Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
>>>>>>> Mar 06 18:10:50 [1316] ctsip1       crmd:     info: crm_cs_flush:       Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
>>>>>>> Mar 06 18:10:50 [1316] ctsip1       crmd:     info: crm_cs_flush:       Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
>>>>>>>
>>>>>>> OR
>>>>>>>
>>>>>>> Mar 06 17:46:24 [1341] ctdb1        cib:     info: crm_cs_flush:        Sent 0 CPG messages  (1 remaining, last=10933): Try again (
>>>>>>> Mar 06 17:46:24 [1341] ctdb1        cib:     info: crm_cs_flush:        Sent 0 CPG messages  (1 remaining, last=10933): Try again (
>>>>>>> Mar 06 17:46:24 [1341] ctdb1        cib:     info: crm_cs_flush:        Sent 0 CPG messages  (1 remaining, last=10933): Try again (
>>>>>>
>>>>>> That is usually a symptom of corosync getting into a horribly
>>>>>> confused state.
>>>>>> Version? Distro? Have you checked for an update?
>>>>>> Odd that the user of all that CPU isn't showing up though.
>>>>>>
>>>>>>>
>>>>>
>>>>> As I wrote, I use Ubuntu trusty; the exact package versions are:
>>>>>
>>>>> corosync 2.3.0-1ubuntu5
>>>>> pacemaker 1.1.10+git20130802-1ubuntu2
>>>>
>>>> Ah sorry, I seem to have missed that part.
>>>>
>>>>>
>>>>> There are no updates available. The only option is to install from
>>>>> sources, but that would be very difficult to maintain and I'm not
>>>>> sure I would get rid of this issue.
>>>>>
>>>>> What do you recommend?
>>>>
>>>> The same thing as Lars, or switch to a distro that stays current with
>>>> upstream (git shows 5 newer releases for that branch since it was
>>>> released 3 years ago).
>>>> If you do build from source, it's probably best to go with v1.4.6
>>>
>>> Hm, I am a bit confused here. We are using 2.3.0,
>>
>> I swapped the 2 for a 1 somehow. A bit distracted, sorry.
> 
> I upgraded all nodes to 2.3.3 and at first it seemed a bit better, but it is still the same issue: after some time CPU usage gets to 100%, and the corosync log is flooded with messages like:
> 
> Mar 12 07:36:55 [4793] ctdb2        cib:     info: crm_cs_flush:        Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
> Mar 12 07:36:55 [4798] ctdb2       crmd:     info: crm_cs_flush:        Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
> Mar 12 07:36:56 [4793] ctdb2        cib:     info: crm_cs_flush:        Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
> Mar 12 07:36:56 [4798] ctdb2       crmd:     info: crm_cs_flush:        Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
> Mar 12 07:36:57 [4793] ctdb2        cib:     info: crm_cs_flush:        Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
> Mar 12 07:36:57 [4798] ctdb2       crmd:     info: crm_cs_flush:        Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
> Mar 12 07:36:57 [4793] ctdb2        cib:     info: crm_cs_flush:        Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
> 
> 

Attila,

> Shall I try to downgrade to 1.4.6? What is the difference in that build? Or where should I start troubleshooting?

First of all, the 1.x branch (flatiron) is still maintained, so even
though it looks like an old version, it's actually quite new. It contains
more or less only bugfixes.

The 2.x branch (needle) contains not only bugfixes but also new features.

Keep in mind that with 1.x you need to use cman as the quorum provider
(2.x has quorum built in).

There are no big differences in the build.

But back to your original question. Of course, troubleshooting is always
the better option.

The "Try again" error (6) occurs while corosync is in the sync state.
Corosync enters that state when a new node is discovered or when there is
a network split/merge, and it usually lasts only a few milliseconds. The
problem you are hitting is usually caused by some network issue.
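
To tell a brief pass through the sync state from a cluster that is really
stuck in it, it may help to summarize the crm_cs_flush lines per daemon.
Below is a minimal Python sketch. The regular expression is only an
assumption based on the log lines quoted in this thread, and
summarize_flush_stalls is a made-up helper name, not part of pacemaker or
corosync.

```python
import re
from collections import defaultdict

# Pattern modelled on the crm_cs_flush lines quoted in this thread.
# It is an assumption about the log layout, not an official format.
FLUSH_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d) \[(?P<pid>\d+)\] +(?P<host>\S+) +"
    r"(?P<daemon>\w+): +info: +crm_cs_flush: +Sent (?P<sent>\d+) CPG messages +"
    r"\((?P<remaining>\d+) remaining, last=(?P<last>\d+)\): Try again \(6\)"
)


def summarize_flush_stalls(lines):
    """Group 'Sent 0 ... Try again (6)' lines by (host, daemon)."""
    stalls = defaultdict(list)
    for line in lines:
        m = FLUSH_RE.match(line.strip())
        if m and int(m.group("sent")) == 0:
            key = (m.group("host"), m.group("daemon"))
            stalls[key].append((m.group("ts"), int(m.group("remaining"))))

    summary = {}
    for key, entries in stalls.items():
        summary[key] = {
            "count": len(entries),              # number of stalled flush attempts
            "first": entries[0][0],             # timestamp of the first one seen
            "last": entries[-1][0],             # timestamp of the last one seen
            "max_remaining": max(r for _, r in entries),
        }
    return summary


# Two lines copied from the log excerpt earlier in this thread.
SAMPLE = [
    "Mar 12 07:36:55 [4793] ctdb2        cib:     info: crm_cs_flush:        Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)",
    "Mar 12 07:36:55 [4798] ctdb2       crmd:     info: crm_cs_flush:        Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)",
]

for (host, daemon), info in summarize_flush_stalls(SAMPLE).items():
    print(host, daemon, info)
```

If the first and last timestamps for a daemon are seconds or minutes apart
rather than milliseconds, and the remaining count never drops, that would
suggest corosync is not leaving the sync state at all.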

So first of all, take a look at corosync.log
(/var/log/cluster/corosync.log). Do you see any warnings or errors there?
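
Since the log is flooded with info messages, filtering it may be easier
than reading it. Here is a rough Python sketch that keeps only lines
mentioning a severity keyword. The keyword list and the helper names
(find_problem_lines, scan_log) are assumptions made for illustration, not
anything defined by corosync.

```python
import re
from collections import Counter

# Severity keywords to look for. This list is an assumption, not taken from
# the corosync documentation; adjust it to what your logs actually contain.
SEVERITY_RE = re.compile(
    r"\b(warning|warn|error|crit|alert|emerg)\b", re.IGNORECASE
)


def find_problem_lines(lines):
    """Return (severity, line) pairs for lines mentioning a severity keyword."""
    hits = []
    for line in lines:
        m = SEVERITY_RE.search(line)
        if m:
            hits.append((m.group(1).lower(), line.rstrip("\n")))
    return hits


def scan_log(path="/var/log/cluster/corosync.log"):
    """Print a per-severity count followed by the matching lines."""
    with open(path, errors="replace") as log:
        hits = find_problem_lines(log)
    for severity, count in Counter(s for s, _ in hits).most_common():
        print(f"{severity}: {count}")
    for _, line in hits:
        print(line)
```

Calling scan_log() on each node around the time the freeze starts should
show whether anything other than the crm_cs_flush info lines was logged.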

What transport are you using? Multicast (udp) or unicast (udpu)?

Can you please paste your corosync.conf?
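
If you would rather not paste the whole file, the settings I am most
interested in can be pulled out with a few lines of Python. This is only a
sketch: find_settings is a hypothetical helper, the parsing is a naive
"key: value" match that ignores section nesting, and the example fragment
is made up for illustration (it is not your actual configuration).

```python
import re

# Matches simple "key: value" lines; lines with braces only are ignored.
KEY_VALUE_RE = re.compile(r"^\s*([A-Za-z_]+)\s*:\s*(\S.*?)\s*$")


def find_settings(conf_text, keys=("transport", "rrp_mode")):
    """Return the first value seen for each requested key.

    Naive on purpose: it does not track which section a key belongs to.
    """
    found = {}
    for line in conf_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        m = KEY_VALUE_RE.match(line)
        if m and m.group(1) in keys:
            found.setdefault(m.group(1), m.group(2))
    return found


# Hypothetical fragment for illustration only; not the poster's real file.
EXAMPLE_CONF = """\
totem {
    version: 2
    transport: udpu
    rrp_mode: passive
}
"""

print(find_settings(EXAMPLE_CONF))
```

A key missing from the result simply means it was not set in the file, in
which case corosync uses its built-in default.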

Regards,
  Honza

> 
> Thank you in advance.
> 
>>
>>> which was released approx. a year ago (you mention 3 years) and you
>>> recommend 1.4.6, which is a rather old version.
>>> Could you please clarify a bit? :)
>>> Lars recommends 2.3.3 git tree.
>>>
>>> I might end up trying both, but just want to make sure I am not
>>> misunderstanding something badly.
>>>
>>> Thank you!
>>>
>>>>>>>
>>>>>>> htop shows something like this (sorted by TIME+ descending):
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 1  [||||||||||||||||||||||||||||||||||||||||100.0%]     Tasks: 59, 4 thr; 2 running
>>>>>>> 2  [|                                         0.7%]     Load average: 1.00 0.99 1.02
>>>>>>> Mem[||||||||||||||||||||||||||||||||     165/994MB]     Uptime: 1 day, 10:22:03
>>>>>>> Swp[                                       0/509MB]
>>>>>>>
>>>>>>>   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
>>>>>>>   921 root       20   0  188M 49220 33856 R  0.0  4.8  3h33:58 /usr/sbin/corosync
>>>>>>>  1277 snmp       20   0 45708  4248  1472 S  0.0  0.4  1:33.07 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
>>>>>>>  1311 hacluster  20   0  109M 16160  9640 S  0.0  1.6  1:12.71 /usr/lib/pacemaker/cib
>>>>>>>  1312 root       20   0  104M  7484  3780 S  0.0  0.7  0:38.06 /usr/lib/pacemaker/stonithd
>>>>>>>  1611 root       -2   0  4408  2356  2000 S  0.0  0.2  0:24.15 /usr/sbin/watchdog
>>>>>>>  1316 hacluster  20   0  122M  9756  5924 S  0.0  1.0  0:22.62 /usr/lib/pacemaker/crmd
>>>>>>>  1313 root       20   0 81784  3800  2876 S  0.0  0.4  0:18.64 /usr/lib/pacemaker/lrmd
>>>>>>>  1314 hacluster  20   0 96616  4132  2604 S  0.0  0.4  0:16.01 /usr/lib/pacemaker/attrd
>>>>>>>  1309 root       20   0  104M  4804  2580 S  0.0  0.5  0:15.56 pacemakerd
>>>>>>>  1250 root       20   0 33000  1192   928 S  0.0  0.1  0:13.59 ha_logd: read process
>>>>>>>  1315 hacluster  20   0 73892  2652  1952 S  0.0  0.3  0:13.25 /usr/lib/pacemaker/pengine
>>>>>>>  1252 root       20   0 33000   712   456 S  0.0  0.1  0:13.03 ha_logd: write process
>>>>>>>  1835 ntp        20   0 27216  1980  1408 S  0.0  0.2  0:11.80 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:112
>>>>>>>   899 root       20   0 19168   700   488 S  0.0  0.1  0:09.75 /usr/sbin/irqbalance
>>>>>>>  1642 root       20   0 30696  1556   912 S  0.0  0.2  0:06.49 /usr/bin/monit -c /etc/monit/monitrc
>>>>>>>  4374 kamailio   20   0  291M  7272  2188 S  0.0  0.7  0:02.77 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>  3079 root        0 -20 16864  4592  3508 S  0.0  0.5  0:01.51 /usr/bin/atop -a -w /var/log/atop/atop_20140306 6
>>>>>>>   445 syslog     20   0  249M  6276   976 S  0.0  0.6  0:01.16 rsyslogd
>>>>>>>  4373 kamailio   20   0  291M  7492  2396 S  0.0  0.7  0:01.03 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>     1 root       20   0 33376  2632  1404 S  0.0  0.3  0:00.63 /sbin/init
>>>>>>>   453 syslog     20   0  249M  6276   976 S  0.0  0.6  0:00.63 rsyslogd
>>>>>>>   451 syslog     20   0  249M  6276   976 S  0.0  0.6  0:00.53 rsyslogd
>>>>>>>  4379 kamailio   20   0  291M  6224  1132 S  0.0  0.6  0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>  4380 kamailio   20   0  291M  8516  3084 S  0.0  0.8  0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>  4381 kamailio   20   0  291M  8252  2828 S  0.0  0.8  0:00.37 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>> 23315 root       20   0 24872  2476  1412 R  0.7  0.2  0:00.37 htop
>>>>>>>  4367 kamailio   20   0  291M 10000  4864 S  0.0  1.0  0:00.36 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>
>>>>>>>
>>>>>>> My questions:
>>>>>>> -   Is this a corosync or pacemaker issue?
>>>>>>> -   What are the CPG messages? Is it possible that we have a
>>>>>>>     firewall issue?
>>>>>>>
>>>>>>>
>>>>>>> Any hints would be great!
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Attila
>>>>>>> _______________________________________________
>>>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>
>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>> Bugs: http://bugs.clusterlabs.org