[Pacemaker] Pacemaker/corosync freeze
Attila Megyeri
amegyeri at minerva-soft.com
Wed Mar 12 10:56:03 UTC 2014
Hello Jan,
Thank you very much for your help so far.
> -----Original Message-----
> From: Jan Friesse [mailto:jfriesse at redhat.com]
> Sent: Wednesday, March 12, 2014 9:51 AM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>
> Attila Megyeri wrote:
> >
> >> -----Original Message-----
> >> From: Andrew Beekhof [mailto:andrew at beekhof.net]
> >> Sent: Tuesday, March 11, 2014 10:27 PM
> >> To: The Pacemaker cluster resource manager
> >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>
> >>
> >> On 12 Mar 2014, at 1:54 am, Attila Megyeri
> >> <amegyeri at minerva-soft.com>
> >> wrote:
> >>
> >>>>
> >>>> -----Original Message-----
> >>>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
> >>>> Sent: Tuesday, March 11, 2014 12:48 AM
> >>>> To: The Pacemaker cluster resource manager
> >>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>>>
> >>>>
> >>>> On 7 Mar 2014, at 5:54 pm, Attila Megyeri
> >>>> <amegyeri at minerva-soft.com>
> >>>> wrote:
> >>>>
> >>>>> Thanks for the quick response!
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
> >>>>>> Sent: Friday, March 07, 2014 3:48 AM
> >>>>>> To: The Pacemaker cluster resource manager
> >>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>>>>>
> >>>>>>
> >>>>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri
> >>>>>> <amegyeri at minerva-soft.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> We have a strange issue with Corosync/Pacemaker.
> >>>>>>> From time to time, something unexpected happens and suddenly the
> >>>>>>> crm_mon output remains static.
> >>>>>>> When I check the cpu usage, I see that one of the cores uses
> >>>>>>> 100% cpu, but cannot actually match it to either corosync or one
> >>>>>>> of the pacemaker processes.
> >>>>>>>
> >>>>>>> In such a case, this high CPU usage is happening on all 7 nodes.
> >>>>>>> I have to manually go to each node, stop pacemaker, restart
> >>>>>>> corosync, then start pacemaker. Stopping pacemaker and corosync
> >>>>>>> does not work in most cases; usually a kill -9 is needed.
> >>>>>>>
> >>>>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
> >>>>>>>
> >>>>>>> Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
> >>>>>>>
> >>>>>>> Logs are usually flooded with CPG related messages, such as:
> >>>>>>>
> >>>>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
> >>>>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
> >>>>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
> >>>>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
> >>>>>>>
> >>>>>>> OR
> >>>>>>>
> >>>>>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (6)
> >>>>>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (6)
> >>>>>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (6)
> >>>>>>
> >>>>>> That is usually a symptom of corosync getting into a horribly
> >>>>>> confused
> >>>> state.
> >>>>>> Version? Distro? Have you checked for an update?
> >>>>>> Odd that the user of all that CPU isn't showing up though.
> >>>>>>
> >>>>>>>
> >>>>>
> >>>>> As I wrote, I use Ubuntu trusty; the exact package versions are:
> >>>>>
> >>>>> corosync 2.3.0-1ubuntu5
> >>>>> pacemaker 1.1.10+git20130802-1ubuntu2
> >>>>
> >>>> Ah sorry, I seem to have missed that part.
> >>>>
> >>>>>
> >>>>> There are no updates available. The only option is to install from
> >>>>> sources, but that would be very difficult to maintain, and I am not
> >>>>> sure it would get rid of this issue.
> >>>>>
> >>>>> What do you recommend?
> >>>>
> >>>> The same thing as Lars, or switch to a distro that stays current
> >>>> with upstream (git shows 5 newer releases for that branch since it
> >>>> was released 3 years ago).
> >>>> If you do build from source, it's probably best to go with v1.4.6.
> >>>
> >>> Hm, I am a bit confused here. We are using 2.3.0,
> >>
> >> I swapped the 2 for a 1 somehow. A bit distracted, sorry.
> >
> > I upgraded all nodes to 2.3.3; at first it seemed a bit better, but it is
> still the same issue - after some time the CPU goes to 100%, and the corosync
> log is flooded with messages like:
> >
> > Mar 12 07:36:55 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
> > Mar 12 07:36:55 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)
> > Mar 12 07:36:56 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
> > Mar 12 07:36:56 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)
> > Mar 12 07:36:57 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
> > Mar 12 07:36:57 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)
> > Mar 12 07:36:57 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
> >
> >
>
> Attila,
>
> > Shall I try to downgrade to 1.4.6? What is the difference in that build? Or
> where should I start troubleshooting?
>
> First of all, the 1.x branch (flatiron) is maintained, so even though it looks
> like an old version, it is quite current. It contains more or less only bugfixes.
>
OK - if troubleshooting does not bring us closer, the next thing I will try is downgrading to 1.4.6.
Actually, we have a couple of clusters running 1.4.2, but their stack is "openais", not corosync. Currently we use "corosync".
> 2.x branch (needle) contains not only bugfixes but also new features.
>
> Keep in mind that with 1.x you need to use cman as quorum provider (2.x
> contains quorum in base).
>
> There are no big differences in build.
>
> But back to your original question. Of course troubleshooting is always
> better.
>
> The "Try again" error (6) happens when corosync is in the sync state. This
> occurs when a NEW node is discovered or there is a network split/merge, and
> it usually lasts only a few milliseconds. The problem you are hitting is
> usually caused by some network issue.
I can confirm this. The 100% cpu issue happens when I restart one of the nodes. It seems to happen when a given node comes back up and a new membership is about to be formed.
>
> So first of all, take a look at corosync.log (/var/log/cluster/corosync.log).
> Do you see any warnings/errors there?
Not really. I reproduced a case so you can see for yourself.
Initially I had a stable cluster.
At 10:42:39 I did a reboot on the "ctsip1" node. All was fine until the node came back up (around 10:43:00). At this point, the cpu usage went to 100% and corosync stopped working properly.
here is the relevant corosync.log: http://pastebin.com/HJENEdZj
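To see whether the CPG backlog is actually growing (a wedged membership) or just bounded retries, the crm_cs_flush lines can be parsed out of corosync.log. The sketch below is only a hypothetical helper based on the exact log lines quoted in this thread, not on any guaranteed log format:

```python
import re

# Matches lines like the ones quoted above, e.g.:
#   Mar 12 07:36:55 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG
#   messages (48 remaining, last=3671): Try again (6)
PATTERN = re.compile(
    r"\[(?P<pid>\d+)\]\s+\S+\s+(?P<proc>\w+):.*crm_cs_flush:\s+"
    r"Sent (?P<sent>\d+) CPG messages \((?P<remaining>\d+) remaining"
)

def cpg_backlog(lines):
    """Return the last observed CPG backlog (messages remaining) per process."""
    backlog = {}
    for line in lines:
        m = PATTERN.search(line)
        if m:
            backlog[m.group("proc")] = int(m.group("remaining"))
    return backlog

sample = [
    "Mar 12 07:36:55 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)",
    "Mar 12 07:36:55 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)",
]
print(cpg_backlog(sample))  # -> {'cib': 48, 'crmd': 51}
```

If the "remaining" counts keep climbing over time, corosync never leaves the sync state; if they stay flat, it is stuck retrying the same queue.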
>
> What transport are you using? Multicast (udp) or unicast (udpu)?
>
> Can you please paste your corosync.conf?
We use udpu, since the servers are in different subnets and multicast did not work as expected. (In our other systems we use multicast).
The corosync.conf is at: http://pastebin.com/dMivQJn5
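For readers without access to the pastebin, a two-ring udpu setup of the kind described here would look roughly like the fragment below. This is a placeholder sketch; the addresses, ports, and node entries are assumptions, not taken from the actual configuration:

```
totem {
    version: 2
    transport: udpu
    rrp_mode: passive

    interface {
        ringnumber: 0
        bindnetaddr: 10.0.1.0
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 10.0.2.0
        mcastport: 5407
    }
}

nodelist {
    node {
        ring0_addr: 10.0.1.11
        ring1_addr: 10.0.2.11
    }
    node {
        ring0_addr: 10.0.1.12
        ring1_addr: 10.0.2.12
    }
}

quorum {
    provider: corosync_votequorum
}
```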
Thank you in advance,
Regards,
Attila
>
> Regards,
> Honza
>
> >
> > Thank you in advance.
> >
> >
> >
> >
> >
> >
> >>
> >>> which was released approx. a year ago (you mention 3 years) and you
> >> recommend 1.4.6, which is a rather old version.
> >>> Could you please clarify a bit? :)
> >>> Lars recommends 2.3.3 git tree.
> >>>
> >>> I might end up trying both, but just want to make sure I am not
> >> misunderstanding something badly.
> >>>
> >>> Thank you!
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>>
> >>>>>
> >>>>>
> >>>>>>>
> >>>>>>> HTOP shows something like this (sorted by TIME+ descending):
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> 1  [||||||||||||||||||||||||||||||||||||||||100.0%]  Tasks: 59, 4 thr; 2 running
> >>>>>>> 2  [|                                          0.7%]  Load average: 1.00 0.99 1.02
> >>>>>>> Mem[||||||||||||||||||||||||||||||||      165/994MB]  Uptime: 1 day, 10:22:03
> >>>>>>> Swp[                                        0/509MB]
> >>>>>>>
> >>>>>>> PID   USER      PRI NI VIRT  RES   SHR   S CPU% MEM% TIME+   Command
> >>>>>>> 921   root       20  0 188M  49220 33856 R 0.0  4.8 3h33:58 /usr/sbin/corosync
> >>>>>>> 1277  snmp       20  0 45708 4248  1472  S 0.0  0.4 1:33.07 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
> >>>>>>> 1311  hacluster  20  0 109M  16160 9640  S 0.0  1.6 1:12.71 /usr/lib/pacemaker/cib
> >>>>>>> 1312  root       20  0 104M  7484  3780  S 0.0  0.7 0:38.06 /usr/lib/pacemaker/stonithd
> >>>>>>> 1611  root       -2  0 4408  2356  2000  S 0.0  0.2 0:24.15 /usr/sbin/watchdog
> >>>>>>> 1316  hacluster  20  0 122M  9756  5924  S 0.0  1.0 0:22.62 /usr/lib/pacemaker/crmd
> >>>>>>> 1313  root       20  0 81784 3800  2876  S 0.0  0.4 0:18.64 /usr/lib/pacemaker/lrmd
> >>>>>>> 1314  hacluster  20  0 96616 4132  2604  S 0.0  0.4 0:16.01 /usr/lib/pacemaker/attrd
> >>>>>>> 1309  root       20  0 104M  4804  2580  S 0.0  0.5 0:15.56 pacemakerd
> >>>>>>> 1250  root       20  0 33000 1192  928   S 0.0  0.1 0:13.59 ha_logd: read process
> >>>>>>> 1315  hacluster  20  0 73892 2652  1952  S 0.0  0.3 0:13.25 /usr/lib/pacemaker/pengine
> >>>>>>> 1252  root       20  0 33000 712   456   S 0.0  0.1 0:13.03 ha_logd: write process
> >>>>>>> 1835  ntp        20  0 27216 1980  1408  S 0.0  0.2 0:11.80 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:112
> >>>>>>> 899   root       20  0 19168 700   488   S 0.0  0.1 0:09.75 /usr/sbin/irqbalance
> >>>>>>> 1642  root       20  0 30696 1556  912   S 0.0  0.2 0:06.49 /usr/bin/monit -c /etc/monit/monitrc
> >>>>>>> 4374  kamailio   20  0 291M  7272  2188  S 0.0  0.7 0:02.77 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> >>>>>>> 3079  root        0 -20 16864 4592  3508  S 0.0  0.5 0:01.51 /usr/bin/atop -a -w /var/log/atop/atop_20140306 6
> >>>>>>> 445   syslog     20  0 249M  6276  976   S 0.0  0.6 0:01.16 rsyslogd
> >>>>>>> 4373  kamailio   20  0 291M  7492  2396  S 0.0  0.7 0:01.03 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> >>>>>>> 1     root       20  0 33376 2632  1404  S 0.0  0.3 0:00.63 /sbin/init
> >>>>>>> 453   syslog     20  0 249M  6276  976   S 0.0  0.6 0:00.63 rsyslogd
> >>>>>>> 451   syslog     20  0 249M  6276  976   S 0.0  0.6 0:00.53 rsyslogd
> >>>>>>> 4379  kamailio   20  0 291M  6224  1132  S 0.0  0.6 0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> >>>>>>> 4380  kamailio   20  0 291M  8516  3084  S 0.0  0.8 0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> >>>>>>> 4381  kamailio   20  0 291M  8252  2828  S 0.0  0.8 0:00.37 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> >>>>>>> 23315 root       20  0 24872 2476  1412  R 0.7  0.2 0:00.37 htop
> >>>>>>> 4367  kamailio   20  0 291M  10000 4864  S 0.0  1.0 0:00.36 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> >>>>>>>
> >>>>>>>
> >>>>>>> My questions:
> >>>>>>> - Is this a corosync or pacemaker issue?
> >>>>>>> - What are the CPG messages? Is it possible that we have a firewall issue?
> >>>>>>>
> >>>>>>>
> >>>>>>> Any hints would be great!
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Attila
> >>>>>>> _______________________________________________
> >>>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> >>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>>>>>>
> >>>>>>> Project Home: http://www.clusterlabs.org Getting started:
> >>>>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>>>>>> Bugs: http://bugs.clusterlabs.org
> >>>>>
> >>>>>
> >>>
> >>>
> >
> >
> >
>
>