[Pacemaker] Pacemaker/corosync freeze

Wed Mar 12 10:23:08 EDT 2014

> -----Original Message-----
> From: Jan Friesse [mailto:jfriesse at redhat.com]
> Sent: Wednesday, March 12, 2014 2:27 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>
> Attila Megyeri napsal(a):
> > Hello Jan,
> >
> > Thank you very much for your help so far.
> >
> >> -----Original Message-----
> >> From: Jan Friesse [mailto:jfriesse at redhat.com]
> >> Sent: Wednesday, March 12, 2014 9:51 AM
> >> To: The Pacemaker cluster resource manager
> >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>
> >> Attila Megyeri napsal(a):
> >>>
> >>>> -----Original Message-----
> >>>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
> >>>> Sent: Tuesday, March 11, 2014 10:27 PM
> >>>> To: The Pacemaker cluster resource manager
> >>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>>>
> >>>>
> >>>> On 12 Mar 2014, at 1:54 am, Attila Megyeri
> >>>> <amegyeri at minerva-soft.com>
> >>>> wrote:
> >>>>
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
> >>>>>> Sent: Tuesday, March 11, 2014 12:48 AM
> >>>>>> To: The Pacemaker cluster resource manager
> >>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>>>>>
> >>>>>>
> >>>>>> On 7 Mar 2014, at 5:54 pm, Attila Megyeri
> >>>>>> <amegyeri at minerva-soft.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Thanks for the quick response!
> >>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
> >>>>>>>> Sent: Friday, March 07, 2014 3:48 AM
> >>>>>>>> To: The Pacemaker cluster resource manager
> >>>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri
> >>>>>>>> <amegyeri at minerva-soft.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hello,
> >>>>>>>>>
> >>>>>>>>> We have a strange issue with Corosync/Pacemaker.
> >>>>>>>>> From time to time, something unexpected happens and
> suddenly
> >> the
> >>>>>>>> crm_mon output remains static.
> >>>>>>>>> When I check the cpu usage, I see that one of the cores uses
> >>>>>>>>> 100% cpu, but
> >>>>>>>> cannot actually match it to either the corosync or one of the
> >>>>>>>> pacemaker processes.
> >>>>>>>>>
> >>>>>>>>> In such a case, this high CPU usage is happening on all 7 nodes.
> >>>>>>>>> I have to manually go to each node, stop pacemaker, restart
> >>>>>>>>> corosync, then
> >>>>>>>> start pacemeker. Stoping pacemaker and corosync does not work
> >>>>>>>> in most of the cases, usually a kill -9 is needed.
> >>>>>>>>>
> >>>>>>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
> >>>>>>>>>
> >>>>>>>>> Using udpu as transport, two rings on Gigabit ETH, rro_mode
> >> passive.
> >>>>>>>>>
> >>>>>>>>> Logs are usually flooded with CPG related messages, such as:
> >>>>>>>>>
> >>>>>>>>> Mar 06 18:10:49 [1316] ctsip1       crmd:     info: crm_cs_flush:
> Sent
> >> 0
> >>>>>> CPG
> >>>>>>>> messages  (1 remaining, last=8): Try again (6)
> >>>>>>>>> Mar 06 18:10:49 [1316] ctsip1       crmd:     info: crm_cs_flush:
> Sent
> >> 0
> >>>>>> CPG
> >>>>>>>> messages  (1 remaining, last=8): Try again (6)
> >>>>>>>>> Mar 06 18:10:50 [1316] ctsip1       crmd:     info: crm_cs_flush:
> Sent
> >> 0
> >>>>>> CPG
> >>>>>>>> messages  (1 remaining, last=8): Try again (6)
> >>>>>>>>> Mar 06 18:10:50 [1316] ctsip1       crmd:     info: crm_cs_flush:
> Sent
> >> 0
> >>>>>> CPG
> >>>>>>>> messages  (1 remaining, last=8): Try again (6)
> >>>>>>>>>
> >>>>>>>>> OR
> >>>>>>>>>
> >>>>>>>>> Mar 06 17:46:24 [1341] ctdb1        cib:     info: crm_cs_flush:
> Sent 0
> >>>> CPG
> >>>>>>>> messages  (1 remaining, last=10933): Try again (
> >>>>>>>>> Mar 06 17:46:24 [1341] ctdb1        cib:     info: crm_cs_flush:
> Sent 0
> >>>> CPG
> >>>>>>>> messages  (1 remaining, last=10933): Try again (
> >>>>>>>>> Mar 06 17:46:24 [1341] ctdb1        cib:     info: crm_cs_flush:
> Sent 0
> >>>> CPG
> >>>>>>>> messages  (1 remaining, last=10933): Try again (
> >>>>>>>>
> >>>>>>>> That is usually a symptom of corosync getting into a horribly
> >>>>>>>> confused
> >>>>>> state.
> >>>>>>>> Version? Distro? Have you checked for an update?
> >>>>>>>> Odd that the user of all that CPU isn't showing up though.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>> As I wrote I use Ubuntu trusty, the exact package versions are:
> >>>>>>>
> >>>>>>> corosync 2.3.0-1ubuntu5
> >>>>>>> pacemaker 1.1.10+git20130802-1ubuntu2
> >>>>>>
> >>>>>> Ah sorry, I seem to have missed that part.
> >>>>>>
> >>>>>>>
> >>>>>>> There are no updates available. The only option is to install
> >>>>>>> from sources,
> >>>>>> but that would be very difficult to maintain and I'm not sure I
> >>>>>> would get rid of this issue.
> >>>>>>>
> >>>>>>> What do you recommend?
> >>>>>>
> >>>>>> The same thing as Lars, or switch to a distro that stays current
> >>>>>> with upstream (git shows 5 newer releases for that branch since
> >>>>>> it was released 3 years ago).
> >>>>>> If you do build from source, its probably best to go with v1.4.6
> >>>>>
> >>>>> Hm, I am a bit confused here. We are using 2.3.0,
> >>>>
> >>>> I swapped the 2 for a 1 somehow. A bit distracted, sorry.
> >>>
> >>> I upgraded all nodes to 2.3.3 and first it seemed a bit better, but
> >>> still the
> >> same issue - after some time CPU gets to 100%, and the corosync log
> >> is flooded with messages like:
> >>>
> >>> Mar 12 07:36:55 [4793] ctdb2        cib:     info: crm_cs_flush:        Sent 0 CPG
> >> messages  (48 remaining, last=3671): Try again (6)
> >>> Mar 12 07:36:55 [4798] ctdb2       crmd:     info: crm_cs_flush:        Sent 0
> CPG
> >> messages  (51 remaining, last=3995): Try again (6)
> >>> Mar 12 07:36:56 [4793] ctdb2        cib:     info: crm_cs_flush:        Sent 0 CPG
> >> messages  (48 remaining, last=3671): Try again (6)
> >>> Mar 12 07:36:56 [4798] ctdb2       crmd:     info: crm_cs_flush:        Sent 0
> CPG
> >> messages  (51 remaining, last=3995): Try again (6)
> >>> Mar 12 07:36:57 [4793] ctdb2        cib:     info: crm_cs_flush:        Sent 0 CPG
> >> messages  (48 remaining, last=3671): Try again (6)
> >>> Mar 12 07:36:57 [4798] ctdb2       crmd:     info: crm_cs_flush:        Sent 0
> CPG
> >> messages  (51 remaining, last=3995): Try again (6)
> >>> Mar 12 07:36:57 [4793] ctdb2        cib:     info: crm_cs_flush:        Sent 0 CPG
> >> messages  (48 remaining, last=3671): Try again (6)
> >>>
> >>>
> >>
> >> Attila,
> >>
> >>> Shall I try to downgrade to 1.4.6? What is the difference in that
> >>> build? Or
> >> where should I start troubleshooting?
> >>
> >> First of all, 1.x branch (flatiron) is maintained so even it looks
> >> like a old version, it's quite a new. It contains more or less only bugfixes.
> >>
> >
> > OK - The next thing I will try will be to downgrade to 1.4.6 if the
> troubleshooting does not bring us closer.
> > Actually we have a couple of clusters running 1.4.2, but stack is "openais"
> not corosync. Currently we use "corosync".
> >
> >
> >> 2.x branch (needle) contains not only bugfixes but also new features.
> >>
> >> Keep in mind that with 1.x you need to use cman as quorum provider
> >> (2.x contains quorum in base).
> >>
> >> There are no big differences in build.
> >>
> >> But back to your original question. Of course troubleshooting is
> >> always better.
> >>
> >> Try again error (6) is happening when corosync is in sync state. This
> >> is happening when NEW node is discovered, there is network
> >> split/merge and usually takes only few milliseconds. Usually problem
> >> you are hitting is caused by some network issue.
> >
> > I can confirm this. The 100% cpu issue happens when I restart one of the
> nodes. It seems that it is happening when a given node comes backup up and
> a new membership is about to be formed.
> >
> >
> >>
> >> So first of all take a look to corosync.log
> >> (/var/log/cluster/corosync.log). Do you see some warning/error there?
> >
> > Not really. I reproduced a case so you can see for yourself.
> > Initially I had a stable cluster.
> > At 10:42:39 I did a reboot on the "ctsip1" node. All was fine until the node
> came back up (around 10:43:00). At this point, the cpu usage went to 100%
> and corosync stopped working properly.
> >
> > here is the relevant corosync.log:  http://pastebin.com/HJENEdZj
> >
>
> Is that log file somehow continue? I mean, interesting is:
> Mar 12 10:43:00 [973] ctdb2 corosync notice  [TOTEM ] A new membership
> (10.9.1.3:1592) was formed. Members joined: 168362281

That was all what I sent. From 43:00 the corosync process appears to be irresponsive. One of the cores was at 100% cpu, but from htop, top or similar apps it is impossible to gues which app it is. Of course, killing corosync with -9 lowers the cpu usage.

>
> What means new membership and sync is now running but there is no
> counterpart (which looks like:
>
> Mar 12 10:42:40 [973] ctdb2 corosync notice  [MAIN  ] Completed service
> synchronization, ready to provide service. ).

>
> Can you please try remove rrp and use pure srp?

If I understand correctly, you are asking me to remove the redundant link (the second interface block from the .conf) completely, right?

>
> Also can you please try to set debug: on in corosync.conf and paste full
> corosync.log then?

I set debug to on, and did a few restarts but could not reproduce the issue yet - will post the logs as soon as I manage to reproduce.

There are also a few things that might or might not be related:

1) Whenever I want to edit the configuration with "crm configure edit", upon save I get a similar error:
        " ERROR: 47: duplicate element cib-bootstrap-options
        Do you want to edit again? "
        But there is no such duplicate elment as far as I can tell. This might be a crmsh issue, and not related to corosync at all, just mentioning.

2)
        "Mar 11 21:31:11 [4797] ctdb2    pengine:    error: process_pe_message:  Calculated Transition 27: /var/lib/pacemaker/pengine/pe-error-7.bz2
        Mar 11 21:31:11 [4797] ctdb2    pengine:   notice: process_pe_message:  Configuration ERRORs found during PE processing.  Please run "crm_verify -L" to         identify issues."

        But crm_veryfy -L shows no problems at all...

>
> Regards,
>   Honza
>
> >
> >
> >>
> >> What transport are you using? Multicast (udp) or unicast (udpu)?
> >>
> >> Can you please paste your corosync.conf?
> >
> > We use udpu, since the servers are in different subnets and multicast did
> not work as expected. (In our other systems we use multicast).
> >
> > The corosync.conf is at: http://pastebin.com/dMivQJn5
> >
> >
> > Thank you in advance,
> >
> > Regards,
> > Attila
> >
> >
> >>
> >> Regards,
> >>   Honza
> >>
> >>>
> >>> Thank you in advance.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>>
> >>>>> which was released approx. a year ago (you mention 3 years) and
> >>>>> you
> >>>> recommend 1.4.6, which is a rather old version.
> >>>>> Could you please clarify a bit? :) Lars recommends 2.3.3 git tree.
> >>>>>
> >>>>> I might end up trying both, but just want to make sure I am not
> >>>> misunderstanding something badly.
> >>>>>
> >>>>> Thank you!
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>>>
> >>>>>>>>> HTOP show something like this (sorted by TIME+ descending):
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> 1  [||||||||||||||||||||||||||||||||||||||||100.0%]
> Tasks:
> >> 59,
> >>>> 4
> >>>>>>>> thr; 2 running
> >>>>>>>>> 2  [|                                         0.7%]     Load average: 1.00 0.99 1.02
> >>>>>>>>> Mem[||||||||||||||||||||||||||||||||     165/994MB]
> >> Uptime: 1
> >>>>>>>> day, 10:22:03
> >>>>>>>>> Swp[                                       0/509MB]
> >>>>>>>>>
> >>>>>>>>> PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+
> >> Command
> >>>>>>>>> 921 root       20   0  188M 49220 33856 R  0.0  4.8  3h33:58
> >>>>>> /usr/sbin/corosync
> >>>>>>>>> 1277 snmp       20   0 45708  4248  1472 S  0.0  0.4  1:33.07
> >>>> /usr/sbin/snmpd
> >>>>>> -
> >>>>>>>> Lsd -Lf /dev/null -u snmp -g snm
> >>>>>>>>> 1311 hacluster  20   0  109M 16160  9640 S  0.0  1.6  1:12.71
> >>>>>>>> /usr/lib/pacemaker/cib
> >>>>>>>>> 1312 root       20   0  104M  7484  3780 S  0.0  0.7  0:38.06
> >>>>>>>> /usr/lib/pacemaker/stonithd
> >>>>>>>>> 1611 root       -2   0  4408  2356  2000 S  0.0  0.2  0:24.15
> >>>> /usr/sbin/watchdog
> >>>>>>>>> 1316 hacluster  20   0  122M  9756  5924 S  0.0  1.0  0:22.62
> >>>>>>>> /usr/lib/pacemaker/crmd
> >>>>>>>>> 1313 root       20   0 81784  3800  2876 S  0.0  0.4  0:18.64
> >>>>>>>> /usr/lib/pacemaker/lrmd
> >>>>>>>>> 1314 hacluster  20   0 96616  4132  2604 S  0.0  0.4  0:16.01
> >>>>>>>> /usr/lib/pacemaker/attrd
> >>>>>>>>> 1309 root       20   0  104M  4804  2580 S  0.0  0.5  0:15.56
> pacemakerd
> >>>>>>>>> 1250 root       20   0 33000  1192   928 S  0.0  0.1  0:13.59 ha_logd:
> read
> >>>>>> process
> >>>>>>>>> 1315 hacluster  20   0 73892  2652  1952 S  0.0  0.3  0:13.25
> >>>>>>>> /usr/lib/pacemaker/pengine
> >>>>>>>>> 1252 root       20   0 33000   712   456 S  0.0  0.1  0:13.03 ha_logd:
> write
> >>>>>> process
> >>>>>>>>> 1835 ntp        20   0 27216  1980  1408 S  0.0  0.2  0:11.80
> >> /usr/sbin/ntpd -
> >>>> p
> >>>>>>>> /var/run/ntpd.pid -g -u 105:112
> >>>>>>>>> 899 root       20   0 19168   700   488 S  0.0  0.1  0:09.75
> >>>> /usr/sbin/irqbalance
> >>>>>>>>> 1642 root       20   0 30696  1556   912 S  0.0  0.2  0:06.49
> >> /usr/bin/monit -c
> >>>>>>>> /etc/monit/monitrc
> >>>>>>>>> 4374 kamailio   20   0  291M  7272  2188 S  0.0  0.7  0:02.77
> >>>>>>>> /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> >>>>>>>>> 3079 root        0 -20 16864  4592  3508 S  0.0  0.5  0:01.51
> /usr/bin/atop
> >> -a
> >>>> -
> >>>>>> w
> >>>>>>>> /var/log/atop/atop_20140306 6
> >>>>>>>>> 445 syslog     20   0  249M  6276   976 S  0.0  0.6  0:01.16 rsyslogd
> >>>>>>>>> 4373 kamailio   20   0  291M  7492  2396 S  0.0  0.7  0:01.03
> >>>>>>>> /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> >>>>>>>>>   1 root       20   0 33376  2632  1404 S  0.0  0.3  0:00.63 /sbin/init
> >>>>>>>>> 453 syslog     20   0  249M  6276   976 S  0.0  0.6  0:00.63 rsyslogd
> >>>>>>>>> 451 syslog     20   0  249M  6276   976 S  0.0  0.6  0:00.53 rsyslogd
> >>>>>>>>> 4379 kamailio   20   0  291M  6224  1132 S  0.0  0.6  0:00.38
> >>>>>>>> /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> >>>>>>>>> 4380 kamailio   20   0  291M  8516  3084 S  0.0  0.8  0:00.38
> >>>>>>>> /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> >>>>>>>>> 4381 kamailio   20   0  291M  8252  2828 S  0.0  0.8  0:00.37
> >>>>>>>> /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> >>>>>>>>> 23315 root       20   0 24872  2476  1412 R  0.7  0.2  0:00.37 htop
> >>>>>>>>> 4367 kamailio   20   0  291M 10000  4864 S  0.0  1.0  0:00.36
> >>>>>>>> /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> My questions:
> >>>>>>>>> -   Is this a cororync or pacameker issue?
> >>>>>>>>> -   What are the CPG messages? Is it possible that we have a
> firewall
> >>>>>> issue?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Any hints would be great!
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Attila
> >>>>>>>>> _______________________________________________
> >>>>>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> >>>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>>>>>>>>
> >>>>>>>>> Project Home: http://www.clusterlabs.org Getting started:
> >>>>>>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>>>>>>>> Bugs: http://bugs.clusterlabs.org
> >>>>>>>
> >>>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> >>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>>>>>>
> >>>>>>> Project Home: http://www.clusterlabs.org Getting started:
> >>>>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>>>>>> Bugs: http://bugs.clusterlabs.org
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> >>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>>>>
> >>>>> Project Home: http://www.clusterlabs.org Getting started:
> >>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>>>> Bugs: http://bugs.clusterlabs.org
> >>>
> >>>
> >>> _______________________________________________
> >>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> >>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>>
> >>> Project Home: http://www.clusterlabs.org Getting started:
> >>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>> Bugs: http://bugs.clusterlabs.org
> >>>
> >>
> >>
> >> _______________________________________________
> >> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> >> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>
> >> Project Home: http://www.clusterlabs.org Getting started:
> >> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >> Bugs: http://bugs.clusterlabs.org
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org Getting started:
> > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org