[Pacemaker] Pacemaker/corosync freeze

Wed Mar 12 03:45:47 EDT 2014

> -----Original Message-----
> From: Andrew Beekhof [mailto:andrew at beekhof.net]
> Sent: Tuesday, March 11, 2014 10:27 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
> 
> On 12 Mar 2014, at 1:54 am, Attila Megyeri <amegyeri at minerva-soft.com>
> wrote:
> 
> >>
> >> -----Original Message-----
> >> From: Andrew Beekhof [mailto:andrew at beekhof.net]
> >> Sent: Tuesday, March 11, 2014 12:48 AM
> >> To: The Pacemaker cluster resource manager
> >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>
> >>
> >> On 7 Mar 2014, at 5:54 pm, Attila Megyeri <amegyeri at minerva-soft.com>
> >> wrote:
> >>
> >>> Thanks for the quick response!
> >>>
> >>>> -----Original Message-----
> >>>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
> >>>> Sent: Friday, March 07, 2014 3:48 AM
> >>>> To: The Pacemaker cluster resource manager
> >>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>>>
> >>>>
> >>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri
> >>>> <amegyeri at minerva-soft.com>
> >>>> wrote:
> >>>>
> >>>>> Hello,
> >>>>>
> >>>>> We have a strange issue with Corosync/Pacemaker.
> >>>>> From time to time, something unexpected happens and suddenly the
> >>>>> crm_mon output stops updating.
> >>>>> When I check the CPU usage, I see that one of the cores is at 100%,
> >>>>> but I cannot match it to either corosync or any of the pacemaker
> >>>>> processes.
> >>>>>
> >>>>> When this happens, the high CPU usage occurs on all 7 nodes.
> >>>>> I have to go to each node manually, stop pacemaker, restart
> >>>>> corosync, then start pacemaker. Stopping pacemaker and corosync
> >>>>> does not work in most cases; usually a kill -9 is needed.
> >>>>>
> >>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
> >>>>>
> >>>>> Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
> >>>>>
> >>>>> Logs are usually flooded with CPG related messages, such as:
> >>>>>
> >>>>> Mar 06 18:10:49 [1316] ctsip1       crmd:     info: crm_cs_flush:       Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> >>>>> Mar 06 18:10:49 [1316] ctsip1       crmd:     info: crm_cs_flush:       Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> >>>>> Mar 06 18:10:50 [1316] ctsip1       crmd:     info: crm_cs_flush:       Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> >>>>> Mar 06 18:10:50 [1316] ctsip1       crmd:     info: crm_cs_flush:       Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> >>>>>
> >>>>> OR
> >>>>>
> >>>>> Mar 06 17:46:24 [1341] ctdb1        cib:     info: crm_cs_flush:        Sent 0 CPG messages  (1 remaining, last=10933): Try again (
> >>>>> Mar 06 17:46:24 [1341] ctdb1        cib:     info: crm_cs_flush:        Sent 0 CPG messages  (1 remaining, last=10933): Try again (
> >>>>> Mar 06 17:46:24 [1341] ctdb1        cib:     info: crm_cs_flush:        Sent 0 CPG messages  (1 remaining, last=10933): Try again (
> >>>>
> >>>> That is usually a symptom of corosync getting into a horribly
> >>>> confused state.
> >>>> Version? Distro? Have you checked for an update?
> >>>> Odd that the user of all that CPU isn't showing up though.
> >>>>
> >>>>>
> >>>
> >>> As I wrote, I use Ubuntu trusty; the exact package versions are:
> >>>
> >>> corosync 2.3.0-1ubuntu5
> >>> pacemaker 1.1.10+git20130802-1ubuntu2
> >>
> >> Ah sorry, I seem to have missed that part.
> >>
> >>>
> >>> There are no updates available. The only option is to install from
> >>> source, but that would be very difficult to maintain and I'm not sure
> >>> I would get rid of this issue.
> >>>
> >>> What do you recommend?
> >>
> >> The same thing as Lars, or switch to a distro that stays current with
> >> upstream (git shows 5 newer releases for that branch since it was
> >> released 3 years ago).
> >> If you do build from source, it's probably best to go with v1.4.6
> >
> > Hm, I am a bit confused here. We are using 2.3.0,
> 
> I swapped the 2 for a 1 somehow. A bit distracted, sorry.

I upgraded all nodes to 2.3.3. At first it seemed a bit better, but the same issue remains: after some time CPU usage reaches 100%, and the corosync log is flooded with messages like:

Mar 12 07:36:55 [4793] ctdb2        cib:     info: crm_cs_flush:        Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
Mar 12 07:36:55 [4798] ctdb2       crmd:     info: crm_cs_flush:        Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
Mar 12 07:36:56 [4793] ctdb2        cib:     info: crm_cs_flush:        Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
Mar 12 07:36:56 [4798] ctdb2       crmd:     info: crm_cs_flush:        Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
Mar 12 07:36:57 [4793] ctdb2        cib:     info: crm_cs_flush:        Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
Mar 12 07:36:57 [4798] ctdb2       crmd:     info: crm_cs_flush:        Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
Mar 12 07:36:57 [4793] ctdb2        cib:     info: crm_cs_flush:        Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
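
As a starting point for narrowing this down, a small script could summarize these lines per node and daemon. This is only a sketch, not anything shipped with Pacemaker or corosync: the function name summarize_flush and the regular expression are assumptions derived from the log format shown above, and the sample input is copied from the first four lines of that excerpt. A queue whose "last" ID never changes while every line reports "Sent 0" would suggest that the daemon is not getting any CPG messages through at all.

```python
import re
from collections import defaultdict

# Assumed pattern, based only on the crm_cs_flush lines quoted above.
FLUSH_RE = re.compile(
    r"\[(?P<pid>\d+)\]\s+(?P<node>\S+)\s+(?P<daemon>\w+):\s+info:\s+crm_cs_flush:\s+"
    r"Sent (?P<sent>\d+) CPG messages\s+"
    r"\((?P<remaining>\d+) remaining, last=(?P<last>\d+)\)"
)

# Sample input: the first four lines of the log excerpt above.
SAMPLE = """\
Mar 12 07:36:55 [4793] ctdb2        cib:     info: crm_cs_flush:        Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
Mar 12 07:36:55 [4798] ctdb2       crmd:     info: crm_cs_flush:        Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
Mar 12 07:36:56 [4793] ctdb2        cib:     info: crm_cs_flush:        Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
Mar 12 07:36:56 [4798] ctdb2       crmd:     info: crm_cs_flush:        Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
"""


def summarize_flush(lines):
    """Group crm_cs_flush lines by (node, daemon) and collect simple counters."""
    stats = defaultdict(
        lambda: {"lines": 0, "stalled": 0, "max_remaining": 0, "last_ids": set()}
    )
    for line in lines:
        m = FLUSH_RE.search(line)
        if m is None:
            continue  # not a crm_cs_flush line
        entry = stats[(m.group("node"), m.group("daemon"))]
        entry["lines"] += 1
        if int(m.group("sent")) == 0:
            entry["stalled"] += 1  # nothing was sent on this attempt
        entry["max_remaining"] = max(entry["max_remaining"], int(m.group("remaining")))
        entry["last_ids"].add(int(m.group("last")))
    return dict(stats)


if __name__ == "__main__":
    for (node, daemon), entry in sorted(summarize_flush(SAMPLE.splitlines()).items()):
        # A single unchanging "last" ID with only stalled attempts hints at a stuck queue.
        stuck = entry["stalled"] == entry["lines"] and len(entry["last_ids"]) == 1
        print(node, daemon, entry["lines"], entry["max_remaining"], "stuck" if stuck else "moving")
```

To run it against a real log, the lines of the log file can be passed in place of SAMPLE.splitlines(); comparing the result across all nodes may show whether every node stalls at the same time.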

Shall I try to downgrade to 1.4.6? What is the difference in that build? Or where should I start troubleshooting?

Thank you in advance.

> 
> > which was released approx. a year ago (you mention 3 years) and you
> > recommend 1.4.6, which is a rather old version.
> > Could you please clarify a bit? :)
> > Lars recommends 2.3.3 git tree.
> >
> > I might end up trying both, but just want to make sure I am not
> > misunderstanding something badly.
> >
> > Thank you!
> >
> >>
> >>>
> >>>
> >>>>>
> >>>>> htop shows something like this (sorted by TIME+, descending):
> >>>>>
> >>>>>
> >>>>>
> >>>>> 1  [||||||||||||||||||||||||||||||||||||||||100.0%]     Tasks: 59, 4 thr; 2 running
> >>>>> 2  [|                                         0.7%]     Load average: 1.00 0.99 1.02
> >>>>> Mem[||||||||||||||||||||||||||||||||     165/994MB]     Uptime: 1 day, 10:22:03
> >>>>> Swp[                                       0/509MB]
> >>>>>
> >>>>>   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
> >>>>>   921 root       20   0  188M 49220 33856 R  0.0  4.8  3h33:58 /usr/sbin/corosync
> >>>>>  1277 snmp       20   0 45708  4248  1472 S  0.0  0.4  1:33.07 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
> >>>>>  1311 hacluster  20   0  109M 16160  9640 S  0.0  1.6  1:12.71 /usr/lib/pacemaker/cib
> >>>>>  1312 root       20   0  104M  7484  3780 S  0.0  0.7  0:38.06 /usr/lib/pacemaker/stonithd
> >>>>>  1611 root       -2   0  4408  2356  2000 S  0.0  0.2  0:24.15 /usr/sbin/watchdog
> >>>>>  1316 hacluster  20   0  122M  9756  5924 S  0.0  1.0  0:22.62 /usr/lib/pacemaker/crmd
> >>>>>  1313 root       20   0 81784  3800  2876 S  0.0  0.4  0:18.64 /usr/lib/pacemaker/lrmd
> >>>>>  1314 hacluster  20   0 96616  4132  2604 S  0.0  0.4  0:16.01 /usr/lib/pacemaker/attrd
> >>>>>  1309 root       20   0  104M  4804  2580 S  0.0  0.5  0:15.56 pacemakerd
> >>>>>  1250 root       20   0 33000  1192   928 S  0.0  0.1  0:13.59 ha_logd: read process
> >>>>>  1315 hacluster  20   0 73892  2652  1952 S  0.0  0.3  0:13.25 /usr/lib/pacemaker/pengine
> >>>>>  1252 root       20   0 33000   712   456 S  0.0  0.1  0:13.03 ha_logd: write process
> >>>>>  1835 ntp        20   0 27216  1980  1408 S  0.0  0.2  0:11.80 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:112
> >>>>>   899 root       20   0 19168   700   488 S  0.0  0.1  0:09.75 /usr/sbin/irqbalance
> >>>>>  1642 root       20   0 30696  1556   912 S  0.0  0.2  0:06.49 /usr/bin/monit -c /etc/monit/monitrc
> >>>>>  4374 kamailio   20   0  291M  7272  2188 S  0.0  0.7  0:02.77 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> >>>>>  3079 root        0 -20 16864  4592  3508 S  0.0  0.5  0:01.51 /usr/bin/atop -a -w /var/log/atop/atop_20140306 6
> >>>>>   445 syslog     20   0  249M  6276   976 S  0.0  0.6  0:01.16 rsyslogd
> >>>>>  4373 kamailio   20   0  291M  7492  2396 S  0.0  0.7  0:01.03 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> >>>>>     1 root       20   0 33376  2632  1404 S  0.0  0.3  0:00.63 /sbin/init
> >>>>>   453 syslog     20   0  249M  6276   976 S  0.0  0.6  0:00.63 rsyslogd
> >>>>>   451 syslog     20   0  249M  6276   976 S  0.0  0.6  0:00.53 rsyslogd
> >>>>>  4379 kamailio   20   0  291M  6224  1132 S  0.0  0.6  0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> >>>>>  4380 kamailio   20   0  291M  8516  3084 S  0.0  0.8  0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> >>>>>  4381 kamailio   20   0  291M  8252  2828 S  0.0  0.8  0:00.37 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> >>>>> 23315 root       20   0 24872  2476  1412 R  0.7  0.2  0:00.37 htop
> >>>>>  4367 kamailio   20   0  291M 10000  4864 S  0.0  1.0  0:00.36 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> >>>>>
> >>>>>
> >>>>> My questions:
> >>>>> -   Is this a corosync or pacemaker issue?
> >>>>> -   What are the CPG messages? Is it possible that we have a firewall
> >>>>>     issue?
> >>>>>
> >>>>>
> >>>>> Any hints would be great!
> >>>>>
> >>>>> Thanks,
> >>>>> Attila
> >>>>> _______________________________________________
> >>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> >>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>>>>
> >>>>> Project Home: http://www.clusterlabs.org Getting started:
> >>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>>>> Bugs: http://bugs.clusterlabs.org