[Pacemaker] Pacemaker/corosync freeze

Thu Mar 6 19:31:30 CET 2014

Hello,

We have a strange issue with Corosync/Pacemaker.
From time to time, something unexpected happens and the crm_mon output suddenly stops updating.
When I check the CPU usage, I see that one of the cores is at 100%, but I cannot attribute it to either corosync or any of the pacemaker processes.

When this happens, the high CPU usage occurs on all 7 nodes.
I have to go to each node manually, stop pacemaker, restart corosync, then start pacemaker. Stopping pacemaker and corosync does not work in most cases; usually a kill -9 is needed.
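
The per-node sequence above could be scripted roughly as follows. This is only a sketch: it assumes both daemons are controlled through `service` init scripts named "pacemaker" and "corosync", and the function names are made up for illustration. It defaults to a dry run and does not model the kill -9 fallback.

```python
import subprocess


def recovery_commands():
    """Return the per-node recovery sequence, in the order described above.

    Assumption: the daemons are managed via `service <name> <action>`.
    """
    return [
        ["service", "pacemaker", "stop"],
        ["service", "corosync", "restart"],
        ["service", "pacemaker", "start"],
    ]


def run_recovery(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    planned = []
    for cmd in recovery_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence if a step fails.
            subprocess.run(cmd, check=True)
        planned.append(cmd)
    return planned
```

Calling `run_recovery()` only prints the three commands; `run_recovery(dry_run=False)` would actually run them on the local node.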

Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.

Using udpu as transport, two rings on Gigabit Ethernet, rrp_mode passive.

Logs are usually flooded with CPG related messages, such as:

Mar 06 18:10:49 [1316] ctsip1       crmd:     info: crm_cs_flush:       Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
Mar 06 18:10:49 [1316] ctsip1       crmd:     info: crm_cs_flush:       Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
Mar 06 18:10:50 [1316] ctsip1       crmd:     info: crm_cs_flush:       Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
Mar 06 18:10:50 [1316] ctsip1       crmd:     info: crm_cs_flush:       Sent 0 CPG messages  (1 remaining, last=8): Try again (6)

OR

Mar 06 17:46:24 [1341] ctdb1        cib:     info: crm_cs_flush:        Sent 0 CPG messages  (1 remaining, last=10933): Try again (
Mar 06 17:46:24 [1341] ctdb1        cib:     info: crm_cs_flush:        Sent 0 CPG messages  (1 remaining, last=10933): Try again (
Mar 06 17:46:24 [1341] ctdb1        cib:     info: crm_cs_flush:        Sent 0 CPG messages  (1 remaining, last=10933): Try again (
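
To get an overview of how often each daemon is stuck, lines like the ones above could be tallied with a short script. This is a minimal sketch: the regular expression is derived only from the excerpts quoted here, and the name `summarize_flush_stalls` is made up for illustration.

```python
import re
from collections import Counter

# Matches the crm_cs_flush lines quoted above. The trailing error text is
# not matched because some of the excerpts are truncated.
FLUSH_RE = re.compile(
    r"\[(?P<pid>\d+)\]\s+(?P<node>\S+)\s+(?P<daemon>\w+):\s+info:\s+"
    r"crm_cs_flush:\s+Sent (?P<sent>\d+) CPG messages\s+"
    r"\((?P<remaining>\d+) remaining, last=(?P<last>\d+)\)"
)


def summarize_flush_stalls(lines):
    """Count lines where 0 messages were sent, keyed by (node, daemon, last)."""
    stalls = Counter()
    for line in lines:
        m = FLUSH_RE.search(line)
        if m and int(m.group("sent")) == 0:
            key = (m.group("node"), m.group("daemon"), int(m.group("last")))
            stalls[key] += 1
    return stalls


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py < logfile
    for (node, daemon, last), count in summarize_flush_stalls(sys.stdin).most_common():
        print(f"{node} {daemon} last={last}: {count} stalled flushes")
```

A count that keeps growing for the same last= value would suggest that the same queued message is being retried rather than new ones piling up; that is an interpretation of the log format, not something confirmed.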

htop shows something like this (sorted by TIME+ descending):

  1  [||||||||||||||||||||||||||||||||||||||||100.0%]     Tasks: 59, 4 thr; 2 running
  2  [|                                         0.7%]     Load average: 1.00 0.99 1.02
  Mem[||||||||||||||||||||||||||||||||     165/994MB]     Uptime: 1 day, 10:22:03
  Swp[                                       0/509MB]

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
  921 root       20   0  188M 49220 33856 R  0.0  4.8  3h33:58 /usr/sbin/corosync
 1277 snmp       20   0 45708  4248  1472 S  0.0  0.4  1:33.07 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
 1311 hacluster  20   0  109M 16160  9640 S  0.0  1.6  1:12.71 /usr/lib/pacemaker/cib
 1312 root       20   0  104M  7484  3780 S  0.0  0.7  0:38.06 /usr/lib/pacemaker/stonithd
 1611 root       -2   0  4408  2356  2000 S  0.0  0.2  0:24.15 /usr/sbin/watchdog
 1316 hacluster  20   0  122M  9756  5924 S  0.0  1.0  0:22.62 /usr/lib/pacemaker/crmd
 1313 root       20   0 81784  3800  2876 S  0.0  0.4  0:18.64 /usr/lib/pacemaker/lrmd
 1314 hacluster  20   0 96616  4132  2604 S  0.0  0.4  0:16.01 /usr/lib/pacemaker/attrd
 1309 root       20   0  104M  4804  2580 S  0.0  0.5  0:15.56 pacemakerd
 1250 root       20   0 33000  1192   928 S  0.0  0.1  0:13.59 ha_logd: read process
 1315 hacluster  20   0 73892  2652  1952 S  0.0  0.3  0:13.25 /usr/lib/pacemaker/pengine
 1252 root       20   0 33000   712   456 S  0.0  0.1  0:13.03 ha_logd: write process
 1835 ntp        20   0 27216  1980  1408 S  0.0  0.2  0:11.80 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:112
  899 root       20   0 19168   700   488 S  0.0  0.1  0:09.75 /usr/sbin/irqbalance
 1642 root       20   0 30696  1556   912 S  0.0  0.2  0:06.49 /usr/bin/monit -c /etc/monit/monitrc
 4374 kamailio   20   0  291M  7272  2188 S  0.0  0.7  0:02.77 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
 3079 root        0 -20 16864  4592  3508 S  0.0  0.5  0:01.51 /usr/bin/atop -a -w /var/log/atop/atop_20140306 6
  445 syslog     20   0  249M  6276   976 S  0.0  0.6  0:01.16 rsyslogd
 4373 kamailio   20   0  291M  7492  2396 S  0.0  0.7  0:01.03 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
    1 root       20   0 33376  2632  1404 S  0.0  0.3  0:00.63 /sbin/init
  453 syslog     20   0  249M  6276   976 S  0.0  0.6  0:00.63 rsyslogd
  451 syslog     20   0  249M  6276   976 S  0.0  0.6  0:00.53 rsyslogd
 4379 kamailio   20   0  291M  6224  1132 S  0.0  0.6  0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
 4380 kamailio   20   0  291M  8516  3084 S  0.0  0.8  0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
 4381 kamailio   20   0  291M  8252  2828 S  0.0  0.8  0:00.37 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
23315 root       20   0 24872  2476  1412 R  0.7  0.2  0:00.37 htop
 4367 kamailio   20   0  291M 10000  4864 S  0.0  1.0  0:00.36 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili

My questions:

-   Is this a corosync or pacemaker issue?

-   What are the CPG messages? Is it possible that we have a firewall issue?

Any hints would be great!

Thanks,
Attila