[Pacemaker] Pacemaker/corosync freeze
Attila Megyeri
amegyeri at minerva-soft.com
Thu Mar 6 18:31:30 UTC 2014
Hello,
We have a strange issue with Corosync/Pacemaker.
From time to time, something unexpected happens and the crm_mon output suddenly stops updating.
When I check the CPU usage, I see that one of the cores is at 100%, but I cannot attribute it to corosync or to any of the pacemaker processes.
When this happens, the high CPU usage occurs on all 7 nodes.
I have to go to each node manually, stop pacemaker, restart corosync, then start pacemaker. Stopping pacemaker and corosync normally does not work in most cases; usually a kill -9 is needed.
Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
Using udpu as transport, two rings on Gigabit Ethernet, rrp_mode passive.
Logs are usually flooded with CPG-related messages, such as:
Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
OR
Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
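To see which daemons are stuck retrying and how often, log lines like the ones above could be tallied per host and daemon. The following is only a minimal sketch, assuming the log format matches the excerpts shown here; the names FLUSH_RE, summarize_flush_retries and SAMPLE are made up for illustration and are not part of pacemaker or corosync.

```python
import re
from collections import Counter

# Matches the crm_cs_flush lines quoted above. The trailing "Try again (6)"
# part is deliberately not required, because some excerpts are truncated.
FLUSH_RE = re.compile(
    r"\[(?P<pid>\d+)\]\s+(?P<host>\S+)\s+(?P<daemon>\w+):\s+info:\s+"
    r"crm_cs_flush:\s+Sent (?P<sent>\d+) CPG messages\s+"
    r"\((?P<remaining>\d+) remaining, last=(?P<last>\d+)\)"
)


def summarize_flush_retries(lines):
    """Count crm_cs_flush lines that sent 0 messages, per (host, daemon)."""
    stalled = Counter()
    for line in lines:
        match = FLUSH_RE.search(line)
        if match and int(match.group("sent")) == 0:
            stalled[(match.group("host"), match.group("daemon"))] += 1
    return stalled


# Sample input copied from the log excerpts in this message.
SAMPLE = [
    "Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: "
    "Sent 0 CPG messages (1 remaining, last=8): Try again (6)",
    "Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: "
    "Sent 0 CPG messages (1 remaining, last=8): Try again (6)",
    "Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: "
    "Sent 0 CPG messages (1 remaining, last=10933): Try again (",
]

if __name__ == "__main__":
    for (host, daemon), count in sorted(summarize_flush_retries(SAMPLE).items()):
        print(f"{host} {daemon}: {count} stalled flush attempts")
```

Feeding the real log file (for example via `open(path)`) instead of SAMPLE would show whether the retries are confined to one daemon or spread across crmd, cib and others on every node.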
htop shows something like this (sorted by TIME+ descending):
1 [||||||||||||||||||||||||||||||||||||||||100.0%] Tasks: 59, 4 thr; 2 running
2 [| 0.7%] Load average: 1.00 0.99 1.02
Mem[|||||||||||||||||||||||||||||||| 165/994MB] Uptime: 1 day, 10:22:03
Swp[ 0/509MB]
PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
921 root 20 0 188M 49220 33856 R 0.0 4.8 3h33:58 /usr/sbin/corosync
1277 snmp 20 0 45708 4248 1472 S 0.0 0.4 1:33.07 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
1311 hacluster 20 0 109M 16160 9640 S 0.0 1.6 1:12.71 /usr/lib/pacemaker/cib
1312 root 20 0 104M 7484 3780 S 0.0 0.7 0:38.06 /usr/lib/pacemaker/stonithd
1611 root -2 0 4408 2356 2000 S 0.0 0.2 0:24.15 /usr/sbin/watchdog
1316 hacluster 20 0 122M 9756 5924 S 0.0 1.0 0:22.62 /usr/lib/pacemaker/crmd
1313 root 20 0 81784 3800 2876 S 0.0 0.4 0:18.64 /usr/lib/pacemaker/lrmd
1314 hacluster 20 0 96616 4132 2604 S 0.0 0.4 0:16.01 /usr/lib/pacemaker/attrd
1309 root 20 0 104M 4804 2580 S 0.0 0.5 0:15.56 pacemakerd
1250 root 20 0 33000 1192 928 S 0.0 0.1 0:13.59 ha_logd: read process
1315 hacluster 20 0 73892 2652 1952 S 0.0 0.3 0:13.25 /usr/lib/pacemaker/pengine
1252 root 20 0 33000 712 456 S 0.0 0.1 0:13.03 ha_logd: write process
1835 ntp 20 0 27216 1980 1408 S 0.0 0.2 0:11.80 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:112
899 root 20 0 19168 700 488 S 0.0 0.1 0:09.75 /usr/sbin/irqbalance
1642 root 20 0 30696 1556 912 S 0.0 0.2 0:06.49 /usr/bin/monit -c /etc/monit/monitrc
4374 kamailio 20 0 291M 7272 2188 S 0.0 0.7 0:02.77 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
3079 root 0 -20 16864 4592 3508 S 0.0 0.5 0:01.51 /usr/bin/atop -a -w /var/log/atop/atop_20140306 6
445 syslog 20 0 249M 6276 976 S 0.0 0.6 0:01.16 rsyslogd
4373 kamailio 20 0 291M 7492 2396 S 0.0 0.7 0:01.03 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
1 root 20 0 33376 2632 1404 S 0.0 0.3 0:00.63 /sbin/init
453 syslog 20 0 249M 6276 976 S 0.0 0.6 0:00.63 rsyslogd
451 syslog 20 0 249M 6276 976 S 0.0 0.6 0:00.53 rsyslogd
4379 kamailio 20 0 291M 6224 1132 S 0.0 0.6 0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
4380 kamailio 20 0 291M 8516 3084 S 0.0 0.8 0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
4381 kamailio 20 0 291M 8252 2828 S 0.0 0.8 0:00.37 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
23315 root 20 0 24872 2476 1412 R 0.7 0.2 0:00.37 htop
4367 kamailio 20 0 291M 10000 4864 S 0.0 1.0 0:00.36 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
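To compare the accumulated CPU time in the TIME+ column against the uptime, the htop values can be converted to seconds. This is a rough sketch, assuming TIME+ appears only in the two forms visible above ("3h33:58" and "1:33.07"); the helper name htop_time_to_seconds is hypothetical.

```python
import re


def htop_time_to_seconds(text):
    """Convert an htop TIME+ value such as '3h33:58' or '1:33.07' to seconds."""
    match = re.fullmatch(r"(?:(\d+)h)?(\d+):(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized TIME+ value: {text!r}")
    hours, minutes, seconds, fraction = match.groups()
    total = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
    if fraction:
        # Fractional seconds, e.g. "07" means 0.07 s.
        total += float("0." + fraction)
    return total


if __name__ == "__main__":
    # Values taken from the htop output above.
    corosync_cpu = htop_time_to_seconds("3h33:58")
    # Uptime "1 day, 10:22:03" expressed in seconds.
    uptime = 24 * 3600 + 10 * 3600 + 22 * 60 + 3
    print(f"corosync CPU share since boot: {corosync_cpu / uptime:.1%}")
```

With the numbers from this snapshot, corosync has accumulated roughly 10% of one core over the whole uptime, far more than any other process in the list.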
My questions:
- Is this a corosync or pacemaker issue?
- What are the CPG messages? Is it possible that we have a firewall issue?
Any hints would be great!
Thanks,
Attila