[Pacemaker] Process cib loops infinitely with 100% cpu usage and can't be killed

Mon May 26 15:48:42 EDT 2014

Hello Andrew and cluster folks!

For the last month we have been experiencing a strange problem with the cib process on one of our
nodes ('gandalf') in a 4-node cluster. Brief description:

For some reason we still can't determine, the cib process starts looping indefinitely and consuming
100% CPU. Once that happens, crm_mon can no longer connect to Pacemaker and returns this output:

Could not establish cib_ro connection: Resource temporarily unavailable (11)

Connection to cluster failed: Transport endpoint is not connected

The Pacemaker processes look like this:

[DB1] gandalf # psx | grep pacemaker
root     25966  0.0  0.0  80480  2840 ?        S    May23   0:15 pacemakerd
496      25972 83.8  0.0 111632 27888 ?        Rs   May23 4045:07  \_ /usr/libexec/pacemaker/cib
root     25973  0.0  0.0 101716 12424 ?        Ss   May23   0:19  \_ /usr/libexec/pacemaker/stonithd
root     25974  0.0  0.0  76644  3552 ?        Ss   May23   0:30  \_ /usr/libexec/pacemaker/lrmd
496      25975  0.0  0.0  89624  3368 ?        Ss   May23   0:15  \_ /usr/libexec/pacemaker/attrd
496      25976  0.0  0.0  81172  2568 ?        Ss   May23   0:14  \_ /usr/libexec/pacemaker/pengine
root     25977  0.0  0.0 107700  7116 ?        Ss   May23   0:17  \_ /usr/libexec/pacemaker/crmd

The cluster is still operating normally with all resources running, and the other 3 members report
this node as online:

[VM2] lorien # crm_mon -1
Last updated: Mon May 26 16:38:18 2014
Last change: Fri May 23 08:02:17 2014 via cibadmin on lorien.san01.cooperativaobrera.coop
Stack: cman
Current DC: lorien.san01.cooperativaobrera.coop - partition with quorum
Version: 1.1.10-14.el6_5.2-368c726
4 Nodes configured
87 Resources configured

Online: [ gandalf.san01.cooperativaobrera.coop isildur.san01.cooperativaobrera.coop
lorien.san01.cooperativaobrera.coop mordor.san01.cooperativaobrera.coop ]

The node is in that state right now. I don't want to move resources because I don't know how the
cluster will react. If you would like me to run any tests or collect logs, I'll leave the node as
it is so you can ask for whatever you need.

Logging stopped just before the cib process started looping. The last messages are:

May 23 21:02:01 [25972] gandalf.cooperativaobrera.coop        cib:     info: crm_compress_string:       Compressed 258760 bytes into 14072 (ratio 18:1) in 67ms
May 23 21:02:01 [25972] gandalf.cooperativaobrera.coop        cib:     info: crm_client_destroy:        Destroying 0 events
May 23 21:03:01 [25972] gandalf.cooperativaobrera.coop        cib:     info: crm_client_new:    Connecting 0x2ddec30 for uid=0 gid=0 pid=17759 id=2ce51a5a-a70e-4b24-8726-b97b6f9013fd

Finally, the cib process can't be killed in this state, not even with "kill -9". We have to reboot
the node to clean up Pacemaker and start again.
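
A process that shows as running and yet ignores SIGKILL may be spinning inside the kernel instead of
in its own code. The following is only a minimal sketch, assuming a Linux /proc filesystem, of one
way to check that: it reads the state and the user/kernel CPU tick counters from /proc/<pid>/stat.
The function names are hypothetical helpers, not part of Pacemaker, and this has not been run on
the affected node.

```python
def parse_proc_stat(text):
    """Extract state and CPU tick counters from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' instead of on whitespace.
    head, _, tail = text.rpartition(")")
    pid_part, comm = head.split("(", 1)
    fields = tail.split()
    return {
        "pid": int(pid_part),
        "comm": comm,
        "state": fields[0],        # field 3 in proc(5): R, S, D, Z, ...
        "utime": int(fields[11]),  # field 14: ticks spent in user mode
        "stime": int(fields[12]),  # field 15: ticks spent in kernel mode
    }


def read_proc_stat(pid):
    """Read and parse /proc/<pid>/stat for a live process (Linux only)."""
    with open("/proc/%d/stat" % pid) as f:
        return parse_proc_stat(f.read())


def cpu_delta(before, after):
    """Return (user_ticks, kernel_ticks) consumed between two samples."""
    return (after["utime"] - before["utime"],
            after["stime"] - before["stime"])
```

Calling read_proc_stat(25972) twice a few seconds apart and passing both results to cpu_delta()
would show where the time is going. If almost all of the increase is in kernel ticks, that would
point towards a loop inside a system call and not in the cib code itself.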

The system is CentOS 6 with official packages. We were running pacemaker-1.1.10-14.el6_5.2.x86_64.
After the last reboot we upgraded to pacemaker-1.1.10-14.el6_5.3.x86_64 and the problem persists.

Have any of you experienced something like this?

Thanks in advance for any help!

Cheers

-- 
Lic. Gabriel Gomiz - Jefe de Sistemas / Administrador
ggomiz at cooperativaobrera.coop
Gerencia de Sistemas - Cooperativa Obrera Ltda.
Tel: (0291) 403-9700
