[Pacemaker] Process cib loops infinitely with 100% cpu usage and can't be killed

Tue May 27 01:56:20 CEST 2014

On 27 May 2014, at 5:48 am, Gabriel Gomiz <ggomiz at cooperativaobrera.coop> wrote:

> Hello Andrew and cluster folks!
> 
> Over the last month we have been seeing a weird problem with the cib process on one of the nodes
> ('gandalf') of our 4-node cluster. Brief description:
> 
> For a reason we still can't determine, it begins looping infinitely and
> consuming 100% CPU.

Apart from the CPU usage, is there something in particular that makes you think it's looping?
There have been some big steps forward in cib for the next upstream release (it's basically 2 orders of magnitude faster/more efficient).
Current versions will regularly max out a core, albeit for hopefully short periods of time depending on the cluster size:

	https://twitter.com/beekhof/status/412913549837475840
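
One way to tell a short burst from a sustained spin is to sample the process's CPU counters for a few seconds. Below is a rough Python sketch, assuming a Linux /proc filesystem; the function names are made up for illustration and are not part of Pacemaker. It reads fields 3, 14 and 15 (state, utime, stime) of /proc/<pid>/stat and reports user and system time separately.

```python
import os
import time


def parse_proc_stat(text):
    """Return (state, utime_ticks, stime_ticks) from /proc/<pid>/stat contents."""
    # The command name sits in parentheses and may itself contain spaces,
    # so cut at the last ')' before splitting the remaining fields.
    fields = text.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    return fields[0], int(fields[11]), int(fields[12])


def sample(pid, interval=1.0, count=5):
    """Yield (state, user_pct, system_pct) of one core for each interval."""
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    path = "/proc/%d/stat" % pid
    with open(path) as f:
        _, last_u, last_s = parse_proc_stat(f.read())
    for _ in range(count):
        time.sleep(interval)
        with open(path) as f:
            state, u, s = parse_proc_stat(f.read())
        scale = 100.0 / (hz * interval)
        yield state, (u - last_u) * scale, (s - last_s) * scale
        last_u, last_s = u, s
```

Calling list(sample(25972)) on the affected node would give five one-second readings for the cib pid shown in the report. Readings that stay near 100% in every interval would suggest a real spin rather than the short bursts described above, and the user/system split hints at whether the time is being spent in cib's own code or in the kernel.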

It's also a vicious circle - a busy cib leads to failed resource actions, which lead to recovery operations, which lead to more work for the cib.

Looking at the size of your cluster, 87 resources on 4 nodes... I can imagine that benefitting greatly from the coming version.

I notice you're using a RHEL package; are you a RH customer or is this on a clone?
Also, did anything specific happen prior to the CIB going nuts?

> After that, the crm_mon command can't connect to pacemaker and returns this output:
> 
> Could not establish cib_ro connection: Resource temporarily unavailable (11)
> 
> Connection to cluster failed: Transport endpoint is not connected
> 
> The Pacemaker processes look like this:
> 
> [DB1] gandalf # psx | grep pacemaker
> root     25966  0.0  0.0  80480  2840 ?        S    May23   0:15 pacemakerd
> 496      25972 83.8  0.0 111632 27888 ?        Rs   May23 4045:07  \_ /usr/libexec/pacemaker/cib
> root     25973  0.0  0.0 101716 12424 ?        Ss   May23   0:19  \_ /usr/libexec/pacemaker/stonithd
> root     25974  0.0  0.0  76644  3552 ?        Ss   May23   0:30  \_ /usr/libexec/pacemaker/lrmd
> 496      25975  0.0  0.0  89624  3368 ?        Ss   May23   0:15  \_ /usr/libexec/pacemaker/attrd
> 496      25976  0.0  0.0  81172  2568 ?        Ss   May23   0:14  \_ /usr/libexec/pacemaker/pengine
> root     25977  0.0  0.0 107700  7116 ?        Ss   May23   0:17  \_ /usr/libexec/pacemaker/crmd
> 
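
For reading the STAT column in the listing above, here is a small Python sketch that spells out the codes. The meanings are taken from the PROCESS STATE CODES section of the procps ps(1) man page; decode_stat is a made-up helper name, not an existing tool.

```python
# Meanings follow the PROCESS STATE CODES section of the procps ps(1) man page.
STATES = {
    "D": "uninterruptible sleep (usually I/O)",
    "R": "running or runnable",
    "S": "interruptible sleep",
    "T": "stopped",
    "Z": "zombie",
}
FLAGS = {
    "<": "high priority",
    "N": "low priority",
    "L": "pages locked in memory",
    "s": "session leader",
    "l": "multi-threaded",
    "+": "foreground process group",
}


def decode_stat(stat):
    """Turn a ps STAT value such as 'Rs' into a readable description."""
    parts = [STATES.get(stat[0], "unknown state")]
    parts += [FLAGS.get(flag, "unknown flag") for flag in stat[1:]]
    return ", ".join(parts)


print(decode_stat("Rs"))  # the cib line in the listing
print(decode_stat("Ss"))  # stonithd, lrmd, attrd, pengine and crmd
```

So cib is the only daemon in the listing that is on (or waiting for) a CPU; the others are sleeping and can be woken by a signal.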
> The cluster is still operating normally with all resources running, and this node is reported alive by
> the other 3 members:
> 
> [VM2] lorien # crm_mon -1
> Last updated: Mon May 26 16:38:18 2014
> Last change: Fri May 23 08:02:17 2014 via cibadmin on lorien.san01.cooperativaobrera.coop
> Stack: cman
> Current DC: lorien.san01.cooperativaobrera.coop - partition with quorum
> Version: 1.1.10-14.el6_5.2-368c726
> 4 Nodes configured
> 87 Resources configured
> 
> Online: [ gandalf.san01.cooperativaobrera.coop isildur.san01.cooperativaobrera.coop
> lorien.san01.cooperativaobrera.coop mordor.san01.cooperativaobrera.coop ]
> 
> The node is in that state right now. I don't want to move resources because I don't know how
> the cluster will react. If you want me to run some tests or collect logs, I'll
> leave the node as it is so you can test whatever you want.
> 
> The logs stopped just before the cib process started looping. The last messages are:
> 
> May 23 21:02:01 [25972] gandalf.cooperativaobrera.coop        cib:     info: crm_compress_string:       Compressed 258760 bytes into 14072 (ratio 18:1) in 67ms
> May 23 21:02:01 [25972] gandalf.cooperativaobrera.coop        cib:     info: crm_client_destroy:        Destroying 0 events
> May 23 21:03:01 [25972] gandalf.cooperativaobrera.coop        cib:     info: crm_client_new:    Connecting 0x2ddec30 for uid=0 gid=0 pid=17759 id=2ce51a5a-a70e-4b24-8726-b97b6f9013fd
> 
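
As a quick sanity check on the numbers quoted so far, the TIME column for cib (4045:07) can be compared with the wall-clock time since the last log line. The Python sketch below makes two assumptions that are not in the report: that the ps listing was taken at about the same moment as the crm_mon output on lorien, and that cib had used very little CPU before the logs stopped. The helper name busy_fraction is made up for illustration.

```python
from datetime import datetime


def busy_fraction(cpu_seconds, start, end):
    """Share of the wall-clock window [start, end] covered by CPU time."""
    return cpu_seconds / (end - start).total_seconds()


# Figures copied from the quoted output; the year comes from the crm_mon header.
log_stopped = datetime(2014, 5, 23, 21, 3, 1)   # last cib log line
observed = datetime(2014, 5, 26, 16, 38, 18)    # crm_mon "Last updated" (assumed ps time)
cpu_seconds = 4045 * 60 + 7                     # ps TIME column, 4045:07

print(round(100 * busy_fraction(cpu_seconds, log_stopped, observed), 1))  # 99.7
```

Under those assumptions the accumulated CPU time accounts for nearly the whole period since the logs went quiet, which would be consistent with a core being held continuously rather than in short bursts. The 83.8 in the %CPU column is lower because ps averages over the whole lifetime of the process.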
> Finally, the cib process can't be killed in this state, not even with "-9". We have to reboot the node
> to clean up pacemaker and start again.
> 
> The system is CentOS 6 with official packages. We had pacemaker-1.1.10-14.el6_5.2.x86_64; after
> the last reboot we upgraded to pacemaker-1.1.10-14.el6_5.3.x86_64 and the problem persists.
> 
> Have any of you experienced something like this?
> 
> Thanks in advance for any help!
> 
> Cheers
> 
> -- 
> Lic. Gabriel Gomiz - Jefe de Sistemas / Administrador
> ggomiz at cooperativaobrera.coop
> Gerencia de Sistemas - Cooperativa Obrera Ltda.
> Tel: (0291) 403-9700
