[Pacemaker] Process cib loops infinitely with 100% cpu usage and can't be killed
Andrew Beekhof
andrew at beekhof.net
Tue May 27 01:56:20 CEST 2014
On 27 May 2014, at 5:48 am, Gabriel Gomiz <ggomiz at cooperativaobrera.coop> wrote:
> Hello Andrew and cluster folks!
>
> In the last month we are experiencing some weird problem with cib process in one of our nodes
> ('gandalf'), it's a 4-node cluster. Brief description:
>
> After some undetermined reason (we still can't figure out why) it begins looping infinitely and
> consuming 100% CPU.
Apart from the CPU usage, is there something in particular that makes you think it's looping?
There have been some big steps forward in the cib for the next upstream release (it's basically two orders of magnitude faster/more efficient).
Current versions will regularly max out a core, albeit hopefully only for short periods, depending on the cluster size:
https://twitter.com/beekhof/status/412913549837475840
It's also a vicious circle: a busy cib leads to failed resource actions, which lead to recovery operations, which lead to more work for the cib.
Looking at the size of your cluster, 87 resources on 4 nodes... I can imagine it benefiting greatly from the coming version.
I notice you're using a RHEL package. Are you a RH customer, or is this on a clone?
Also, did anything specific happen prior to the CIB going nuts?
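
On the "is it really looping" question, one rough way to tell is to sample /proc/<pid>/stat twice and see whether the CPU time is accruing in user space or in the kernel. A process that ignores kill -9 is usually stuck inside the kernel rather than in its own code, so the split is worth knowing. The following is only a sketch, not a Pacemaker tool: the helper names (parse_stat, classify, read_stat) and the 5-second interval are made up for illustration, and the user/kernel verdict is a crude heuristic.

```python
import sys
import time
from typing import NamedTuple


class ProcSample(NamedTuple):
    state: str  # single-letter state, e.g. R (running), S (sleeping), D (uninterruptible)
    utime: int  # clock ticks spent in user mode
    stime: int  # clock ticks spent in kernel mode


def parse_stat(line: str) -> ProcSample:
    """Parse one line in the format of /proc/<pid>/stat (see proc(5))."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so split only the text that follows the last ')'.
    rest = line[line.rindex(")") + 2:].split()
    # rest[0] is field 3 (state); fields 14 and 15 are utime and stime.
    return ProcSample(state=rest[0], utime=int(rest[11]), stime=int(rest[12]))


def classify(before: ProcSample, after: ProcSample) -> str:
    """Crude heuristic: where did the CPU time go between two samples?"""
    user = after.utime - before.utime
    kernel = after.stime - before.stime
    if user + kernel == 0:
        return "idle"
    return "mostly user space" if user >= kernel else "mostly kernel"


def read_stat(pid: int) -> ProcSample:
    with open(f"/proc/{pid}/stat") as handle:
        return parse_stat(handle.read())


if __name__ == "__main__" and len(sys.argv) > 1:
    pid = int(sys.argv[1])
    first = read_stat(pid)
    time.sleep(5)  # arbitrary sampling interval
    second = read_stat(pid)
    print(f"state={second.state} cpu={classify(first, second)}")
```

Run it with the cib pid as the only argument (25972 in the listing below). "mostly kernel" together with an unkillable process would point away from the cib's own code and towards whatever system call it is sitting in.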
> After that crm_mon command can't connect to pacemaker returning this output:
>
> Could not establish cib_ro connection: Resource temporarily unavailable (11)
>
> Connection to cluster failed: Transport endpoint is not connected
>
> Pacemaker process look like this:
>
> [DB1] gandalf # psx | grep pacemaker
> root 25966 0.0 0.0 80480 2840 ? S May23 0:15 pacemakerd
> 496 25972 83.8 0.0 111632 27888 ? Rs May23 4045:07 \_ /usr/libexec/pacemaker/cib
> root 25973 0.0 0.0 101716 12424 ? Ss May23 0:19 \_ /usr/libexec/pacemaker/stonithd
> root 25974 0.0 0.0 76644 3552 ? Ss May23 0:30 \_ /usr/libexec/pacemaker/lrmd
> 496 25975 0.0 0.0 89624 3368 ? Ss May23 0:15 \_ /usr/libexec/pacemaker/attrd
> 496 25976 0.0 0.0 81172 2568 ? Ss May23 0:14 \_ /usr/libexec/pacemaker/pengine
> root 25977 0.0 0.0 107700 7116 ? Ss May23 0:17 \_ /usr/libexec/pacemaker/crmd
>
> Cluster is still operating normally with all resources running and this node is reported alive in
> the other 3 members:
>
> [VM2] lorien # crm_mon -1
> Last updated: Mon May 26 16:38:18 2014
> Last change: Fri May 23 08:02:17 2014 via cibadmin on lorien.san01.cooperativaobrera.coop
> Stack: cman
> Current DC: lorien.san01.cooperativaobrera.coop - partition with quorum
> Version: 1.1.10-14.el6_5.2-368c726
> 4 Nodes configured
> 87 Resources configured
>
> Online: [ gandalf.san01.cooperativaobrera.coop isildur.san01.cooperativaobrera.coop
> lorien.san01.cooperativaobrera.coop mordor.san01.cooperativaobrera.coop ]
>
> In this moment the node is in that state, I don't want to move resources because I don't know how
> the cluster will react in this state. Please if you want me to make some tests or collect logs I'll
> leave the node in that state to make any test you want.
>
> Logs stopped just before cib process started looping. Last messages are:
>
> May 23 21:02:01 [25972] gandalf.cooperativaobrera.coop cib: info: crm_compress_string: Compressed 258760 bytes into 14072 (ratio 18:1) in 67ms
> May 23 21:02:01 [25972] gandalf.cooperativaobrera.coop cib: info: crm_client_destroy: Destroying 0 events
> May 23 21:03:01 [25972] gandalf.cooperativaobrera.coop cib: info: crm_client_new: Connecting 0x2ddec30 for uid=0 gid=0 pid=17759 id=2ce51a5a-a70e-4b24-8726-b97b6f9013fd
>
> Finally, cib process in this state can't be killed. Not even with "-9". We have to reboot the node
> to clean pacemaker and start again.
>
> System is CentOS 6, with official packages. We had version pacemaker-1.1.10-14.el6_5.2.x86_64. After
> the last reboot we upgraded to pacemaker-1.1.10-14.el6_5.3.x86_64 and problem still exists.
>
> Have any of you experienced something like this?
>
> Thanks in advance for any help!
>
> Cheers
>
> --
> Lic. Gabriel Gomiz - Jefe de Sistemas / Administrador
> ggomiz at cooperativaobrera.coop
> Gerencia de Sistemas - Cooperativa Obrera Ltda.
> Tel: (0291) 403-9700
>
>