[Pacemaker] Process cib loops infinitely with 100% cpu usage and can't be killed

Gabriel Gomiz ggomiz at cooperativaobrera.coop
Mon Jun 9 19:56:12 EDT 2014


On 05/30/2014 12:12 AM, Andrew Beekhof wrote:
> There have been some big steps forward in cib for the next upstream release (it's basically 2 orders of magnitude faster/more efficient).
> Current versions will regularly max out a core, albeit for hopefully short periods of time depending on the cluster size:
>
> 	https://twitter.com/beekhof/status/412913549837475840
>
> It's also a vicious circle - a busy cib leads to failed resource actions, which lead to recovery operations, which lead to more work for the cib.
>
> Looking at the size of your cluster, 87 resources on 4 nodes... I can imagine that benefitting greatly from the coming version.
>
> I notice you're using a rhel package, are you a RH customer or is this on a clone?
Clone. CentOS.
> Also, did anything specific happen prior to the CIB going nuts?
>> Only thing that I can think of is a lot of calls to crm_mon via a shell script that we use to check
>> which resource groups each node is servicing (attached if you're curious).
>> We use this script to apply puppet manifests conditionally to our nodes and do some monitoring. Also
>> we have cron jobs checking via the script if the resource group is active before running.
>> Maybe the sum of those calls can make the cib process very busy...?
> If you were running it every second... maybe. But something is _seriously_ wrong if -KILL isn't working!
> I wonder how much memory it was using at the time... perhaps the kernel was trying to write a huge core file?
I don't think so. It was several days in that state.

Is there any way to check if a node has a resource group via a single simple call to crm resource?
Because I didn't find a way, we had to write a script that parses the entire crm_mon output.
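
For illustration only, here is a minimal sketch of what such a single-call check might look like. It assumes that `crm_resource --resource NAME --locate` is available and prints lines of the form "resource NAME is running on: NODE"; that output format is an assumption and may differ between Pacemaker versions. The function names and the group/node names in the usage note are hypothetical.

```python
import subprocess


def nodes_from_locate_output(output):
    """Extract node names from `crm_resource --locate` style output.

    Assumes lines shaped like "resource NAME is running on: NODE [ROLE]".
    Lines without that marker (e.g. "... is NOT running") are ignored.
    """
    marker = "is running on:"
    nodes = []
    for line in output.splitlines():
        if marker in line:
            fields = line.split(marker, 1)[1].split()
            if fields:
                # First field after the marker is the node name;
                # anything after it (such as a role) is dropped.
                nodes.append(fields[0])
    return nodes


def group_is_on_node(group, node):
    """Ask the cluster where `group` runs and check whether `node` is listed."""
    result = subprocess.run(
        ["crm_resource", "--resource", group, "--locate"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return node in nodes_from_locate_output(result.stdout)
```

A cron job could then call something like `group_is_on_node("web-group", "node1")` (both names made up here) instead of parsing the full crm_mon output. Each call still queries the cluster once, so running it very frequently would still add load to the cib.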
>
>> Anyway, I've built 1.1.12 rc1 RPMS and this morning I've upgraded the cluster. Will let you know if
>> there is something weird after this upgrade.
> Ok, I'd be interested to hear your feedback.

1.1.12 rc1 has been working flawlessly so far. So it looks like it's fixed in that version.

Thanks!

