[Pacemaker] high cib load on config change

James Harper james.harper at bendigoit.com.au
Wed Oct 10 03:44:37 EDT 2012


> On 10/09/2012 01:42 PM, James Harper wrote:
> > As per previous post, I'm seeing very high cib load whenever I make a
> > configuration change, enough load that things time out seemingly
> > instantly. I thought this was happening well before the configured
> > timeout, but now I'm not so sure; maybe the timeouts are actually
> > working okay and it just seems instant. If the timeouts are in fact
> > working correctly, then it's keeping the CPU at 100% for over 30
> > seconds to the exclusion of any monitoring checks (or maybe locking
> > the cib so the checks can't run?).
> >
> > When I make a change I see the likes of this sort of thing in the logs
> > (see data below email), which I thought might be solved by
> > https://github.com/ClusterLabs/pacemaker/commit/10e9e579ab032bde3938d7f3e13c414e297ba3e9
> > but I just checked the 1.1.7 source that the Debian packages are built
> > from and it turns out that that patch already exists in 1.1.7.
> >
> > Are the messages below actually an indication of a problem? If I
> > understand it correctly, it's failing to apply the configuration diff
> > and is instead forcing a full resync of the configuration across some
> > or all nodes, which is causing the high load.
> >
> > I ran the crm_report but it includes a lot of information I really need to
> > remove so I'm reluctant to submit it in full unless it really all is required to
> > resolve the problem.
> >
> 
> You already did some tuning, like increasing batch-limit in your cluster
> properties and the corosync timings? Hard to say more without getting
> more information ... if your configuration details are too sensitive to
> post on a public mailing list, you can of course hire someone and provide
> that information under NDA ....
> 

I guess I'd first like to know whether the log entries I was seeing ("Failed application of an update diff" and "Requesting re-sync from peer") mean that a full resync is being done, and whether that's a problem or not. My understanding of my problem is that, for whatever reason, a full resync is taking far more CPU than I would have expected, and is being triggered even for minor changes (e.g. adding a location constraint to a resource). Resolving the former (if it's actually a problem?) would be nice, but resolving the latter would be acceptable for now.
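As a rough gauge of how often this is happening, something like the following should show how frequently each message turns up (the log path is just an assumption for a stock Debian syslog setup; adjust as needed):

    # count occurrences of each message since the last log rotation
    # (log path is an assumption, not necessarily where pacemaker logs)
    grep -c 'Failed application of an update diff' /var/log/syslog
    grep -c 'Requesting re-sync from peer' /var/log/syslog

If the two counts track each other closely, that would at least confirm that every failed diff is turning into a resync request.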

As for increasing the batch-limit, http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html tells me that this is "The number of jobs that the TE is allowed to execute in parallel." If changing a single resource location constraint completely consumes all CPU on all nodes for many seconds, is allowing more work to be done in parallel really going to help?
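For reference, if bumping it is worth trying anyway, I gather it can be set like this (the value 30 is purely illustrative, not a recommendation):

    # via the crm shell (value is illustrative)
    crm configure property batch-limit=30

    # or directly, with crm_attribute (crm_config is the default type)
    crm_attribute --name batch-limit --update 30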

For the corosync timings, are these the "token", "join", etc. values in corosync.conf? I don't have any evidence in my logs that that layer is having any problems, although a _lot_ of logs are generated and I could easily miss something. That would only sidestep the issue anyway, I think.
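Assuming those are the settings meant, the relevant block would look something like this (values are illustrative only; all times in milliseconds):

    # excerpt from /etc/corosync/corosync.conf -- values illustrative
    totem {
        version: 2
        token: 5000        # ms before a token loss is declared
        token_retransmits_before_loss_const: 10
        join: 1000         # ms to wait for join messages
        consensus: 7500    # ms to wait for consensus; must exceed token
    }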

In any case I'll endeavour to clean up my logs as required and submit a bug report.

Thanks for your time and patience

James
