[Pacemaker] high cib load on config change

Thu Oct 11 00:17:00 UTC 2012

On Wed, Oct 10, 2012 at 6:44 PM, James Harper
<james.harper at bendigoit.com.au> wrote:
>> On 10/09/2012 01:42 PM, James Harper wrote:
>> > As per previous post, I'm seeing very high cib load whenever I make a
>> > configuration change, enough load that things timeout seemingly
>> > instantly. I thought this was happening well before the configured
>> > timeout but now I'm not so sure, maybe the timeouts are actually
>> > working okay and it just seems instant. If the timeouts are in fact
>> > working correctly then it's keeping the CPU at 100% for over 30
>> > seconds to the exclusion of any monitoring checks (or maybe locking
>> > the cib so the checks can't run?)
>> >
>> > When I make a change I see the likes of this sort of thing in the logs
>> > (see data below email), which I thought might be solved by this
>> > https://github.com/ClusterLabs/pacemaker/commit/10e9e579ab032bde3938d7f3e13c414e297ba3e9
>> > but I just checked the 1.1.7 source that the Debian packages are built
>> > from and it turns out that that patch already exists in 1.1.7.
>> >
>> > Are the messages below actually an indication of a problem? If I
>> > understand it correctly it's failing to apply the configuration diff and
>> > is instead forcing a full resync of the configuration across some or all
>> > nodes, which is causing the high load.
>> >
>> > I ran the crm_report but it includes a lot of information I really need to
>> > remove so I'm reluctant to submit it in full unless it really all is required to
>> > resolve the problem.
>> >
>>
>> You already did some tuning like increasing batch-limit in your cluster
>> properties and increased corosync timings? Hard to say more without getting
>> more information ... if your configuration details are too sensitive to post on
>> a public mailing-list you can of course hire someone and give that information
>> under NDA ....
>>
>
> I guess I'd first like to know if the log entries I was seeing ("Failed application of an update diff" and "Requesting re-sync from peer") mean that a full resync is being done, and if that's a problem or not.

There are occasions when it's not a problem, but I don't think any of
them apply to you.

Questions:
- are you making any config changes when this behaviour is occurring?
- if so, from one node only or many?
- what version is this?  1.1.7 or 1.1.7 plus some Debian patches? which patches?
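
To line the resyncs up against your config changes, something like the sketch below might help. It only counts the two messages you quoted, grouped per minute and per host. The syslog-style timestamp layout and the name count_resyncs are my assumptions, not anything Pacemaker ships, so adjust the regex to whatever your logs actually look like.

```python
import re
from collections import Counter

# The two message fragments quoted earlier in this thread.
PATTERNS = (
    "Failed application of an update diff",
    "Requesting re-sync from peer",
)

# Assumption: syslog-style lines such as "Oct 10 18:44:01 host daemon: ...".
# Captures the timestamp truncated to the minute, and the host name.
TIMESTAMP = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+(\S+)")


def count_resyncs(lines):
    """Count matching messages, keyed by (minute, host, message)."""
    counts = Counter()
    for line in lines:
        for pattern in PATTERNS:
            if pattern in line:
                match = TIMESTAMP.match(line)
                if match:
                    key = (match.group(1), match.group(2))
                else:
                    # Line did not fit the assumed timestamp layout.
                    key = ("unknown", "unknown")
                counts[key + (pattern,)] += 1
    return counts
```

Feed it an open log file (or any iterable of lines) and print the sorted items. If the counts spike on every node in the same minute as a single small change, that would point at the resync being triggered cluster-wide rather than on one peer.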

> My understanding of my problem is that for whatever reason, a full resync is taking a lot more CPU than I would have expected, and is being triggered even for minor changes (e.g. adding a location to a resource). Resolving the former (if it's actually a problem?) would be nice, but resolving the latter would be acceptable for now.
>
> As for increasing batch limit, http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html tells me that this is "The number of jobs that the TE is allowed to execute in parallel.". If changing a single resource location completely consumes all CPU in all nodes for many seconds, is allowing more work to be done in parallel really going to help?
>
> For the corosync timings, are these the "token", "join", etc. values in corosync.conf? I don't have any evidence in my logs that that layer is having any problems, although a _lot_ of logs are generated and I could easily miss something. That would only sidestep the issue though, I think.
>
> In any case I'll endeavour to clean up my logs as required and submit a bug report.
>
> Thanks for your time and patience
>
> James