[Pacemaker] Managing big number of globally-unique clone instances

Mon Jul 21 10:37:45 UTC 2014

On 21 Jul 2014, at 3:09 pm, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:

> 21.07.2014 06:21, Andrew Beekhof wrote:
>> 
>> On 18 Jul 2014, at 5:16 pm, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
>> 
>>> Hi Andrew, all,
>>> 
>>> I have a task which seems to be easily solvable with the use of a
>>> globally-unique clone: start a huge number of specific virtual machines to
>>> provide a load to a connection multiplexer.
>>> 
>>> I decided to see how Pacemaker behaves in such a setup with the Dummy
>>> resource agent, and found that handling of every instance in an
>>> "initial" transition (probe+start) slows down as clone-max increases.
>> 
>> "yep"
>> 
>> For non-unique clones, the number of probes needed is N, where N is the number of nodes.
>> For unique clones, we must test every instance and node combination, or N*M, where M is clone-max.
>> 
>> And that's just the running of the probes... just figuring out which nodes need to be
>> probed is incredibly resource intensive (run crm_simulate and it will be painfully obvious). 
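
A minimal sketch of the probe arithmetic quoted above, for illustration only. The function name is hypothetical, and the node count of 3 is an assumption (the thread never states it); only the N and N*M formulas and the two clone-max values come from the messages.

```python
def probe_count(nodes: int, clone_max: int, globally_unique: bool) -> int:
    """Probes needed for one clone in the initial transition.

    Non-unique clone: one probe per node (N).
    Globally-unique clone: one probe per instance per node (N * M).
    """
    if nodes < 0 or clone_max < 0:
        raise ValueError("nodes and clone_max must be non-negative")
    return nodes * clone_max if globally_unique else nodes


NODES = 3  # assumption: not stated anywhere in the thread

for clone_max in (256, 1024):
    anonymous = probe_count(NODES, clone_max, globally_unique=False)
    unique = probe_count(NODES, clone_max, globally_unique=True)
    print(f"clone-max={clone_max}: non-unique={anonymous}, unique={unique}")
```

Under that assumption, going from clone-max 256 to 1024 raises the probe count for a unique clone from 768 to 3072, while a non-unique clone would stay at 3.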
>> 
>>> 
>>> E.g., for 256 instances the transition took 225 seconds, ~0.88s per instance.
>>> After I added 768 more instances (set clone-max to 1024)

How many nodes though?
Assuming 3, that's still only ~1s per operation (including the time taken to send the operation across the network twice and update the CIB).

>>> together with
>>> increasing batch-limit to 512, transition took almost an hour (3507
>>> seconds), or ~4.57s per added instance. Even if I take into account that
>>> monitoring of already started instances consumes some resources, the last
>>> number seems rather big,
> 
> I believe this ^ is the main point.
> If with N instances probe/start of _each_ instance takes X time slots,
> then with 4*N instances probe/start of _each_ instance takes ~5*X time
> slots. In an ideal world, I would expect it to remain constant.
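
The per-instance figures above can be rechecked with a few lines of arithmetic. This is only a sketch: the function name is hypothetical, and the inputs are the two transition times and instance counts reported in the quoted message.

```python
def seconds_per_instance(total_seconds: float, instances: int) -> float:
    """Average wall-clock time spent per instance in one transition."""
    if instances <= 0:
        raise ValueError("instances must be positive")
    return total_seconds / instances


# Figures reported in the thread.
first = seconds_per_instance(225, 256)    # initial 256 instances
second = seconds_per_instance(3507, 768)  # the 768 added instances
slowdown = second / first

print(f"first run:  {first:.2f} s per instance")
print(f"second run: {second:.2f} s per added instance")
print(f"slowdown:   {slowdown:.1f}x")
```

This gives roughly 0.88 s and 4.57 s per instance, a slowdown of about 5.2x for a 4x increase in clone-max, which matches the "~5*X" estimate. Note that batch-limit was also changed between the two runs, so the comparison does not isolate the effect of clone-max.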

Unless you have 512 cores in the cluster, increasing the batch-limit in this way is certainly not going to give you the results you're looking for.
Firing more tasks at a machine just ends up producing more context switches as the kernel tries to juggle the various tasks.

More context switches == more CPU wasted == more time taken overall == completely consistent with your results. 
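
As a rough way to see the mismatch, one can compare batch-limit with the number of cores in the cluster. This is a crude sketch, not a tuning rule: the function name, the 3-node count, and the 8 cores per node are all hypothetical, and it treats every parallel job as CPU-bound, which real cluster actions need not be.

```python
def oversubscription(batch_limit: int, nodes: int, cores_per_node: int) -> float:
    """Ratio of jobs allowed in parallel to cores available cluster-wide.

    A value above 1.0 means more concurrent jobs than cores, so the
    kernel has to time-slice between them.
    """
    total_cores = nodes * cores_per_node
    if total_cores <= 0:
        raise ValueError("the cluster must have at least one core")
    return batch_limit / total_cores


# Hypothetical cluster: 3 nodes with 8 cores each (neither is stated in the thread).
ratio = oversubscription(batch_limit=512, nodes=3, cores_per_node=8)
print(f"{ratio:.1f} jobs per core")
```

With these assumed numbers, a batch-limit of 512 allows about 21 concurrent jobs per core.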

> Otherwise we have a scalability issue in this direction.
> 
>>> 
>>> The main CPU consumer on the DC while the transition is running is crmd.
>>> Its memory footprint is around 85 MB, and the resulting CIB size, together
>>> with the status section, is around 2 MB,
>> 
>> You said CPU and then listed RAM...
> 
> Something wrong with that? :)
> Those are just three distinct facts.

I was expecting quantification of the relative CPU usage.
I was also expecting the PE to have massive spikes whenever a new transition is calculated.

> 
>> 
>>> 
>>> In your opinion, could this use case be optimized with minimal
>>> effort? Could it be optimized with configuration alone? Or might it
>>> be some trivial development task, e.g. replacing one GList with a
>>> GHashTable somewhere?
>> 
>> Optimize: yes, Minimal: no
>> 
>>> 
>>> Sure, I can look deeper and gather any additional information, e.g.
>>> crmd profiling results, if it is hard to answer off the top of your head.
>> 
>> Perhaps start looking in clone_create_probe()
> 
> Got it, thanks for the pointer!
> 
>> 
>>> 
>>> Best,
>>> Vladislav
>>> 
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>> 
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
