[Pacemaker] Managing a large number of globally-unique clone instances
Andrew Beekhof
andrew at beekhof.net
Tue Jul 22 02:44:59 CEST 2014
On 21 Jul 2014, at 11:07 pm, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
> 21.07.2014 13:37, Andrew Beekhof wrote:
>>
>> On 21 Jul 2014, at 3:09 pm, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
>>
>>> 21.07.2014 06:21, Andrew Beekhof wrote:
>>>>
>>>> On 18 Jul 2014, at 5:16 pm, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
>>>>
>>>>> Hi Andrew, all,
>>>>>
>>>>> I have a task which seems to be easily solvable with a
>>>>> globally-unique clone: start a huge number of specific virtual machines to
>>>>> provide load to a connection multiplexer.
>>>>>
>>>>> I decided to look at how pacemaker behaves in such a setup with the Dummy
>>>>> resource agent, and found that the handling of every instance in an
>>>>> "initial" transition (probe+start) slows down as clone-max increases.
>>>>
>>>> "yep"
>>>>
>>>> for non-unique clones the number of probes needed is N, where N is the number of nodes.
>>>> for unique clones, we must test every instance and node combination, or N*M, where M is clone-max.
>>>>
>>>> And that's just the running of the probes... just figuring out which nodes need to be
>>>> probed is incredibly resource intensive (run crm_simulate and it will be painfully obvious).
>>>>
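[Editor's aside: a minimal sketch of the probe arithmetic above. The counts mirror the two-node, clone-max=1024 scenario discussed in this thread; the code itself is purely illustrative and not part of Pacemaker.]
=========
#include <stdio.h>

int main(void)
{
    int nodes = 2;          /* N: nodes in the cluster */
    int clone_max = 1024;   /* M: clone-max of a globally-unique clone */

    /* Anonymous clones: one probe per node is enough. */
    printf("anonymous clone probes: %d\n", nodes);

    /* Globally-unique clones: every instance/node combination must be probed. */
    printf("unique clone probes:    %d\n", nodes * clone_max);
    return 0;
}
=========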
>>>>>
>>>>> E.g. for 256 instances the transition took 225 seconds, ~0.88s per instance.
>>>>> After I added 768 more instances (set clone-max to 1024)
>>
>> How many nodes though?
>
> Two nodes run in VMs.
>
>> Assuming 3, that's still only ~1s per operation (including the time taken to send the operation across the network twice and update the cib).
>>
>>>>> together with
>>>>> increasing batch-limit to 512, the transition took almost an hour (3507
>>>>> seconds), or ~4.57s per added instance. Even taking into account that
>>>>> monitoring of already-started instances consumes some resources, the
>>>>> latter number seems rather big,
>>>
>>> I believe this ^ is the main point.
>>> If with N instances probe/start of _each_ instance takes X time slots,
>>> then with 4*N instances probe/start of _each_ instance takes ~5*X time
>>> slots. In an ideal world, I would expect it to remain constant.
>>
>> Unless you have 512 cores in the cluster, increasing the batch-limit in this way is certainly not going to give you the results you're looking for.
>> Firing more tasks at a machine just ends up producing more context switches as the kernel tries to juggle the various tasks.
>>
>> More context switches == more CPU wasted == more time taken overall == completely consistent with your results.
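[Editor's aside: a toy model of the point above, not the actual run_graph() code; the job and context-switch costs are made-up assumptions. It only illustrates that once batch-limit exceeds the number of cores, the extra parallelism buys nothing while scheduling overhead grows.]
=========
#include <stdio.h>

/* Wall-clock estimate for `jobs` equal tasks launched `parallel` at a time on
 * `cores` CPUs: useful parallelism is capped at `cores`, and every task beyond
 * that adds a small context-switch cost. */
static double wall_time(int jobs, int parallel, int cores,
                        double job_s, double switch_s)
{
    int effective = parallel < cores ? parallel : cores;
    int oversub = parallel > cores ? parallel - cores : 0;

    return (double)jobs * job_s / effective + (double)oversub * switch_s;
}

int main(void)
{
    printf("batch-limit  30 on 2 cores: ~%.0fs\n", wall_time(1024, 30, 2, 0.3, 0.05));
    printf("batch-limit 512 on 2 cores: ~%.0fs\n", wall_time(1024, 512, 2, 0.3, 0.05));
    return 0;
}
=========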
>
> Thanks to oprofile, I was able to gain an 8-9% speedup with the following patch:
> =========
> diff --git a/crmd/te_utils.c b/crmd/te_utils.c
> index 2167370..c612718 100644
> --- a/crmd/te_utils.c
> +++ b/crmd/te_utils.c
> @@ -374,8 +374,6 @@ te_graph_trigger(gpointer user_data)
> graph_rc = run_graph(transition_graph);
> transition_graph->batch_limit = limit; /* Restore the configured value */
>
> - print_graph(LOG_DEBUG_3, transition_graph);
> -
This one can go... it gets called every time an action finishes.
> if (graph_rc == transition_active) {
> crm_trace("Transition not yet complete");
> return TRUE;
> diff --git a/crmd/tengine.c b/crmd/tengine.c
> index 765628c..ec0e1d4 100644
> --- a/crmd/tengine.c
> +++ b/crmd/tengine.c
> @@ -221,7 +221,6 @@ do_te_invoke(long long action,
> }
>
> trigger_graph();
> - print_graph(LOG_DEBUG_2, transition_graph);
This is once per transition though... shouldn't hurt much.
>
> if (graph_data != input->xml) {
> free_xml(graph_data);
> =========
>
> This time the results are measured only for a clean start op, after probes are done (add a
> stopped clone, wait for probes to complete, then start the clone).
> 256(vanilla): 09:51:50 - 09:53:17 => 1:27 = 87s => 0.33984375 s per instance
> 1024(vanilla): 10:17:10 - 10:34:34 => 17:24 = 1044s => 1.01953125 s per instance
> 1024(patched): 11:59:26 - 12:15:12 => 15:46 = 946s => 0.92382813 s per instance
>
> So, still not perfect, but better.
>
> Unfortunately, my binaries are built with optimization, so I'm not able to get call
> graphs yet.
>
> Also, as I run in VMs, no hardware support for oprofile is available, so the results may be
> a bit inaccurate.
>
> Here is the top of the system-wide opreport for unpatched crmd with 1024 instances:
> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> Profiling through timer interrupt
> samples % image name app name symbol name
> 429963 41.3351 no-vmlinux no-vmlinux /no-vmlinux
> 129533 12.4528 libxml2.so.2.7.6 libxml2.so.2.7.6 /usr/lib64/libxml2.so.2.7.6
> 101326 9.7411 libc-2.12.so libc-2.12.so __strcmp_sse42
> 42524 4.0881 libtransitioner.so.2.0.1 libtransitioner.so.2.0.1 print_synapse
> 37062 3.5630 libc-2.12.so libc-2.12.so malloc_consolidate
> 23268 2.2369 libcrmcommon.so.3.2.0 libcrmcommon.so.3.2.0 find_entity
> 21416 2.0589 libc-2.12.so libc-2.12.so _int_malloc
> 18950 1.8218 libcrmcommon.so.3.2.0 libcrmcommon.so.3.2.0 crm_element_value
> 17482 1.6807 libfreebl3.so libfreebl3.so /lib64/libfreebl3.so
> 15350 1.4757 libc-2.12.so libc-2.12.so vfprintf
> 15016 1.4436 libqb.so.0.16.0 libqb.so.0.16.0 /usr/lib64/libqb.so.0.16.0
> 13189 1.2679 bash bash /bin/bash
> 11375 1.0936 libc-2.12.so libc-2.12.so _int_free
> 10762 1.0346 libtotem_pg.so.5.0.0 libtotem_pg.so.5.0.0 /usr/lib64/libtotem_pg.so.5.0.0
> 10345 0.9945 libc-2.12.so libc-2.12.so _IO_default_xsputn
> ...
>
> And with the patch:
> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> Profiling through timer interrupt
> samples % image name app name symbol name
> 434810 46.2143 no-vmlinux no-vmlinux /no-vmlinux
> 125397 13.3280 libxml2.so.2.7.6 libxml2.so.2.7.6 /usr/lib64/libxml2.so.2.7.6
> 85259 9.0619 libc-2.12.so libc-2.12.so __strcmp_sse42
> 33563 3.5673 libc-2.12.so libc-2.12.so malloc_consolidate
> 18885 2.0072 libc-2.12.so libc-2.12.so _int_malloc
> 16714 1.7765 libcrmcommon.so.3.2.0 libcrmcommon.so.3.2.0 crm_element_value
> 14966 1.5907 libfreebl3.so libfreebl3.so /lib64/libfreebl3.so
> 14510 1.5422 libc-2.12.so libc-2.12.so vfprintf
> 13664 1.4523 bash bash /bin/bash
> 13505 1.4354 libcrmcommon.so.3.2.0 libcrmcommon.so.3.2.0 find_entity
> 12605 1.3397 libqb.so.0.16.0 libqb.so.0.16.0 /usr/lib64/libqb.so.0.16.0
> 10855 1.1537 libc-2.12.so libc-2.12.so _int_free
> 9857 1.0477 libc-2.12.so libc-2.12.so _IO_default_xsputn
> ...
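[Editor's aside: both profiles above are dominated by __strcmp_sse42, much of it under the libcrmcommon lookups that appear in the listings (find_entity, crm_element_value). A minimal sketch of the GList-vs-GHashTable idea raised elsewhere in this thread, assuming glib and purely illustrative names; this is not the actual Pacemaker code.]
=========
#include <glib.h>
#include <stdio.h>
#include <string.h>

/* Linear scan: the cost grows with the number of entries, and every step is a strcmp. */
static const char *scan_lookup(GList *names, const char *wanted)
{
    for (GList *l = names; l != NULL; l = l->next) {
        if (strcmp((const char *) l->data, wanted) == 0) {
            return l->data;
        }
    }
    return NULL;
}

int main(void)
{
    GList *names = NULL;
    GHashTable *index = g_hash_table_new(g_str_hash, g_str_equal);
    char buf[32];

    for (int i = 0; i < 1024; i++) {
        g_snprintf(buf, sizeof(buf), "dummy:%d", i);
        char *name = g_strdup(buf);

        names = g_list_prepend(names, name);
        g_hash_table_insert(index, name, name);
    }

    /* Same question, very different per-lookup cost with 1024 entries. */
    printf("list scan:  %s\n", scan_lookup(names, "dummy:0"));
    printf("hash table: %s\n", (char *) g_hash_table_lookup(index, "dummy:0"));

    g_hash_table_destroy(index);
    g_list_free_full(names, g_free);
    return 0;
}
=========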
>
>
>>
>>> Otherwise we have a scalability issue in this direction.
>>>
>>>>>
>>>>> The main CPU consumer on the DC while the transition is running is crmd. Its memory
>>>>> footprint is around 85MB; the resulting CIB size together with the status
>>>>> section is around 2MB,
>>>>
>>>> You said CPU and then listed RAM...
>>>
>>> Something wrong with that? :)
>>> Those are just three distinct facts.
>>
>> I was expecting quantification of the relative CPU usage.
>> I was also expecting the PE to have massive spikes whenever a new transition is calculated.
>>
>>>
>>>>
>>>>>
>>>>> In your opinion, could this use-case be optimized with minimal
>>>>> effort? Could it be optimized with configuration alone? Or might it
>>>>> be a trivial development task, e.g. replacing one GList with a
>>>>> GHashTable somewhere?
>>>>
>>>> Optimize: yes, Minimal: no
>>>>
>>>>>
>>>>> Of course I can dig deeper and gather additional information, e.g. crmd
>>>>> profiling results, if it is hard to answer off the top of your head.
>>>>
>>>> Perhaps start looking in clone_create_probe()
>>>
>>> Got it, thanks for the pointer!
>>>
>>>>
>>>>>
>>>>> Best,
>>>>> Vladislav