[Pacemaker] Managing big number of globally-unique clone instances

Tue Jul 22 02:44:59 CEST 2014

On 21 Jul 2014, at 11:07 pm, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:

> 21.07.2014 13:37, Andrew Beekhof wrote:
>> 
>> On 21 Jul 2014, at 3:09 pm, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
>> 
>>> 21.07.2014 06:21, Andrew Beekhof wrote:
>>>> 
>>>> On 18 Jul 2014, at 5:16 pm, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
>>>> 
>>>>> Hi Andrew, all,
>>>>> 
>>>>> I have a task which seems to be easily solvable with the use of a
>>>>> globally-unique clone: start a huge number of specific virtual machines to
>>>>> provide load to a connection multiplexer.
>>>>> 
>>>>> I decided to look at how Pacemaker behaves in such a setup with the Dummy
>>>>> resource agent, and found that handling of every instance in an
>>>>> "initial" transition (probe+start) slows down as clone-max increases.
>>>> 
>>>> "yep"
>>>> 
>>>> for non unique clones the number of probes needed is N, where N is the number of nodes.
>>>> for unique clones, we must test every instance and node combination, or N*M, where M is clone-max.
>>>> 
>>>> And that's just the running of the probes... just figuring out which nodes need to be
>>>> probed is incredibly resource intensive (run crm_simulate and it will be painfully obvious). 
>>>> 
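A minimal sketch of the probe-count rule quoted above (N probes for non-unique clones, N*M for globally-unique ones), assuming the two-node cluster mentioned later in this thread. The function name `probe_count` is hypothetical and is not part of Pacemaker; it only reproduces the arithmetic, not the scheduler's behaviour.

```python
def probe_count(nodes: int, clone_max: int, globally_unique: bool) -> int:
    """Probes needed in the initial transition, per the N vs. N*M rule above."""
    if globally_unique:
        # Every instance must be checked on every node.
        return nodes * clone_max
    # Anonymous clones need only one probe per node.
    return nodes


if __name__ == "__main__":
    nodes = 2  # assumption: the two-node setup described in this thread
    for clone_max in (256, 1024):
        print(
            f"clone-max={clone_max}: "
            f"non-unique={probe_count(nodes, clone_max, False)}, "
            f"unique={probe_count(nodes, clone_max, True)}"
        )
```

Under that assumption, going from clone-max 256 to 1024 raises the probe count for a globally-unique clone from 512 to 2048, while a non-unique clone stays at 2.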
>>>>> 
>>>>> F.e. for 256 instances transition took 225 seconds, ~0.88s per instance.
>>>>> After I added 768 more instances (set clone-max to 1024)
>> 
>> How many nodes though?
> 
> Two nodes run in VMs.
> 
>> Assuming 3, that's still only ~1s per operation (including the time taken to send the operation across the network twice and update the cib).
>> 
>>>>> together with
>>>>> increasing batch-limit to 512, transition took almost an hour (3507
>>>>> seconds), or ~4.57s per added instance. Even if I take into account that
>>>>> monitoring of already started instances consumes some resources, the last
>>>>> number seems to be rather big,
>>> 
>>> I believe this ^ is the main point.
>>> If with N instances probe/start of _each_ instance takes X time slots,
>>> then with 4*N instances probe/start of _each_ instance takes ~5*X time
>>> slots. In an ideal world, I would expect it to remain constant.
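The "~5*X" figure can be checked against the numbers quoted earlier in the thread (256 instances in 225 seconds, then 768 added instances in 3507 seconds). This is only a back-of-the-envelope sketch: the variable names are illustrative, and the second run also changed batch-limit, so the ratio is not a clean measurement of scaling alone.

```python
# Figures quoted in the thread.
first_instances, first_seconds = 256, 225
added_instances, added_seconds = 768, 3507

per_instance_first = first_seconds / first_instances   # seconds per instance at clone-max=256
per_instance_added = added_seconds / added_instances   # seconds per added instance at clone-max=1024
slowdown = per_instance_added / per_instance_first     # how much slower each instance became

print(f"{per_instance_first:.2f} s vs {per_instance_added:.2f} s per instance "
      f"(about {slowdown:.1f}x slower)")
```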
>> 
>> Unless you have 512 cores in the cluster, increasing the batch-limit in this way is certainly not going to give you the results you're looking for.
>> Firing more tasks at a machine just ends up producing more context switches as the kernel tries to juggle the various tasks.
>> 
>> More context switches == more CPU wasted == more time taken overall == completely consistent with your results. 
> 
> Thanks to oprofile, I was able to gain a speedup of 8-9% with the following patch:
> =========
> diff --git a/crmd/te_utils.c b/crmd/te_utils.c
> index 2167370..c612718 100644
> --- a/crmd/te_utils.c
> +++ b/crmd/te_utils.c
> @@ -374,8 +374,6 @@ te_graph_trigger(gpointer user_data)
>         graph_rc = run_graph(transition_graph);
>         transition_graph->batch_limit = limit; /* Restore the configured value */
> 
> -        print_graph(LOG_DEBUG_3, transition_graph);
> -

This one can go... it gets called every time an action finishes.

>         if (graph_rc == transition_active) {
>             crm_trace("Transition not yet complete");
>             return TRUE;
> diff --git a/crmd/tengine.c b/crmd/tengine.c
> index 765628c..ec0e1d4 100644
> --- a/crmd/tengine.c
> +++ b/crmd/tengine.c
> @@ -221,7 +221,6 @@ do_te_invoke(long long action,
>         }
> 
>         trigger_graph();
> -        print_graph(LOG_DEBUG_2, transition_graph);

This is once per transition though... shouldn't hurt much.

> 
>         if (graph_data != input->xml) {
>             free_xml(graph_data);
> =========
> 
> Results this time are measured only for clean start op, after probes are done (add
> stopped clone, wait for probes to complete and then start clone).
> 256(vanilla): 09:51:50 - 09:53:17 =>  1:27 =   87s => 0.33984375 s per instance
> 1024(vanilla): 10:17:10 - 10:34:34 => 17:24 = 1044s => 1.01953125 s per instance
> 1024(patched): 11:59:26 - 12:15:12 => 15:46 =  946s => 0.92382813 s per instance
> 
> So, still not perfect, but better.
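The durations and per-instance figures above can be re-derived from the quoted timestamps. The sketch below assumes all three runs started and finished on the same day; the helper `elapsed_seconds` and the dictionary layout are illustrative only.

```python
from datetime import datetime


def elapsed_seconds(start: str, end: str) -> int:
    """Seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# (instances, start, end) exactly as quoted in the results above.
runs = {
    "256 (vanilla)": (256, "09:51:50", "09:53:17"),
    "1024 (vanilla)": (1024, "10:17:10", "10:34:34"),
    "1024 (patched)": (1024, "11:59:26", "12:15:12"),
}

durations = {name: elapsed_seconds(start, end) for name, (_, start, end) in runs.items()}
per_instance = {name: durations[name] / runs[name][0] for name in runs}

# Relative time saved by the patch at clone-max=1024.
speedup = 1 - durations["1024 (patched)"] / durations["1024 (vanilla)"]

for name in runs:
    print(f"{name}: {durations[name]} s total, {per_instance[name]:.3f} s per instance")
print(f"patched vs vanilla at 1024 instances: {speedup:.1%} less time")
```

The totals match the 87 s, 1044 s and 946 s listed above, and the patched run takes roughly 9% less time than the vanilla one at 1024 instances.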
> 
> Unfortunately, my binaries are built with optimization, so I'm not able to get call
> graphs yet.
> 
> Also, as I run in VMs, no hardware support for oprofile is available, so results may be
> a bit inaccurate.
> 
> Here is system-wide opreport's top for unpatched crmd with 1024 instances:
> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> Profiling through timer interrupt
> samples  %        image name               app name                 symbol name
> 429963   41.3351  no-vmlinux               no-vmlinux               /no-vmlinux
> 129533   12.4528  libxml2.so.2.7.6         libxml2.so.2.7.6         /usr/lib64/libxml2.so.2.7.6
> 101326    9.7411  libc-2.12.so             libc-2.12.so             __strcmp_sse42
> 42524     4.0881  libtransitioner.so.2.0.1 libtransitioner.so.2.0.1 print_synapse
> 37062     3.5630  libc-2.12.so             libc-2.12.so             malloc_consolidate
> 23268     2.2369  libcrmcommon.so.3.2.0    libcrmcommon.so.3.2.0    find_entity
> 21416     2.0589  libc-2.12.so             libc-2.12.so             _int_malloc
> 18950     1.8218  libcrmcommon.so.3.2.0    libcrmcommon.so.3.2.0    crm_element_value
> 17482     1.6807  libfreebl3.so            libfreebl3.so            /lib64/libfreebl3.so
> 15350     1.4757  libc-2.12.so             libc-2.12.so             vfprintf
> 15016     1.4436  libqb.so.0.16.0          libqb.so.0.16.0          /usr/lib64/libqb.so.0.16.0
> 13189     1.2679  bash                     bash                     /bin/bash
> 11375     1.0936  libc-2.12.so             libc-2.12.so             _int_free
> 10762     1.0346  libtotem_pg.so.5.0.0     libtotem_pg.so.5.0.0     /usr/lib64/libtotem_pg.so.5.0.0
> 10345     0.9945  libc-2.12.so             libc-2.12.so             _IO_default_xsputn
> ...
> 
> And with patch:
> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> Profiling through timer interrupt
> samples  %        image name               app name                 symbol name
> 434810   46.2143  no-vmlinux               no-vmlinux               /no-vmlinux
> 125397   13.3280  libxml2.so.2.7.6         libxml2.so.2.7.6         /usr/lib64/libxml2.so.2.7.6
> 85259     9.0619  libc-2.12.so             libc-2.12.so             __strcmp_sse42
> 33563     3.5673  libc-2.12.so             libc-2.12.so             malloc_consolidate
> 18885     2.0072  libc-2.12.so             libc-2.12.so             _int_malloc
> 16714     1.7765  libcrmcommon.so.3.2.0    libcrmcommon.so.3.2.0    crm_element_value
> 14966     1.5907  libfreebl3.so            libfreebl3.so            /lib64/libfreebl3.so
> 14510     1.5422  libc-2.12.so             libc-2.12.so             vfprintf
> 13664     1.4523  bash                     bash                     /bin/bash
> 13505     1.4354  libcrmcommon.so.3.2.0    libcrmcommon.so.3.2.0    find_entity
> 12605     1.3397  libqb.so.0.16.0          libqb.so.0.16.0          /usr/lib64/libqb.so.0.16.0
> 10855     1.1537  libc-2.12.so             libc-2.12.so             _int_free
> 9857      1.0477  libc-2.12.so             libc-2.12.so             _IO_default_xsputn
> ...
> 
> 
>> 
>>> Otherwise we have an issue with scalability in this direction.
>>> 
>>>>> 
>>>>> The main CPU consumer on the DC while the transition is running is crmd. Its
>>>>> memory footprint is around 85 MB; the resulting CIB size together with the
>>>>> status section is around 2 MB,
>>>> 
>>>> You said CPU and then listed RAM...
>>> 
>>> Something wrong with that? :)
>>> That's just three distinct facts.
>> 
>> I was expecting quantification of the relative CPU usage.
>> I was also expecting the PE to have massive spikes whenever a new transition is calculated.
>> 
>>> 
>>>> 
>>>>> 
>>>>> In your opinion, could this use case be optimized with minimal
>>>>> effort? Could it be optimized with just configuration? Or might
>>>>> it be some trivial development task, e.g. replacing one GList with
>>>>> a GHashTable somewhere?
>>>> 
>>>> Optimize: yes, Minimal: no
>>>> 
>>>>> 
>>>>> Sure, I can look deeper and get any additional information, e.g.
>>>>> crmd profiling results, if it is hard to answer off the top of your head.
>>>> 
>>>> Perhaps start looking in clone_create_probe()
>>> 
>>> Got it, thanks for the pointer!
>>> 
>>>> 
>>>>> 
>>>>> Best,
>>>>> Vladislav
>>>>> 
>>>>> _______________________________________________
>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>> 
>>>>> Project Home: http://www.clusterlabs.org
>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>> Bugs: http://bugs.clusterlabs.org
>>>> 
