[Pacemaker] Managing big number of globally-unique clone instances
Vladislav Bogdanov
bubble at hoster-ok.com
Mon Jul 21 13:07:46 UTC 2014
21.07.2014 13:37, Andrew Beekhof wrote:
>
> On 21 Jul 2014, at 3:09 pm, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
>
>> 21.07.2014 06:21, Andrew Beekhof wrote:
>>>
>>> On 18 Jul 2014, at 5:16 pm, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
>>>
>>>> Hi Andrew, all,
>>>>
>>>> I have a task which seems to be easily solvable with a
>>>> globally-unique clone: start a huge number of specific virtual machines to
>>>> provide load to a connection multiplexer.
>>>>
>>>> I decided to look at how pacemaker behaves in such a setup with the Dummy
>>>> resource agent, and found that the handling of every instance in an
>>>> "initial" transition (probe+start) slows down as clone-max increases.
>>>
>>> "yep"
>>>
>>> For non-unique clones, the number of probes needed is N, where N is the number of nodes.
>>> For unique clones, we must test every instance and node combination, or N*M, where M is clone-max.
>>>
>>> And that's just the running of the probes... just figuring out which nodes need to be
>>> probed is incredibly resource intensive (run crm_simulate and it will be painfully obvious).
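To put numbers on the N versus N*M rule quoted above, here is a minimal Python sketch. The function name `probe_count` is hypothetical, and the node count of two is the size of the test cluster reported in this thread. It only counts probes and says nothing about how Pacemaker actually schedules them.

```python
def probe_count(nodes: int, clone_max: int, globally_unique: bool) -> int:
    """Probes needed for one clone resource, following the rule quoted above."""
    if globally_unique:
        # Every instance must be checked on every node: N * M.
        return nodes * clone_max
    # Anonymous clone: one probe per node is enough.
    return nodes


# Two nodes (assumption taken from the test cluster in this thread).
for clone_max in (256, 1024):
    anonymous = probe_count(2, clone_max, globally_unique=False)
    unique = probe_count(2, clone_max, globally_unique=True)
    print(f"clone-max={clone_max}: anonymous={anonymous}, unique={unique}")
```

Going from 256 to 1024 instances quadruples the probe count for the unique clone, while the anonymous clone stays at one probe per node.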
>>>
>>>>
>>>> E.g., for 256 instances the transition took 225 seconds, ~0.88s per instance.
>>>> After I added 768 more instances (setting clone-max to 1024)
>
> How many nodes though?
Two nodes, both running in VMs.
> Assuming 3, that's still only ~1s per operation (including the time taken to send the operation across the network twice and update the cib).
>
>>>> together with
>>>> increasing batch-limit to 512, the transition took almost an hour (3507
>>>> seconds), or ~4.57s per added instance. Even if I take into account that
>>>> monitoring of already-started instances consumes some resources, the last
>>>> number seems rather big,
>>
>> I believe this ^ is the main point.
>> If with N instances probe/start of _each_ instance takes X time slots,
>> then with 4*N instances probe/start of _each_ instance takes ~5*X time
>> slots. In an ideal world, I would expect it to remain constant.
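The "~5*X" figure follows directly from the two measurements quoted in this thread (225 s for the first 256 instances, 3507 s for the 768 added ones). A small Python sketch of that arithmetic, with a hypothetical helper name:

```python
def seconds_per_instance(total_seconds: float, instances: int) -> float:
    """Average wall-clock time spent per instance in one transition."""
    return total_seconds / instances


first = seconds_per_instance(225, 256)    # initial 256 instances
added = seconds_per_instance(3507, 768)   # 768 instances added afterwards
ratio = added / first

print(f"first batch: {first:.2f} s per instance")
print(f"added batch: {added:.2f} s per instance")
print(f"slowdown:    {ratio:.1f}x")
```

With constant per-instance cost the ratio would be 1. Here it comes out at a bit over 5.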
>
> Unless you have 512 cores in the cluster, increasing the batch-limit in this way is certainly not going to give you the results you're looking for.
> Firing more tasks at a machine just ends up in producing more context switches as the kernel tries to juggle the various tasks.
>
> More context switches == more CPU wasted == more time taken overall == completely consistent with your results.
Thanks to oprofile, I was able to gain a speedup of 8-9% with the following patch:
=========
diff --git a/crmd/te_utils.c b/crmd/te_utils.c
index 2167370..c612718 100644
--- a/crmd/te_utils.c
+++ b/crmd/te_utils.c
@@ -374,8 +374,6 @@ te_graph_trigger(gpointer user_data)
graph_rc = run_graph(transition_graph);
transition_graph->batch_limit = limit; /* Restore the configured value */
- print_graph(LOG_DEBUG_3, transition_graph);
-
if (graph_rc == transition_active) {
crm_trace("Transition not yet complete");
return TRUE;
diff --git a/crmd/tengine.c b/crmd/tengine.c
index 765628c..ec0e1d4 100644
--- a/crmd/tengine.c
+++ b/crmd/tengine.c
@@ -221,7 +221,6 @@ do_te_invoke(long long action,
}
trigger_graph();
- print_graph(LOG_DEBUG_2, transition_graph);
if (graph_data != input->xml) {
free_xml(graph_data);
=========
This time the results are measured only for a clean start operation, after the probes are done (add a
stopped clone, wait for the probes to complete, and then start the clone).
256(vanilla): 09:51:50 - 09:53:17 => 1:27 = 87s => 0.33984375 s per instance
1024(vanilla): 10:17:10 - 10:34:34 => 17:24 = 1044s => 1.01953125 s per instance
1024(patched): 11:59:26 - 12:15:12 => 15:46 = 946s => 0.92382813 s per instance
So, still not perfect, but better.
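For anyone who wants to recheck the numbers, here is a short Python sketch that recomputes the durations and per-instance times from the timestamps above. The helper name `elapsed_seconds` is made up for illustration, and it assumes that both timestamps of a run fall on the same day.

```python
from datetime import datetime


def elapsed_seconds(start: str, end: str) -> int:
    """Seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# (label, start, end, number of instances), copied from the results above.
runs = [
    ("256 (vanilla)", "09:51:50", "09:53:17", 256),
    ("1024 (vanilla)", "10:17:10", "10:34:34", 1024),
    ("1024 (patched)", "11:59:26", "12:15:12", 1024),
]

durations = {}
for label, start, end, instances in runs:
    durations[label] = elapsed_seconds(start, end)
    per_instance = durations[label] / instances
    print(f"{label}: {durations[label]} s, {per_instance:.4f} s per instance")

# Relative reduction in wall-clock time for the 1024-instance case.
saving = 1 - durations["1024 (patched)"] / durations["1024 (vanilla)"]
print(f"patched vs vanilla: {saving:.1%} less time")
```

For this single pair of 1024-instance runs the reduction works out to a little over 9%.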
Unfortunately, my binaries are built with optimization, so I'm not able to get call
graphs yet.
Also, as I run in VMs, no hardware support for oprofile is available, so the results may be
slightly inaccurate.
Here is the top of the system-wide opreport output for the unpatched crmd with 1024 instances:
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        image name                app name                  symbol name
429963   41.3351  no-vmlinux                no-vmlinux                /no-vmlinux
129533   12.4528  libxml2.so.2.7.6          libxml2.so.2.7.6          /usr/lib64/libxml2.so.2.7.6
101326   9.7411   libc-2.12.so              libc-2.12.so              __strcmp_sse42
42524    4.0881   libtransitioner.so.2.0.1  libtransitioner.so.2.0.1  print_synapse
37062    3.5630   libc-2.12.so              libc-2.12.so              malloc_consolidate
23268    2.2369   libcrmcommon.so.3.2.0     libcrmcommon.so.3.2.0     find_entity
21416    2.0589   libc-2.12.so              libc-2.12.so              _int_malloc
18950    1.8218   libcrmcommon.so.3.2.0     libcrmcommon.so.3.2.0     crm_element_value
17482    1.6807   libfreebl3.so             libfreebl3.so             /lib64/libfreebl3.so
15350    1.4757   libc-2.12.so              libc-2.12.so              vfprintf
15016    1.4436   libqb.so.0.16.0           libqb.so.0.16.0           /usr/lib64/libqb.so.0.16.0
13189    1.2679   bash                      bash                      /bin/bash
11375    1.0936   libc-2.12.so              libc-2.12.so              _int_free
10762    1.0346   libtotem_pg.so.5.0.0      libtotem_pg.so.5.0.0      /usr/lib64/libtotem_pg.so.5.0.0
10345    0.9945   libc-2.12.so              libc-2.12.so              _IO_default_xsputn
...
And with the patch:
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        image name                app name                  symbol name
434810   46.2143  no-vmlinux                no-vmlinux                /no-vmlinux
125397   13.3280  libxml2.so.2.7.6          libxml2.so.2.7.6          /usr/lib64/libxml2.so.2.7.6
85259    9.0619   libc-2.12.so              libc-2.12.so              __strcmp_sse42
33563    3.5673   libc-2.12.so              libc-2.12.so              malloc_consolidate
18885    2.0072   libc-2.12.so              libc-2.12.so              _int_malloc
16714    1.7765   libcrmcommon.so.3.2.0     libcrmcommon.so.3.2.0     crm_element_value
14966    1.5907   libfreebl3.so             libfreebl3.so             /lib64/libfreebl3.so
14510    1.5422   libc-2.12.so              libc-2.12.so              vfprintf
13664    1.4523   bash                      bash                      /bin/bash
13505    1.4354   libcrmcommon.so.3.2.0     libcrmcommon.so.3.2.0     find_entity
12605    1.3397   libqb.so.0.16.0           libqb.so.0.16.0           /usr/lib64/libqb.so.0.16.0
10855    1.1537   libc-2.12.so              libc-2.12.so              _int_free
9857     1.0477   libc-2.12.so              libc-2.12.so              _IO_default_xsputn
...
>
>> Otherwise we have a scalability issue in this direction.
>>
>>>>
>>>> The main CPU consumer on the DC while the transition is running is crmd. Its memory
>>>> footprint is around 85 MB, and the resulting CIB size, together with the status
>>>> section, is around 2 MB,
>>>
>>> You said CPU and then listed RAM...
>>
>> Something wrong with that? :)
>> Those are just three distinct facts.
>
> I was expecting quantification of the relative CPU usage.
> I was also expecting the PE to have massive spikes whenever a new transition is calculated.
>
>>
>>>
>>>>
>>>> In your opinion, would it be possible to optimize this use case with
>>>> minimal effort? Could it be optimized with configuration alone? Or might
>>>> it be some trivial development task, e.g. replacing one GList with a
>>>> GHashTable somewhere?
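To illustrate why a list-to-hash-table swap can matter at this scale, here is a rough Python sketch. A Python list and dict stand in for GList and GHashTable, and the instance ids of the form `dummy:N` are hypothetical. It does not model any specific Pacemaker code path. It only shows how the number of id comparisons per lookup grows with a linear scan.

```python
def find_in_list(items, wanted_id):
    """Linear scan: compare ids one by one, as when walking a linked list."""
    comparisons = 0
    for item in items:
        comparisons += 1
        if item["id"] == wanted_id:
            return item, comparisons
    return None, comparisons


# 1024 hypothetical clone instances, kept both as a list and as a dict.
instances = [{"id": f"dummy:{i}"} for i in range(1024)]
by_id = {item["id"]: item for item in instances}

# Worst case for the scan: the wanted instance is the last one.
item, comparisons = find_in_list(instances, "dummy:1023")
print(f"list scan: {comparisons} id comparisons")

# The dict finds the same object by hashing the key, without a scan.
print("dict lookup finds same object:", by_id["dummy:1023"] is item)
```

If such a scan runs once per instance, the total work grows quadratically with clone-max, which would show up as time spent in string comparison.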
>>>
>>> Optimize: yes, Minimal: no
>>>
>>>>
>>>> Sure, I can look deeper and provide any additional information, e.g.
>>>> crmd profiling results, if it is hard to give an answer off the top of your head.
>>>
>>> Perhaps start looking in clone_create_probe()
>>
>> Got it, thanks for the pointer!
>>
>>>
>>>>
>>>> Best,
>>>> Vladislav
>>>>
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org