[Pacemaker] Managing a large number of globally-unique clone instances

Vladislav Bogdanov bubble at hoster-ok.com
Mon Jul 21 15:07:46 CEST 2014


21.07.2014 13:37, Andrew Beekhof wrote:
> 
> On 21 Jul 2014, at 3:09 pm, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
> 
>> 21.07.2014 06:21, Andrew Beekhof wrote:
>>>
>>> On 18 Jul 2014, at 5:16 pm, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
>>>
>>>> Hi Andrew, all,
>>>>
>>>> I have a task which seems to be easily solvable with the use of a
>>>> globally-unique clone: start a huge number of specific virtual machines to
>>>> provide load to a connection multiplexer.
>>>>
>>>> I decided to look at how pacemaker behaves in such a setup with the Dummy
>>>> resource agent, and found that the handling of every instance in an
>>>> "initial" transition (probe+start) slows down as clone-max increases.
>>>
>>> "yep"
>>>
>>> For non-unique clones, the number of probes needed is N, where N is the number of nodes.
>>> For unique clones, we must test every instance and node combination, or N*M, where M is clone-max.
>>>
>>> And that's just the running of the probes... just figuring out which nodes need to be
>>> probed is incredibly resource-intensive (run crm_simulate and it will be painfully obvious).
>>>
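(To put numbers on that for my setup below: with M = 1024 unique instances and
N = 2 nodes, that is 2*1024 = 2048 probe operations in the initial transition,
versus just 2 for an anonymous clone, and each of those 2048 results has to
cross the network and be recorded in the CIB.)
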
>>>>
>>>> E.g., for 256 instances the transition took 225 seconds, ~0.88s per instance.
>>>> After I added 768 more instances (set clone-max to 1024)
> 
> How many nodes though?

Two nodes, running in VMs.

> Assuming 3, that's still only ~1s per operation (including the time taken to send the operation across the network twice and update the cib).
> 
>>>> together with
>>>> increasing batch-limit to 512, the transition took almost an hour (3507
>>>> seconds), or ~4.57s per added instance. Even if I take into account that
>>>> monitoring of already-started instances consumes some resources, the last
>>>> number seems rather big,
>>
>> I believe this ^ is the main point.
>> If with N instances the probe/start of _each_ instance takes X time slots,
>> then with 4*N instances the probe/start of _each_ instance takes ~5*X time
>> slots. In an ideal world, I would expect it to remain constant.
> 
> Unless you have 512 cores in the cluster, increasing the batch-limit in this way is certainly not going to give you the results you're looking for.
> Firing more tasks at a machine just ends up producing more context switches as the kernel tries to juggle the various tasks.
> 
> More context switches == more CPU wasted == more time taken overall == completely consistent with your results. 

Thanks to oprofile, I was able to gain an 8-9% speedup with the following patch,
which drops two unconditional debug dumps of the transition graph:
=========
diff --git a/crmd/te_utils.c b/crmd/te_utils.c
index 2167370..c612718 100644
--- a/crmd/te_utils.c
+++ b/crmd/te_utils.c
@@ -374,8 +374,6 @@ te_graph_trigger(gpointer user_data)
         graph_rc = run_graph(transition_graph);
         transition_graph->batch_limit = limit; /* Restore the configured value */
 
-        print_graph(LOG_DEBUG_3, transition_graph);
-
         if (graph_rc == transition_active) {
             crm_trace("Transition not yet complete");
             return TRUE;
diff --git a/crmd/tengine.c b/crmd/tengine.c
index 765628c..ec0e1d4 100644
--- a/crmd/tengine.c
+++ b/crmd/tengine.c
@@ -221,7 +221,6 @@ do_te_invoke(long long action,
         }
 
         trigger_graph();
-        print_graph(LOG_DEBUG_2, transition_graph);
 
         if (graph_data != input->xml) {
             free_xml(graph_data);
=========
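(Judging by the profiles below, print_graph() formats every synapse of the
transition graph on each invocation whether or not the requested debug level is
active: print_synapse() accounts for ~4% of all samples in the unpatched run and
disappears entirely in the patched one. With a graph of a thousand-plus actions
being re-printed on every graph trigger, that adds up.)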

Results this time are measured only for a clean start operation, after probes are done (add
the stopped clone, wait for the probes to complete, and then start the clone).
 256(vanilla): 09:51:50 - 09:53:17 =>  1:27 =   87s => 0.33984375 s per instance
1024(vanilla): 10:17:10 - 10:34:34 => 17:24 = 1044s => 1.01953125 s per instance
1024(patched): 11:59:26 - 12:15:12 => 15:46 =  946s => 0.92382813 s per instance
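
(Note the scaling in the vanilla numbers: 4x the instances costs exactly 12x the
total time, i.e. ~3x per instance, so the total cost grows closer to
quadratically than linearly with the number of instances.)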

So, still not perfect, but better.

Unfortunately, my binaries are built with optimization, so I'm not able to get call
graphs yet.

Also, as I run in VMs, no hardware support for oprofile is available, so the results may
be a bit inaccurate.

Here is the top of the system-wide opreport output for the unpatched crmd with 1024 instances:
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        image name               app name                 symbol name
429963   41.3351  no-vmlinux               no-vmlinux               /no-vmlinux
129533   12.4528  libxml2.so.2.7.6         libxml2.so.2.7.6         /usr/lib64/libxml2.so.2.7.6
101326    9.7411  libc-2.12.so             libc-2.12.so             __strcmp_sse42
42524     4.0881  libtransitioner.so.2.0.1 libtransitioner.so.2.0.1 print_synapse
37062     3.5630  libc-2.12.so             libc-2.12.so             malloc_consolidate
23268     2.2369  libcrmcommon.so.3.2.0    libcrmcommon.so.3.2.0    find_entity
21416     2.0589  libc-2.12.so             libc-2.12.so             _int_malloc
18950     1.8218  libcrmcommon.so.3.2.0    libcrmcommon.so.3.2.0    crm_element_value
17482     1.6807  libfreebl3.so            libfreebl3.so            /lib64/libfreebl3.so
15350     1.4757  libc-2.12.so             libc-2.12.so             vfprintf
15016     1.4436  libqb.so.0.16.0          libqb.so.0.16.0          /usr/lib64/libqb.so.0.16.0
13189     1.2679  bash                     bash                     /bin/bash
11375     1.0936  libc-2.12.so             libc-2.12.so             _int_free
10762     1.0346  libtotem_pg.so.5.0.0     libtotem_pg.so.5.0.0     /usr/lib64/libtotem_pg.so.5.0.0
10345     0.9945  libc-2.12.so             libc-2.12.so             _IO_default_xsputn
...

And with the patch:
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        image name               app name                 symbol name
434810   46.2143  no-vmlinux               no-vmlinux               /no-vmlinux
125397   13.3280  libxml2.so.2.7.6         libxml2.so.2.7.6         /usr/lib64/libxml2.so.2.7.6
85259     9.0619  libc-2.12.so             libc-2.12.so             __strcmp_sse42
33563     3.5673  libc-2.12.so             libc-2.12.so             malloc_consolidate
18885     2.0072  libc-2.12.so             libc-2.12.so             _int_malloc
16714     1.7765  libcrmcommon.so.3.2.0    libcrmcommon.so.3.2.0    crm_element_value
14966     1.5907  libfreebl3.so            libfreebl3.so            /lib64/libfreebl3.so
14510     1.5422  libc-2.12.so             libc-2.12.so             vfprintf
13664     1.4523  bash                     bash                     /bin/bash
13505     1.4354  libcrmcommon.so.3.2.0    libcrmcommon.so.3.2.0    find_entity
12605     1.3397  libqb.so.0.16.0          libqb.so.0.16.0          /usr/lib64/libqb.so.0.16.0
10855     1.1537  libc-2.12.so             libc-2.12.so             _int_free
9857      1.0477  libc-2.12.so             libc-2.12.so             _IO_default_xsputn
...
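
Comparing the two runs: print_synapse (4.09% of samples) is gone entirely, and
find_entity drops from 2.24% to 1.44%. The biggest remaining userspace consumers
are libxml2 and __strcmp_sse42, which presumably means most of the CPU now goes
into XML tree traversal and attribute lookups on the CIB.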


> 
>> Otherwise we have a scalability issue in this direction.
>>
>>>>
>>>> The main CPU consumer on the DC while the transition is running is crmd. Its memory
>>>> footprint is around 85Mb, and the resulting CIB size together with the status
>>>> section is around 2Mb,
>>>
>>> You said CPU and then listed RAM...
>>
>> Something wrong with that? :)
>> Those are just three distinct facts.
> 
> I was expecting quantification of the relative CPU usage.
> I was also expecting the PE to have massive spikes whenever a new transition is calculated.
> 
>>
>>>
>>>>
>>>> In your opinion, could this use-case be optimized with
>>>> minimal effort? Could it be optimized with just configuration? Or might it
>>>> be some trivial development task, e.g. replacing one GList with a
>>>> GHashTable somewhere?
>>>
>>> Optimize: yes, Minimal: no
>>>
>>>>
>>>> Sure, I can look deeper and get any additional information, e.g. crmd
>>>> profiling results, if it is hard to get an answer just off the top of your head.
>>>
>>> Perhaps start looking in clone_create_probe()
>>
>> Got it, thanks for the pointer!
>>
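
For the record, here is the kind of change I meant by the GList/GHashTable remark
above. This is only an illustrative sketch with made-up types, not actual
pacemaker code; whether clone_create_probe() really contains a lookup of this
shape is exactly what I still need to check:
=========
#include <glib.h>
#include <string.h>

/* Made-up stand-in for a resource object, for illustration only. */
typedef struct {
    const char *id;
} fake_rsc_t;

/* With a GList, every lookup by id is a linear scan: O(n) per query,
 * so one query per instance costs O(n^2) in total. */
static fake_rsc_t *
find_in_list(GList *resources, const char *wanted_id)
{
    GList *iter;

    for (iter = resources; iter != NULL; iter = iter->next) {
        fake_rsc_t *rsc = iter->data;

        if (strcmp(rsc->id, wanted_id) == 0) {
            return rsc;
        }
    }
    return NULL;
}

/* With a GHashTable keyed by id (built once with g_hash_table_new(g_str_hash,
 * g_str_equal) and one g_hash_table_insert() per resource), each lookup is
 * O(1) amortized, so n lookups cost O(n) in total. */
static fake_rsc_t *
find_in_table(GHashTable *by_id, const char *wanted_id)
{
    return g_hash_table_lookup(by_id, wanted_id);
}
=========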
>>>
>>>>
>>>> Best,
>>>> Vladislav
>>>>