[Pacemaker] pacemaker processes RSS growth
Vladislav Bogdanov
bubble at hoster-ok.com
Thu Sep 6 09:58:45 UTC 2012
06.09.2012 12:39, Andrew Beekhof wrote:
> On Thu, Sep 6, 2012 at 5:33 PM, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
>> 06.09.2012 10:19, Andrew Beekhof wrote:
>>> On Thu, Sep 6, 2012 at 5:14 PM, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
>>>> Hi,
>>>>
>>>> I noticed that some pacemaker processes grow during operation (commit
>>>> 8535316). Running on top of corosync 2.0.1.
>>>> I sampled RSS (RES, as htop reports it) at an interval of ~18 hours.
>>>> The first column was sampled after ~1 hour of operation.
>>>>
>>>> Results are:
>>>> pengine 23568 23780
>>>> crmd 15592 17420
>>>> cib 12356 12380
>>>> lrmd 4396 14128
>>>> stonithd 3812 3812
>>>> attrd 3240 3244
>>>> pacemakerd 3104 3104
>
> What unit are these values?
kb
>
>>>>
>>>> The cluster is completely static, except that cluster-recheck-interval
>>>> is set to 3min. No actions were taken after the first sample.
>>>>
>>>> This makes me think of some slow memory leaks in crmd and lrmd, but I
>>>> can't say that for sure because of glib.
>>>>
>>>> I do not know if CTS or coverity covers this, so I can try to run under
>>>> valgrind if somebody gives me instructions on how to do that.
>>>
>>> Check the bottom of /etc/sysconfig/pacemaker :-)
>>
>> Heh :)
>>
>>> Valgrind takes a heavy toll though... perhaps set
>>
>> I know. That is a test cluster, so it doesn't matter.
>>
>>>
>>> export G_SLICE=always-malloc
>>
>> Yes, I know. That does not help with one-time glib initializations
>> though, like the gtype one.
>
> Right, but they should be accounted for already in the first column.
They usually take too much space in valgrind reports. Pacemaker doesn't
seem to be affected much though.
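For anyone reading this later, the knobs in question sit at the bottom of /etc/sysconfig/pacemaker and look roughly like this (a sketch from memory; the exact variable names and log path vary between versions, so check the comments in your own copy):

```
# Sketch of the valgrind knobs in /etc/sysconfig/pacemaker
# (names and paths may differ by version)

# Which daemons to run under valgrind
PCMK_valgrind_enabled=crmd,lrmd

# Make glib use plain malloc so valgrind can track every allocation
G_SLICE=always-malloc

# Options handed to valgrind itself; %p expands to the daemon's pid
VALGRIND_OPTS="--leak-check=full --trace-children=no --num-callers=25 --log-file=/tmp/pacemaker-valgrind-%p.log"
```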
>
>> Maybe you have an appropriate
>> .supp for all daemons? I see only cts.supp, ptest.supp and cli.supp.
>
> The PE shouldn't be leaking at all.
> Every 5th commit to master runs the PE and cli (which also exercises
> most of the cib) regression tests with valgrind and it hasn't reported
> anything.
> Stonithd appears to be in the clear too.
Yep, those two are OK.
cib shows tons of reachable memory in finished child processes. Not
important, but the log is huge, so it is very hard to find real errors.
crmd has some errors; the most annoying are the following two (they repeat
constantly, and I see the first one for other processes too):
==1737== Syscall param socketcall.sendto(msg) points to uninitialised byte(s)
==1737== at 0x7AE6002: send (in /lib64/libc-2.12.so)
==1737== by 0x869FE7F: qb_ipc_us_send (in /usr/lib64/libqb.so.0.13.0)
==1737== by 0x86A0FE5: qb_ipcc_us_setup_connect (in /usr/lib64/libqb.so.0.13.0)
==1737== by 0x869DD33: qb_ipcc_connect (in /usr/lib64/libqb.so.0.13.0)
==1737== by 0x5885459: crm_ipc_connect (ipc.c:341)
==1737== by 0x589DC27: mainloop_add_ipc_client (mainloop.c:613)
==1737== by 0x545C842: cib_native_signon_raw (cib_native.c:222)
==1737== by 0x41802D: do_cib_control (cib.c:167)
==1737== by 0x407726: s_crmd_fsa_actions (fsa.c:315)
==1737== by 0x408D85: s_crmd_fsa (fsa.c:250)
==1737== by 0x4109CD: crm_fsa_trigger (callbacks.c:251)
==1737== by 0x589C3C2: crm_trigger_dispatch (mainloop.c:105)
==1737== by 0x81DAF0D: g_main_context_dispatch (in /lib64/libglib-2.0.so.0.2200.5)
==1737== by 0x81DE937: ??? (in /lib64/libglib-2.0.so.0.2200.5)
==1737== by 0x81DED54: g_main_loop_run (in /lib64/libglib-2.0.so.0.2200.5)
==1737== by 0x404D9D: crmd_init (main.c:139)
==1737== by 0x7A1BCDC: (below main) (in /lib64/libc-2.12.so)
==1737== Address 0x7feffd514 is on thread 1's stack
==1737==
==1737== Invalid read of size 8
==1737== at 0x7B2FD44: __strspn_sse42 (in /lib64/libc-2.12.so)
==1737== by 0x588AA53: crm_get_msec (utils.c:640)
==1737== by 0x588AD28: check_time (utils.c:85)
==1737== by 0x588C8E8: cluster_option (utils.c:215)
==1737== by 0x588CB8D: verify_all_options (utils.c:287)
==1737== by 0x40B9AA: config_query_callback (control.c:779)
==1737== by 0x5457393: cib_native_callback (cib_utils.c:631)
==1737== by 0x545CD86: cib_native_dispatch_internal (cib_native.c:120)
==1737== by 0x589D71F: mainloop_gio_callback (mainloop.c:522)
==1737== by 0x81DAF0D: g_main_context_dispatch (in /lib64/libglib-2.0.so.0.2200.5)
==1737== by 0x81DE937: ??? (in /lib64/libglib-2.0.so.0.2200.5)
==1737== by 0x81DED54: g_main_loop_run (in /lib64/libglib-2.0.so.0.2200.5)
==1737== by 0x404D9D: crmd_init (main.c:139)
==1737== by 0x7A1BCDC: (below main) (in /lib64/libc-2.12.so)
==1737== Address 0xc158bd0 is 0 bytes inside a block of size 4 alloc'd
==1737== at 0x4C26FDE: malloc (vg_replace_malloc.c:236)
==1737== by 0x7A7D871: strdup (in /lib64/libc-2.12.so)
==1737== by 0x588CA70: cluster_option (utils.c:211)
==1737== by 0x588CB8D: verify_all_options (utils.c:287)
==1737== by 0x40B9AA: config_query_callback (control.c:779)
==1737== by 0x5457393: cib_native_callback (cib_utils.c:631)
==1737== by 0x545CD86: cib_native_dispatch_internal (cib_native.c:120)
==1737== by 0x589D71F: mainloop_gio_callback (mainloop.c:522)
==1737== by 0x81DAF0D: g_main_context_dispatch (in /lib64/libglib-2.0.so.0.2200.5)
==1737== by 0x81DE937: ??? (in /lib64/libglib-2.0.so.0.2200.5)
==1737== by 0x81DED54: g_main_loop_run (in /lib64/libglib-2.0.so.0.2200.5)
==1737== by 0x404D9D: crmd_init (main.c:139)
==1737== by 0x7A1BCDC: (below main) (in /lib64/libc-2.12.so)
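For what it's worth, the sendto() warning above usually just means a struct containing compiler padding (or unset fields) was handed to send(); zeroing the buffer before filling it makes valgrind happy. A minimal standalone illustration, using a hypothetical struct rather than libqb's actual wire format:

```c
#include <string.h>

/* Hypothetical handshake struct, standing in for whatever libqb sends in
 * qb_ipcc_us_setup_connect(); the compiler-inserted padding after 'magic'
 * is what valgrind flags as uninitialised when the struct is sent raw. */
struct handshake {
    char magic;          /* followed by padding bytes up to the next long */
    long max_msg_size;
};

/* Zeroing the whole struct first initialises the padding too, which
 * silences "Syscall param ... points to uninitialised byte(s)". */
void init_handshake(struct handshake *h, long max_msg_size)
{
    memset(h, 0, sizeof(*h));
    h->magic = 'Q';
    h->max_msg_size = max_msg_size;
}
```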
And I suspect some leakage here:
==1737== 51,314 (1,968 direct, 49,346 indirect) bytes in 82 blocks are definitely lost in loss record 248 of 249
==1737== at 0x4C25A28: calloc (vg_replace_malloc.c:467)
==1737== by 0x5CB77E1: lrmd_key_value_add (lrmd_client.c:100)
==1737== by 0x41E199: do_lrm_rsc_op (lrm.c:1701)
==1737== by 0x420779: do_lrm_invoke (lrm.c:1450)
==1737== by 0x40C423: send_msg_via_ipc (messages.c:939)
==1737== by 0x40D2DF: relay_message (messages.c:454)
==1737== by 0x40F827: route_message (messages.c:322)
==1737== by 0x411A6E: crmd_ha_msg_filter (callbacks.c:96)
==1737== by 0x40547F: crmd_ais_dispatch (corosync.c:108)
==1737== by 0x5673569: pcmk_cpg_deliver (corosync.c:551)
==1737== by 0x65B0EAE: cpg_dispatch (cpg.c:412)
==1737== by 0x5670981: pcmk_cpg_dispatch (corosync.c:577)
==1737== by 0x589D966: mainloop_gio_callback (mainloop.c:533)
==1737== by 0x81DAF0D: g_main_context_dispatch (in /lib64/libglib-2.0.so.0.2200.5)
==1737== by 0x81DE937: ??? (in /lib64/libglib-2.0.so.0.2200.5)
==1737== by 0x81DED54: g_main_loop_run (in /lib64/libglib-2.0.so.0.2200.5)
==1737== by 0x404D9D: crmd_init (main.c:139)
==1737== by 0x7A1BCDC: (below main) (in /lib64/libc-2.12.so)
lrmd seems not to clean up gio channels properly:
==1734== 8,946 (8,520 direct, 426 indirect) bytes in 71 blocks are definitely lost in loss record 147 of 152
==1734== at 0x4C26FDE: malloc (vg_replace_malloc.c:236)
==1734== by 0x71997D2: g_malloc (in /lib64/libglib-2.0.so.0.2200.5)
==1734== by 0x71C67F4: g_io_channel_unix_new (in /lib64/libglib-2.0.so.0.2200.5)
==1734== by 0x4E52470: mainloop_add_fd (mainloop.c:660)
==1734== by 0x5067870: services_os_action_execute (services_linux.c:456)
==1734== by 0x403AA6: lrmd_rsc_dispatch (lrmd.c:696)
==1734== by 0x4E513C2: crm_trigger_dispatch (mainloop.c:105)
==1734== by 0x7190F0D: g_main_context_dispatch (in /lib64/libglib-2.0.so.0.2200.5)
==1734== by 0x7194937: ??? (in /lib64/libglib-2.0.so.0.2200.5)
==1734== by 0x7194D54: g_main_loop_run (in /lib64/libglib-2.0.so.0.2200.5)
==1734== by 0x402427: main (main.c:302)
==1734==
==1734== 8,946 (8,520 direct, 426 indirect) bytes in 71 blocks are definitely lost in loss record 148 of 152
==1734== at 0x4C26FDE: malloc (vg_replace_malloc.c:236)
==1734== by 0x71997D2: g_malloc (in /lib64/libglib-2.0.so.0.2200.5)
==1734== by 0x71C67F4: g_io_channel_unix_new (in /lib64/libglib-2.0.so.0.2200.5)
==1734== by 0x4E52470: mainloop_add_fd (mainloop.c:660)
==1734== by 0x50678AE: services_os_action_execute (services_linux.c:465)
==1734== by 0x403AA6: lrmd_rsc_dispatch (lrmd.c:696)
==1734== by 0x4E513C2: crm_trigger_dispatch (mainloop.c:105)
==1734== by 0x7190F0D: g_main_context_dispatch (in /lib64/libglib-2.0.so.0.2200.5)
==1734== by 0x7194937: ??? (in /lib64/libglib-2.0.so.0.2200.5)
==1734== by 0x7194D54: g_main_loop_run (in /lib64/libglib-2.0.so.0.2200.5)
==1734== by 0x402427: main (main.c:302)
==1734==
==1734== 65,394 (62,280 direct, 3,114 indirect) bytes in 519 blocks are definitely lost in loss record 151 of 152
==1734== at 0x4C26FDE: malloc (vg_replace_malloc.c:236)
==1734== by 0x71997D2: g_malloc (in /lib64/libglib-2.0.so.0.2200.5)
==1734== by 0x71C67F4: g_io_channel_unix_new (in /lib64/libglib-2.0.so.0.2200.5)
==1734== by 0x4E52470: mainloop_add_fd (mainloop.c:660)
==1734== by 0x5067870: services_os_action_execute (services_linux.c:456)
==1734== by 0x50676B4: recurring_action_timer (services_linux.c:212)
==1734== by 0x719161A: ??? (in /lib64/libglib-2.0.so.0.2200.5)
==1734== by 0x7190F0D: g_main_context_dispatch (in /lib64/libglib-2.0.so.0.2200.5)
==1734== by 0x7194937: ??? (in /lib64/libglib-2.0.so.0.2200.5)
==1734== by 0x7194D54: g_main_loop_run (in /lib64/libglib-2.0.so.0.2200.5)
==1734== by 0x402427: main (main.c:302)
==1734==
==1734== 65,394 (62,280 direct, 3,114 indirect) bytes in 519 blocks are definitely lost in loss record 152 of 152
==1734== at 0x4C26FDE: malloc (vg_replace_malloc.c:236)
==1734== by 0x71997D2: g_malloc (in /lib64/libglib-2.0.so.0.2200.5)
==1734== by 0x71C67F4: g_io_channel_unix_new (in /lib64/libglib-2.0.so.0.2200.5)
==1734== by 0x4E52470: mainloop_add_fd (mainloop.c:660)
==1734== by 0x50678AE: services_os_action_execute (services_linux.c:465)
==1734== by 0x50676B4: recurring_action_timer (services_linux.c:212)
==1734== by 0x719161A: ??? (in /lib64/libglib-2.0.so.0.2200.5)
==1734== by 0x7190F0D: g_main_context_dispatch (in /lib64/libglib-2.0.so.0.2200.5)
==1734== by 0x7194937: ??? (in /lib64/libglib-2.0.so.0.2200.5)
==1734== by 0x7194D54: g_main_loop_run (in /lib64/libglib-2.0.so.0.2200.5)
==1734== by 0x402427: main (main.c:302)
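These records are consistent with the usual GIOChannel pitfall: g_io_channel_unix_new() returns a channel with one reference owned by the caller, and attaching a mainloop watch takes a reference of its own, so the creator must g_io_channel_unref() after adding the watch or the channel outlives the watch and leaks. A toy refcount sketch of that convention (a stand-in type, not GLib itself):

```c
#include <stdlib.h>

/* Toy stand-in for GIOChannel reference counting. In real GLib the calls
 * are g_io_channel_unix_new() / g_io_channel_ref() / g_io_channel_unref(). */
typedef struct {
    int refs;
} channel_t;

/* A new channel starts with one reference, owned by the creator. */
channel_t *channel_new(void)
{
    channel_t *c = calloc(1, sizeof(*c));
    c->refs = 1;
    return c;
}

/* The mainloop watch takes its own reference. */
void channel_ref(channel_t *c)
{
    c->refs++;
}

/* Returns 1 when the last reference is dropped and the channel is freed.
 * If the creator never drops its reference after attaching the watch,
 * this point is never reached and the channel leaks. */
int channel_unref(channel_t *c)
{
    if (--c->refs == 0) {
        free(c);
        return 1;
    }
    return 0;
}
```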
>
> The only one I'm really concerned about is the lrmd.
>
>>
>> Should I do a full cluster restart, or is a rolling one OK?
>
> To have the sysconfig values take effect? Either.
To reliably catch all possible leaks, which depend on the execution path.
>
>>
>>>
>>> first to rule out glib's funky allocator.
>>>
>>>>
>>>> I can send CIB contents if needed.
>>>>
>>>> Vladislav
>>>>
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
>>>