[Pacemaker] lrmd Memory Usage

Andrew Beekhof andrew at beekhof.net
Wed May 7 02:20:09 CEST 2014


On 6 May 2014, at 7:47 pm, Greg Murphy <greg.murphy at gamesparks.com> wrote:

> Here you go - I’ve only run lrmd for 30 minutes since installing the debug
> package, but hopefully that’s enough - if not, let me know and I’ll do a
> longer capture.
> 

I'll keep looking, but almost everything so far seems to be from or related to the g_dbus API:

...
==37625==    by 0x6F20E30: g_dbus_proxy_new_for_bus_sync (in /usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.3800.1)
==37625==    by 0x507B90B: get_proxy (upstart.c:66)
==37625==    by 0x507B9BF: upstart_init (upstart.c:85)
==37625==    by 0x507C88E: upstart_job_exec (upstart.c:429)
==37625==    by 0x10CE03: lrmd_rsc_dispatch (lrmd.c:879)
==37625==    by 0x4E5F112: crm_trigger_dispatch (mainloop.c:105)
==37625==    by 0x58A13B5: g_main_context_dispatch (in /lib/x86_64-linux-gnu/libglib-2.0.so.0.3800.1)
==37625==    by 0x58A1707: ??? (in /lib/x86_64-linux-gnu/libglib-2.0.so.0.3800.1)
==37625==    by 0x58A1B09: g_main_loop_run (in /lib/x86_64-linux-gnu/libglib-2.0.so.0.3800.1)
==37625==    by 0x10AC3A: main (main.c:314)

That call path is hit every time an upstart job operation runs (i.e. on every recurring monitor of an upstart resource).

There were several problems with that API and we removed all use of it in 1.1.11.
I'm quite confident that most, if not all, of the memory issues would go away if you upgraded.
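
To make the pattern concrete, here is a minimal C sketch (illustrative only, not the
actual upstart.c code) of how creating a GDBusProxy on every call can accumulate memory
if the proxy is never released, alongside a cache-once alternative. The Upstart bus
name, object path and interface strings below are just examples for the sketch.

/* Build (assuming the gio-2.0 development headers are installed):
 *   gcc gdbus_sketch.c -o gdbus_sketch $(pkg-config --cflags --libs gio-2.0)
 */
#include <gio/gio.h>

static gboolean
check_job_per_call(const char *job_path)
{
    GError *error = NULL;

    /* A new proxy (plus GDBus bookkeeping) is allocated on every call.
     * If the g_object_unref() below is forgotten, or the proxy is stashed
     * in state that is never freed, memory grows on every recurring
     * monitor of an upstart resource. */
    GDBusProxy *proxy = g_dbus_proxy_new_for_bus_sync(
        G_BUS_TYPE_SYSTEM, G_DBUS_PROXY_FLAGS_NONE, NULL,
        "com.ubuntu.Upstart",              /* example bus name */
        job_path,                          /* example object path */
        "com.ubuntu.Upstart0_6.Job",       /* example interface */
        NULL, &error);

    if (proxy == NULL) {
        g_clear_error(&error);
        return FALSE;
    }
    /* ... call methods / read properties on the proxy here ... */
    g_object_unref(proxy);
    return TRUE;
}

/* Cache-once pattern: create the proxy the first time it is needed,
 * reuse it for every subsequent operation, release it at shutdown. */
static GDBusProxy *cached_proxy = NULL;

static GDBusProxy *
get_cached_proxy(void)
{
    if (cached_proxy == NULL) {
        cached_proxy = g_dbus_proxy_new_for_bus_sync(
            G_BUS_TYPE_SYSTEM, G_DBUS_PROXY_FLAGS_NONE, NULL,
            "com.ubuntu.Upstart", "/com/ubuntu/Upstart",
            "com.ubuntu.Upstart0_6", NULL, NULL);
    }
    return cached_proxy;
}

int
main(void)
{
    check_job_per_call("/com/ubuntu/Upstart/jobs/cron");  /* example job path */
    if (get_cached_proxy() != NULL) {
        g_object_unref(cached_proxy);   /* released once, at exit */
    }
    return 0;
}

The cache-once variant avoids both the repeated allocations and the synchronous
proxy setup on every monitor; removing the g_dbus usage altogether, as 1.1.11 does,
sidesteps the problem entirely.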


> 
> 
> On 06/05/2014 10:08, "Andrew Beekhof" <andrew at beekhof.net> wrote:
> 
>> Oh, any chance you could install the debug packages? It will make the
>> output even more useful :-)
>> 
>> On 6 May 2014, at 7:06 pm, Andrew Beekhof <andrew at beekhof.net> wrote:
>> 
>>> 
>>> On 6 May 2014, at 6:05 pm, Greg Murphy <greg.murphy at gamesparks.com>
>>> wrote:
>>> 
>>>> Attached are the valgrind outputs from two separate runs of lrmd with
>>>> the
>>>> suggested variables set. Do they help narrow the issue down?
>>> 
>>> They do somewhat.  I'll investigate.  But much of the memory is still
>>> reachable:
>>> 
>>> ==26203==    indirectly lost: 17,945,950 bytes in 642,546 blocks
>>> ==26203==      possibly lost: 2,805 bytes in 60 blocks
>>> ==26203==    still reachable: 26,104,781 bytes in 544,782 blocks
>>> ==26203==         suppressed: 8,652 bytes in 176 blocks
>>> ==26203== Reachable blocks (those to which a pointer was found) are not
>>> shown.
>>> ==26203== To see them, rerun with: --leak-check=full
>>> --show-reachable=yes
>>> 
>>> Could you add --show-reachable=yes to the VALGRIND_OPTS variable?
>>> 
>>>> 
>>>> 
>>>> Thanks
>>>> 
>>>> Greg
>>>> 
>>>> 
>>>> On 02/05/2014 03:01, "Andrew Beekhof" <andrew at beekhof.net> wrote:
>>>> 
>>>>> 
>>>>> On 30 Apr 2014, at 9:01 pm, Greg Murphy <greg.murphy at gamesparks.com>
>>>>> wrote:
>>>>> 
>>>>>> Hi
>>>>>> 
>>>>>> I'm running a two-node Pacemaker cluster on Ubuntu Saucy (13.10),
>>>>>> kernel 3.11.0-17-generic and the Ubuntu Pacemaker package, version
>>>>>> 1.1.10+git20130802-1ubuntu1.
>>>>> 
>>>>> The problem is that I have no way of knowing what code is/isn't
>>>>> included
>>>>> in '1.1.10+git20130802-1ubuntu1'.
>>>>> You could try setting the following in your environment before
>>>>> starting
>>>>> pacemaker though
>>>>> 
>>>>> # Variables for running child daemons under valgrind and/or checking
>>>>> for
>>>>> memory problems
>>>>> G_SLICE=always-malloc
>>>>> MALLOC_PERTURB_=221 # or 0
>>>>> MALLOC_CHECK_=3     # or 0,1,2
>>>>> PCMK_valgrind_enabled=lrmd
>>>>> VALGRIND_OPTS="--leak-check=full --trace-children=no --num-callers=25
>>>>> --log-file=/var/lib/pacemaker/valgrind-%p
>>>>> --suppressions=/usr/share/pacemaker/tests/valgrind-pcmk.suppressions
>>>>> --gen-suppressions=all"
>>>>> 
>>>>> 
>>>>>> The cluster is configured with a DRBD master/slave set and then a
>>>>>> failover resource group containing MySQL (along with its DRBD
>>>>>> filesystem) and a Zabbix Proxy and Agent.
>>>>>> 
>>>>>> Since I built the cluster around two months ago I've noticed that on
>>>>>> the active node the memory footprint of lrmd gradually grows to
>>>>>> quite a significant size. The cluster was last restarted three weeks
>>>>>> ago, and now lrmd has over 1GB of mapped memory on the active node
>>>>>> and
>>>>>> only 151MB on the passive node. Current excerpts from
>>>>>> /proc/PID/status
>>>>>> are:
>>>>>> 
>>>>>> Active node
>>>>>> VmPeak:  1146740 kB
>>>>>> VmSize:  1146740 kB
>>>>>> VmLck:         0 kB
>>>>>> VmPin:         0 kB
>>>>>> VmHWM:    267680 kB
>>>>>> VmRSS:    188764 kB
>>>>>> VmData:  1065860 kB
>>>>>> VmStk:       136 kB
>>>>>> VmExe:        32 kB
>>>>>> VmLib:     10416 kB
>>>>>> VmPTE:      2164 kB
>>>>>> VmSwap:   822752 kB
>>>>>> 
>>>>>> Passive node
>>>>>> VmPeak:   220832 kB
>>>>>> VmSize:   155428 kB
>>>>>> VmLck:         0 kB
>>>>>> VmPin:         0 kB
>>>>>> VmHWM:      4568 kB
>>>>>> VmRSS:      3880 kB
>>>>>> VmData:    74548 kB
>>>>>> VmStk:       136 kB
>>>>>> VmExe:        32 kB
>>>>>> VmLib:     10416 kB
>>>>>> VmPTE:       172 kB
>>>>>> VmSwap:        0 kB
>>>>>> 
>>>>>> During the last week or so I've taken a couple of snapshots of
>>>>>> /proc/PID/smaps on the active node, and the heap in particular stands
>>>>>> out as growing (I have the full outputs captured if they'll help):
>>>>>> 
>>>>>> 20140422
>>>>>> 7f92e1578000-7f92f218b000 rw-p 00000000 00:00 0
>>>>>> [heap]
>>>>>> Size:             274508 kB
>>>>>> Rss:              180152 kB
>>>>>> Pss:              180152 kB
>>>>>> Shared_Clean:          0 kB
>>>>>> Shared_Dirty:          0 kB
>>>>>> Private_Clean:         0 kB
>>>>>> Private_Dirty:    180152 kB
>>>>>> Referenced:       120472 kB
>>>>>> Anonymous:        180152 kB
>>>>>> AnonHugePages:         0 kB
>>>>>> Swap:              91568 kB
>>>>>> KernelPageSize:        4 kB
>>>>>> MMUPageSize:           4 kB
>>>>>> Locked:                0 kB
>>>>>> VmFlags: rd wr mr mw me ac
>>>>>> 
>>>>>> 
>>>>>> 20140423
>>>>>> 7f92e1578000-7f92f305e000 rw-p 00000000 00:00 0
>>>>>> [heap]
>>>>>> Size:             289688 kB
>>>>>> Rss:              184136 kB
>>>>>> Pss:              184136 kB
>>>>>> Shared_Clean:          0 kB
>>>>>> Shared_Dirty:          0 kB
>>>>>> Private_Clean:         0 kB
>>>>>> Private_Dirty:    184136 kB
>>>>>> Referenced:        69748 kB
>>>>>> Anonymous:        184136 kB
>>>>>> AnonHugePages:         0 kB
>>>>>> Swap:             103112 kB
>>>>>> KernelPageSize:        4 kB
>>>>>> MMUPageSize:           4 kB
>>>>>> Locked:                0 kB
>>>>>> VmFlags: rd wr mr mw me ac
>>>>>> 
>>>>>> 20140430
>>>>>> 7f92e1578000-7f92fc01d000 rw-p 00000000 00:00 0
>>>>>> [heap]
>>>>>> Size:             436884 kB
>>>>>> Rss:              140812 kB
>>>>>> Pss:              140812 kB
>>>>>> Shared_Clean:          0 kB
>>>>>> Shared_Dirty:          0 kB
>>>>>> Private_Clean:       744 kB
>>>>>> Private_Dirty:    140068 kB
>>>>>> Referenced:        43600 kB
>>>>>> Anonymous:        140812 kB
>>>>>> AnonHugePages:         0 kB
>>>>>> Swap:             287392 kB
>>>>>> KernelPageSize:        4 kB
>>>>>> MMUPageSize:           4 kB
>>>>>> Locked:                0 kB
>>>>>> VmFlags: rd wr mr mw me ac
>>>>>> 
>>>>>> I noticed in the release notes for 1.1.10-rc1
>>>>>> (https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.10-rc1)
>>>>>> that there was work done to fix "crmd: lrmd: stonithd: fixed memory
>>>>>> leaks", but I'm not sure which particular bug this was related to.
>>>>>> (And those fixes should be in the version I'm running anyway.)
>>>>>> 
>>>>>> I've also spotted a few memory leak fixes in
>>>>>> https://github.com/beekhof/pacemaker, but I'm not sure whether they
>>>>>> relate to my issue (assuming I have a memory leak and this isn't
>>>>>> expected behaviour).
>>>>>> 
>>>>>> Is there additional debugging that I can perform to check whether I
>>>>>> have a leak, or is there enough evidence to justify upgrading to
>>>>>> 1.1.11?
>>>>>> 
>>>>>> Thanks in advance
>>>>>> 
>>>>>> Greg Murphy
>>>>> 
>>>> 
>>>> <lrmd.tgz>
>>> 
>> 
> 
> <lrmd-dbg.tgz>


