[Pacemaker] lrmd Memory Usage

Wed May 7 10:31:30 UTC 2014

Thanks Andrew, much appreciated.

I’ll try upgrading to 1.1.11 and report back with how it goes.

On 07/05/2014 01:20, "Andrew Beekhof" <andrew at beekhof.net> wrote:

>
>On 6 May 2014, at 7:47 pm, Greg Murphy <greg.murphy at gamesparks.com> wrote:
>
>> Here you go - I’ve only run lrmd for 30 minutes since installing the debug
>> package, but hopefully that’s enough - if not, let me know and I’ll do a
>> longer capture.
>> 
>
>I'll keep looking, but almost everything so far seems to be from or
>related to the g_dbus API:
>
>...
>==37625==    by 0x6F20E30: g_dbus_proxy_new_for_bus_sync (in /usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.3800.1)
>==37625==    by 0x507B90B: get_proxy (upstart.c:66)
>==37625==    by 0x507B9BF: upstart_init (upstart.c:85)
>==37625==    by 0x507C88E: upstart_job_exec (upstart.c:429)
>==37625==    by 0x10CE03: lrmd_rsc_dispatch (lrmd.c:879)
>==37625==    by 0x4E5F112: crm_trigger_dispatch (mainloop.c:105)
>==37625==    by 0x58A13B5: g_main_context_dispatch (in /lib/x86_64-linux-gnu/libglib-2.0.so.0.3800.1)
>==37625==    by 0x58A1707: ??? (in /lib/x86_64-linux-gnu/libglib-2.0.so.0.3800.1)
>==37625==    by 0x58A1B09: g_main_loop_run (in /lib/x86_64-linux-gnu/libglib-2.0.so.0.3800.1)
>==37625==    by 0x10AC3A: main (main.c:314)
>
>Which is going to be called every time an upstart job is run (i.e. on
>every recurring monitor of an upstart resource).
>
>There were several problems with that API and we removed all use of it in
>1.1.11.
>I'm quite confident that most, if not all, of the memory issues would go
>away if you upgraded.
>
>
>> 
>> 
>> On 06/05/2014 10:08, "Andrew Beekhof" <andrew at beekhof.net> wrote:
>> 
>>> Oh, any chance you could install the debug packages? It will make the
>>> output even more useful :-)
>>> 
>>> On 6 May 2014, at 7:06 pm, Andrew Beekhof <andrew at beekhof.net> wrote:
>>> 
>>>> 
>>>> On 6 May 2014, at 6:05 pm, Greg Murphy <greg.murphy at gamesparks.com>
>>>> wrote:
>>>> 
>>>>> Attached are the valgrind outputs from two separate runs of lrmd with
>>>>> the
>>>>> suggested variables set. Do they help narrow the issue down?
>>>> 
>>>> They do somewhat.  I'll investigate.  But much of the memory is still
>>>> reachable:
>>>> 
>>>> ==26203==    indirectly lost: 17,945,950 bytes in 642,546 blocks
>>>> ==26203==      possibly lost: 2,805 bytes in 60 blocks
>>>> ==26203==    still reachable: 26,104,781 bytes in 544,782 blocks
>>>> ==26203==         suppressed: 8,652 bytes in 176 blocks
>>>> ==26203== Reachable blocks (those to which a pointer was found) are not shown.
>>>> ==26203== To see them, rerun with: --leak-check=full --show-reachable=yes
>>>> 
>>>> Could you add --show-reachable=yes to the VALGRIND_OPTS variable?
>>>> 
>>>>> 
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> Greg
>>>>> 
>>>>> 
>>>>> On 02/05/2014 03:01, "Andrew Beekhof" <andrew at beekhof.net> wrote:
>>>>> 
>>>>>> 
>>>>>> On 30 Apr 2014, at 9:01 pm, Greg Murphy <greg.murphy at gamesparks.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi
>>>>>>> 
>>>>>>> I’m running a two-node Pacemaker cluster on Ubuntu Saucy (13.10),
>>>>>>> kernel 3.11.0-17-generic and the Ubuntu Pacemaker package, version
>>>>>>> 1.1.10+git20130802-1ubuntu1.
>>>>>> 
>>>>>> The problem is that I have no way of knowing what code is/isn't included
>>>>>> in '1.1.10+git20130802-1ubuntu1'.
>>>>>> You could try setting the following in your environment before starting
>>>>>> pacemaker though:
>>>>>> 
>>>>>> # Variables for running child daemons under valgrind and/or checking for
>>>>>> # memory problems
>>>>>> G_SLICE=always-malloc
>>>>>> MALLOC_PERTURB_=221 # or 0
>>>>>> MALLOC_CHECK_=3     # or 0,1,2
>>>>>> PCMK_valgrind_enabled=lrmd
>>>>>> VALGRIND_OPTS="--leak-check=full --trace-children=no --num-callers=25 --log-file=/var/lib/pacemaker/valgrind-%p --suppressions=/usr/share/pacemaker/tests/valgrind-pcmk.suppressions --gen-suppressions=all"
>>>>>> 
>>>>>> 
>>>>>>> The cluster is configured with a DRBD master/slave set and then a
>>>>>>> failover resource group containing MySQL (along with its DRBD
>>>>>>> filesystem) and a Zabbix Proxy and Agent.
>>>>>>> 
>>>>>>> Since I built the cluster around two months ago I’ve noticed that on
>>>>>>> the active node the memory footprint of lrmd gradually grows to
>>>>>>> quite a significant size. The cluster was last restarted three weeks
>>>>>>> ago, and now lrmd has over 1GB of mapped memory on the active node and
>>>>>>> only 151MB on the passive node. Current excerpts from /proc/PID/status
>>>>>>> are:
>>>>>>> 
>>>>>>> Active node
>>>>>>> VmPeak:  1146740 kB
>>>>>>> VmSize:  1146740 kB
>>>>>>> VmLck:         0 kB
>>>>>>> VmPin:         0 kB
>>>>>>> VmHWM:    267680 kB
>>>>>>> VmRSS:    188764 kB
>>>>>>> VmData:  1065860 kB
>>>>>>> VmStk:       136 kB
>>>>>>> VmExe:        32 kB
>>>>>>> VmLib:     10416 kB
>>>>>>> VmPTE:      2164 kB
>>>>>>> VmSwap:   822752 kB
>>>>>>> 
>>>>>>> Passive node
>>>>>>> VmPeak:   220832 kB
>>>>>>> VmSize:   155428 kB
>>>>>>> VmLck:         0 kB
>>>>>>> VmPin:         0 kB
>>>>>>> VmHWM:      4568 kB
>>>>>>> VmRSS:      3880 kB
>>>>>>> VmData:    74548 kB
>>>>>>> VmStk:       136 kB
>>>>>>> VmExe:        32 kB
>>>>>>> VmLib:     10416 kB
>>>>>>> VmPTE:       172 kB
>>>>>>> VmSwap:        0 kB
>>>>>>> 
>>>>>>> During the last week or so I’ve taken a couple of snapshots of
>>>>>>> /proc/PID/smaps on the active node, and the heap particularly stands
>>>>>>> out as growing: (I have the full outputs captured if they’ll help)
>>>>>>> 
>>>>>>> 20140422
>>>>>>> 7f92e1578000-7f92f218b000 rw-p 00000000 00:00 0          [heap]
>>>>>>> Size:             274508 kB
>>>>>>> Rss:              180152 kB
>>>>>>> Pss:              180152 kB
>>>>>>> Shared_Clean:          0 kB
>>>>>>> Shared_Dirty:          0 kB
>>>>>>> Private_Clean:         0 kB
>>>>>>> Private_Dirty:    180152 kB
>>>>>>> Referenced:       120472 kB
>>>>>>> Anonymous:        180152 kB
>>>>>>> AnonHugePages:         0 kB
>>>>>>> Swap:              91568 kB
>>>>>>> KernelPageSize:        4 kB
>>>>>>> MMUPageSize:           4 kB
>>>>>>> Locked:                0 kB
>>>>>>> VmFlags: rd wr mr mw me ac
>>>>>>> 
>>>>>>> 
>>>>>>> 20140423
>>>>>>> 7f92e1578000-7f92f305e000 rw-p 00000000 00:00 0          [heap]
>>>>>>> Size:             289688 kB
>>>>>>> Rss:              184136 kB
>>>>>>> Pss:              184136 kB
>>>>>>> Shared_Clean:          0 kB
>>>>>>> Shared_Dirty:          0 kB
>>>>>>> Private_Clean:         0 kB
>>>>>>> Private_Dirty:    184136 kB
>>>>>>> Referenced:        69748 kB
>>>>>>> Anonymous:        184136 kB
>>>>>>> AnonHugePages:         0 kB
>>>>>>> Swap:             103112 kB
>>>>>>> KernelPageSize:        4 kB
>>>>>>> MMUPageSize:           4 kB
>>>>>>> Locked:                0 kB
>>>>>>> VmFlags: rd wr mr mw me ac
>>>>>>> 
>>>>>>> 20140430
>>>>>>> 7f92e1578000-7f92fc01d000 rw-p 00000000 00:00 0          [heap]
>>>>>>> Size:             436884 kB
>>>>>>> Rss:              140812 kB
>>>>>>> Pss:              140812 kB
>>>>>>> Shared_Clean:          0 kB
>>>>>>> Shared_Dirty:          0 kB
>>>>>>> Private_Clean:       744 kB
>>>>>>> Private_Dirty:    140068 kB
>>>>>>> Referenced:        43600 kB
>>>>>>> Anonymous:        140812 kB
>>>>>>> AnonHugePages:         0 kB
>>>>>>> Swap:             287392 kB
>>>>>>> KernelPageSize:        4 kB
>>>>>>> MMUPageSize:           4 kB
>>>>>>> Locked:                0 kB
>>>>>>> VmFlags: rd wr mr mw me ac
>>>>>>> 
>>>>>>> I noticed in the release notes for 1.1.10-rc1
>>>>>>> (https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.10-rc1)
>>>>>>> that there was work done to fix "crmd: lrmd: stonithd: fixed memory
>>>>>>> leaks" but I’m not sure which particular bug this was related to. (And
>>>>>>> those fixes should be in the version I’m running anyway).
>>>>>>> 
>>>>>>> I’ve also spotted a few memory leak fixes in
>>>>>>> https://github.com/beekhof/pacemaker, but I’m not sure whether they
>>>>>>> relate to my issue (assuming I have a memory leak and this isn’t
>>>>>>> expected behaviour).
>>>>>>> 
>>>>>>> Is there additional debugging that I can perform to check whether I
>>>>>>> have a leak, or is there enough evidence to justify upgrading to
>>>>>>> 1.1.11?
>>>>>>> 
>>>>>>> Thanks in advance
>>>>>>> 
>>>>>>> Greg Murphy
>>>>>>> _______________________________________________
>>>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>> 
>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>> Getting started:
>>>>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>> Bugs: http://bugs.clusterlabs.org
>>>>>> 
>>>>> 
>>>>> <lrmd.tgz>
>>>> 
>>> 
>> 
>> <lrmd-dbg.tgz>
>
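
For anyone wanting to track this kind of growth over time, here is a minimal sketch in Python of how the /proc/PID/status and smaps figures quoted in this thread could be parsed and compared. It is only an illustration and was not posted by anyone in the thread: the function names (parse_kb_fields, read_status, growth_kb_per_day) are hypothetical, and the sample values are copied from the active-node excerpt and the 20140422 and 20140430 heap snapshots above.

```python
import re


def parse_kb_fields(text):
    """Parse 'Name:   1234 kB' lines, as found in /proc/PID/status
    and /proc/PID/smaps, into a dict mapping name -> size in kB."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^(\w+):\s+(\d+)\s+kB\s*$", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def read_status(pid):
    """Read and parse /proc/<pid>/status on a live Linux system."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_kb_fields(handle.read())


def growth_kb_per_day(size_start_kb, size_end_kb, days):
    """Average growth in kB per day between two snapshots."""
    return (size_end_kb - size_start_kb) / days


# Values copied from the active-node /proc/PID/status excerpt above.
sample = (
    "VmPeak:\t 1146740 kB\n"
    "VmSize:\t 1146740 kB\n"
    "VmRSS:\t  188764 kB\n"
    "VmSwap:\t  822752 kB\n"
)
fields = parse_kb_fields(sample)

# Heap 'Size' from the smaps snapshots of 20140422 and 20140430 (8 days apart).
heap_growth = growth_kb_per_day(274508, 436884, 8)

print(fields["VmRSS"], "kB resident,", fields["VmSwap"], "kB swapped")
print("heap grew by about", heap_growth, "kB per day")
```

On a live node, read_status could be called periodically with the lrmd PID (for example from cron) and the results logged, which would show whether the growth stops after the upgrade.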