[Pacemaker] Pacemaker still may include memory leaks

Mon May 27 07:08:48 UTC 2013

27.05.2013 04:20, Yuichi SEINO wrote:
> Hi,
> 
> 2013/5/24 Vladislav Bogdanov <bubble at hoster-ok.com>:
>> 24.05.2013 06:34, Andrew Beekhof wrote:
>>> Any help figuring out where the leaks might be would be very much appreciated :)
>>
>> One (and the only) suspect is unfortunately crmd itself.
>> It has private heap grown from 2708 to 3680 kB.
>>
>> All other relevant differences are in qb shm buffers, which are
>> controlled and may grow until they reach configured size.
>>
>> @Yuichi
>> I would recommend running under valgrind on a testing cluster to
>> figure out whether that is a memleak (lost memory) or some history data
>> (referenced memory). The latter may be a logical memleak though. You may
>> look in /etc/sysconfig/pacemaker for details.
> 
> I ran valgrind for about 2 days, and I have attached the valgrind
> output from the ACT node and the SBY node.

I do not see any "direct" memory leaks (repeating 'definitely-lost'
allocations) there.
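
To compare several valgrind logs without reading each by hand, the LEAK SUMMARY lines can be pulled out with a small script. This is only a sketch: it assumes the usual memcheck log layout ("==PID==    definitely lost: N bytes in M blocks"), and the names SUMMARY_RE and leak_summary are made up for this example.

```python
import re

# Matches LEAK SUMMARY lines in the usual memcheck layout, for example:
#   ==1234==    definitely lost: 48 bytes in 3 blocks
# (the line above is illustrative, not taken from the attached logs)
SUMMARY_RE = re.compile(
    r"^==\d+==\s+(definitely lost|indirectly lost|possibly lost|still reachable):"
    r"\s+([\d,]+) bytes in ([\d,]+) blocks",
    re.MULTILINE,
)


def leak_summary(log_text):
    """Return {kind: (bytes, blocks)} for the summary lines found in log_text.

    If the log contains several summaries, later values overwrite earlier ones.
    """
    summary = {}
    for kind, nbytes, nblocks in SUMMARY_RE.findall(log_text):
        # valgrind prints thousands separators, so strip the commas first
        summary[kind] = (int(nbytes.replace(",", "")), int(nblocks.replace(",", "")))
    return summary
```

Running it over logs taken at different times shows whether 'definitely lost' stays flat while 'still reachable' keeps growing.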

So what we see is probably one of:
* Cache/history/etc., which grows up to some limit (or is expired at
some point in time).
* Unlimited/non-expiring lists/hashes of data structures, which are
correctly freed at exit (e.g. like dlm_controld has (or had?) for a
debugging buffer, or like the glibc resolver had in EL3). This cannot be
caught with valgrind if you use it in the standard way.
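
A rough outside check for which of the two cases applies is whether the growth rate flattens over time. The sketch below computes the average growth per hour between samples, using the crmd Private totals from the smaps figures quoted further down (the times are approximate, "about 30h" and "about 70h"). The function name growth_rates is made up for this example.

```python
def growth_rates(samples):
    """samples: list of (hours_since_start, size) pairs, sorted by time.

    Returns the average growth per hour for each consecutive interval.
    """
    rates = []
    for (t0, s0), (t1, s1) in zip(samples, samples[1:]):
        rates.append((s1 - s0) / (t1 - t0))
    return rates


# crmd Private totals (start, about 30h later, about 70h later) as quoted
# in this thread; the sample times are approximate.
private = [(0, 3836), (30, 4188), (70, 4768)]
rates = growth_rates(private)
```

With these three samples the rate is about 11.7 per hour over the first 30 hours and 14.5 per hour over the next 40, so this data alone does not show the growth levelling off. A longer run would be needed to tell the two cases apart.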

I believe we have the former. To prove that, it would be very
interesting to run crmd under the valgrind *debugger* (--vgdb=yes|full)
for a long enough period of time (2-3 weeks) and periodically get the
memory allocation state from there (with the 'monitor leak_check full
reachable any' gdb command). I have wanted to do that for a long time,
but unfortunately have not had enough spare time to even try it
(although I have run valgrind on other programs that way).

This is described in valgrind documentation:
http://valgrind.org/docs/manual/manual-core-adv.html#manual-core-adv.gdbserver

We probably do not need to specify '--vgdb-error=0', because we do not
need to install watchpoints at the start (and we do not need or want to
immediately connect to crmd with gdb to tell it to continue); we just
need to periodically get the status of memory allocations (a
stop/leak_check/cont sequence). That should probably be done quickly,
so that crmd is not stopped for long and the rest of pacemaker does not
see it as hung. Again, I have not tried that, and I do not know if it is
even possible to do that with crmd.
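
The periodic check could be scripted roughly as below. This is an untested sketch, not a verified procedure for crmd: it uses the standalone vgdb utility shipped with valgrind to send the monitor command instead of an interactive gdb session, and it assumes crmd is already running under valgrind with --vgdb=yes and that vgdb returns only once the command has completed. The function names, the interval and the timeout are made up for illustration.

```python
import subprocess
import time


def vgdb_command(pid):
    """Build the standalone vgdb invocation for one leak snapshot."""
    return ["vgdb", "--pid=%d" % pid, "leak_check", "full", "reachable", "any"]


def snapshot(pid, timeout=60):
    """Request one leak report and return how long the request took (seconds)."""
    started = time.monotonic()
    # check=True raises if vgdb fails; timeout guards against a stuck request
    subprocess.run(vgdb_command(pid), check=True, timeout=timeout)
    return time.monotonic() - started


def poll(pid, interval_s=6 * 3600, count=4):
    """Take `count` snapshots, `interval_s` seconds apart; return the durations."""
    durations = []
    for i in range(count):
        durations.append(snapshot(pid))
        if i + 1 < count:
            time.sleep(interval_s)
    return durations
```

The returned durations give a rough idea of how long each request takes, which matters for whether the rest of pacemaker would notice crmd being stopped. Where the report itself ends up (vgdb's output or the valgrind log) depends on the valgrind output settings.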

And, as pacemaker heavily utilizes glib, which has its own memory
allocator (slices), it is better to switch it to 'standard' malloc/free
for debugging with the G_SLICE=always-malloc env var.
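
Since crmd is started by pacemaker rather than by hand, it may be worth confirming that the variable actually reached the process. A minimal sketch, assuming Linux /proc and sufficient permissions to read the file (the function names are made up for this example):

```python
def parse_environ(raw):
    """Parse the NUL-separated contents of /proc/<pid>/environ into a dict."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" in entry:
            key, _, value = entry.partition(b"=")
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def uses_plain_malloc(pid):
    """True if the process was started with G_SLICE=always-malloc."""
    # /proc/<pid>/environ reflects the environment at process start
    with open("/proc/%d/environ" % pid, "rb") as f:
        return parse_environ(f.read()).get("G_SLICE") == "always-malloc"
```

Calling uses_plain_malloc() with the pid of the crmd process (as root) should return True if the setting took effect.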

Lastly, I did memleak checks on a 'static' cluster (i.e. no operations
except monitors are performed) with ~1.1.8, and did not find any leaks.
It would be interesting to see if that is also true for an 'active' one,
which starts/stops resources, handles failures, etc.

> 
> Sincerely,
> Yuichi
> 
>>
>>>
>>> Also, the measurements are in pages... could you run "getconf PAGESIZE" and let us know the result?
>>> I'm guessing 4096 bytes.
>>>
>>> On 23/05/2013, at 5:47 PM, Yuichi SEINO <seino.cluster2 at gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I retry the test after we updated packages to the latest tag and OS.
>>>> glue and booth is latest.
>>>>
>>>> * Environment
>>>> OS:RHEL 6.4
>>>> cluster-glue:latest(commit:2755:8347e8c9b94f) +
>>>> patch[detail:http://www.gossamer-threads.com/lists/linuxha/dev/85787]
>>>> resource-agent:v3.9.5
>>>> libqb:v0.14.4
>>>> corosync:v2.3.0
>>>> pacemaker:v1.1.10-rc2
>>>> crmsh:v1.2.5
>>>> booth:latest(commit:67e1208973de728958432aaba165766eac1ce3a0)
>>>>
>>>> * Test procedure
>>>> We regularly switch a ticket. The previous test was done the same way.
>>>> Also, there was no memory leak when we tested pacemaker-1.1 before
>>>> pacemaker used libqb.
>>>>
>>>> * Result
>>>> As a result, I think that crmd may cause the memory leak.
>>>>
>>>> crmd smaps(a total of each addresses)
>>>> In detail, we attached smaps of  start and end. And, I recorded smaps
>>>> every 1 minutes.
>>>>
>>>> Start
>>>> RSS: 7396
>>>> SHR(Shared_Clean+Shared_Dirty):3560
>>>> Private(Private_Clean+Private_Dirty):3836
>>>>
>>>> Interval(about 30h later)
>>>> RSS:18464
>>>> SHR:14276
>>>> Private:4188
>>>>
>>>> End(about 70h later)
>>>> RSS:19104
>>>> SHR:14336
>>>> Private:4768
>>>>
>>>> Sincerely,
>>>> Yuichi
>>>>
>>>> 2013/5/15 Yuichi SEINO <seino.cluster2 at gmail.com>:
>>>>> Hi,
>>>>>
>>>>> I ran the test for about two days.
>>>>>
>>>>> Environment
>>>>>
>>>>> OS:RHEL 6.3
>>>>> pacemaker-1.1.9-devel (commit 138556cb0b375a490a96f35e7fbeccc576a22011)
>>>>> corosync-2.3.0
>>>>> cluster-glue latest+patch(detail:http://www.gossamer-threads.com/lists/linuxha/dev/85787)
>>>>> libqb- 0.14.4
>>>>>
>>>>> There may be a memory leak in crmd and lrmd. I regularly got rss of ps.
>>>>>
>>>>> start-up
>>>>> crmd:5332
>>>>> lrmd:3625
>>>>>
>>>>> interval(about 30h later)
>>>>> crmd:7716
>>>>> lrmd:3744
>>>>>
>>>>> ending(about 60h later)
>>>>> crmd:8336
>>>>> lrmd:3780
>>>>>
>>>>> I still don't run a test that pacemaker-1.1.10-rc2 use. So, I will run its test.
>>>>>
>>>>> Sincerely,
>>>>> Yuichi
>>>>>
>>>>> --
>>>>> Yuichi SEINO
>>>>> METROSYSTEMS CORPORATION
>>>>> E-mail:seino.cluster2 at gmail.com
>>>>
>>>>
>>>>
>>>> --
>>>> Yuichi SEINO
>>>> METROSYSTEMS CORPORATION
>>>> E-mail:seino.cluster2 at gmail.com
>>>> <smaps_log.tar.gz>
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
> 
> 
> 
> --
> Yuichi SEINO
> METROSYSTEMS CORPORATION
> E-mail:seino.cluster2 at gmail.com