[ClusterLabs] Antw: Re: Memory leak in crm_mon ?

Mon Aug 17 06:59:28 UTC 2015

> On 17 Aug 2015, at 4:35 pm, Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de> wrote:
> 
>>>> Andrew Beekhof <andrew at beekhof.net> wrote on 17.08.2015 at 00:08 in
> message
> <FF78BE4F-173C-4A74-A989-92EA6C540A6B at beekhof.net>:
> 
>>> On 16 Aug 2015, at 9:41 pm, Attila Megyeri <amegyeri at minerva-soft.com> wrote:
>>> 
>>> Hi Andrew,
>>> 
>>> I managed to isolate / reproduce the issue. You might want to take a look,
>>> as it might be present in 1.1.12 as well.
>>> 
>>> I monitor my cluster from putty, mainly this way:
>>> - I have a putty (Windows client) session that connects via SSH to the box and
>>>   authenticates using a public key as a non-root user.
>>> - It immediately sends a "sudo crm_mon -Af" command, so with a single click
>>>   I have a nice view of what the cluster is doing.
>> 
>> Perhaps add -1 to the option list.
>> The root cause seems to be that closing the putty window doesn’t actually
>> kill the process running inside it.
> 
> Sorry, the root cause seems to be that crm_mon happily writes to a closed
> filehandle (I guess). If crm_mon handled that error by exiting the loop,
> there would be no need for putty to kill any process.

No, if you want a process to die you need to kill it.
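
To make the behaviour Ulrich describes concrete (leave the refresh loop as soon as writing to the output fails), here is a minimal Python sketch. It is not crm_mon's code and says nothing about how crm_mon is actually structured; monitor_loop, render and max_iterations are hypothetical names invented for this example.

```python
import os
import time


def monitor_loop(out_fd, render, interval=0.0, max_iterations=None):
    """Write render() to out_fd repeatedly.

    Returns the number of successful refreshes once the descriptor stops
    accepting writes (or once max_iterations is reached, if given).
    """
    refreshes = 0
    while max_iterations is None or refreshes < max_iterations:
        try:
            os.write(out_fd, render().encode())
        except OSError:
            # EPIPE, EIO, EBADF and similar: nobody is reading any more,
            # so leave the loop instead of retrying forever.
            break
        refreshes += 1
        time.sleep(interval)
    return refreshes
```

Called with the write end of a pipe whose read end has already been closed, the function returns after the first failed write instead of spinning. Whether a monitoring tool should exit on its own in that situation or wait to be killed is exactly the point under discussion here.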

> 
>> 
>>> 
>>> Whenever I close this putty window (terminate the app), the crm_mon process
>>> goes to 100% CPU usage, starts to leak, consumes all memory within a few
>>> hours, and then destroys the whole cluster.
>>> This does not happen if I leave crm_mon with Ctrl-C.
>>> 
>>> I can reproduce this 100% with crm_mon 1.1.10, with the mainstream ubuntu
>>> trusty packages.
>>> This might be related to how sudo executes crm_mon, and what it signals to
>>> crm_mon when it gets terminated.
>>> 
>>> Now I know what I need to pay attention to in order to avoid this problem,
>>> but you might want to check whether this issue is still present.
>>> 
>>> 
>>> Thanks,
>>> Attila 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Attila Megyeri [mailto:amegyeri at minerva-soft.com] 
>>> Sent: Friday, August 14, 2015 12:40 AM
>>> To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
>>> Subject: Re: [ClusterLabs] Memory leak in crm_mon ?
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Andrew Beekhof [mailto:andrew at beekhof.net] 
>>> Sent: Tuesday, August 11, 2015 2:49 AM
>>> To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
>>> Subject: Re: [ClusterLabs] Memory leak in crm_mon ?
>>> 
>>> 
>>>> On 10 Aug 2015, at 5:33 pm, Attila Megyeri <amegyeri at minerva-soft.com> wrote:
>>>> 
>>>> Hi!
>>>> 
>>>> We are building a new cluster on top of pacemaker/corosync, and several times
>>>> during the past days we noticed that "crm_mon -Af" used up all the
>>>> memory+swap and caused high CPU usage. Killing the process solves the issue.
>>>> 
>>>> We are using the binary package versions available in the latest ubuntu
>>>> trusty, namely:
>>>> 
>>>> crmsh                1.2.5+hg1034-1ubuntu4
>>>> pacemaker            1.1.10+git20130802-1ubuntu2.3
>>>> pacemaker-cli-utils  1.1.10+git20130802-1ubuntu2.3
>>>> corosync             2.3.3-1ubuntu1
>>>> 
>>>> Kernel is            3.13.0-46-generic
>>>> 
>>>> Looking back at some "atop" data, the CPU went to 100% many times during the
>>>> last couple of days, at various times, most often at exactly midnight
>>>> (strange).
>>>> 
>>>> 08.05     14:00
>>>> 08.06     21:41
>>>> 08.07     00:00
>>>> 08.07     00:00
>>>> 08.08     00:00
>>>> 08.09     06:27
>>>> 
>>>> I checked the corosync log and syslog, but did not find any correlation
>>>> between the entries in the logs around the specific times.
>>>> For most of the time, the node running crm_mon was the DC as well, not
>>>> running any resources (e.g. a pairless node for quorum).
>>>> 
>>>> 
>>>> We have another running system where everything works perfectly, even though
>>>> it is almost the same:
>>>> 
>>>> crmsh                1.2.5+hg1034-1ubuntu4
>>>> pacemaker            1.1.10+git20130802-1ubuntu2.1
>>>> pacemaker-cli-utils  1.1.10+git20130802-1ubuntu2.1
>>>> corosync             2.3.3-1ubuntu1
>>>> 
>>>> Kernel is            3.13.0-8-generic
>>>> 
>>>> 
>>>> Is this perhaps a known issue?
>>> 
>>> Possibly, that version is over 2 years old.
>>> 
>>>> Any hints?
>>> 
>>> Getting something a little more recent would be the best place to start
>>> 
>>> Thanks Andrew,
>>> 
>>> I tried to upgrade to 1.1.12 using the packages available at
>>> https://launchpad.net/~syseleven-platform . In the first attempt I upgraded a
>>> single node to see how it works out, but I ended up with errors like
>>> 
>>> Could not establish cib_rw connection: Connection refused (111)
>>> 
>>> I have disabled the firewall, no changes. The node appears to be running but
>>> does not see any of the other nodes. On the other nodes I see this node as
>>> UNCLEAN. (I assume corosync is fine, but pacemaker is not.)
>>> I use udpu for the transport.
>>> 
>>> Am I doing something wrong? I tried to look for some howtos on upgrade, but
>>> the only thing I found was the rather outdated
>>> http://clusterlabs.org/wiki/Upgrade
>>> 
>>> Could you please direct me to some howto/guide on how to perform the upgrade?
>>> 
>>> Or am I facing some compatibility issue, so I should extract the whole cib,
>>> upgrade all nodes and reconfigure the cluster from scratch? (The cluster
>>> is meant to go live in 2 days... :) )
>>> 
>>> Thanks a lot in advance
>>> 
>>> 
>>> 
>>> 
>>>> 
>>>> Thanks!
>>>> _______________________________________________
>>>> Users mailing list: Users at clusterlabs.org 
>>>> http://clusterlabs.org/mailman/listinfo/users 
>>>> 
>>>> Project Home: http://www.clusterlabs.org Getting started: 
>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>>>> Bugs: http://bugs.clusterlabs.org 
>>> 
>>> 