[Pacemaker] killing corosync leaves crmd, stonithd, lrmd, cib and attrd to hog up the cpu

Mon Nov 14 07:18:43 EST 2011

Hi,

On Mon, Nov 14, 2011 at 1:32 PM, ihjaz Mohamed <ihjazmohamed at yahoo.co.in> wrote:
> Hi All,
> As part of a robustness test for my cluster, I killed the corosync
> process using kill -9 <pid>. Afterwards, the pacemakerd service is
> stopped, but the crmd, stonithd, lrmd, cib and attrd processes are still
> running and hogging the CPU.

I have seen this kind of testing before, and I don't consider it a
recommended way of testing the cluster stack's "robustness". The
Pacemaker processes rely on corosync to function properly, so killing
corosync with kill -9 takes away the layer they depend on. Asking how
to "clean up" the processes afterwards suggests you need to read more
about how this cluster stack works.

For the Master Control Process (pacemakerd), how it works, and how it
relates to what you are experiencing, see
http://theclusterguy.clusterlabs.org/post/907043024/introducing-the-pacemaker-master-control-process-for

The essential guide you need is
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/

HTH,
Dan

>
> top - 06:26:51 up  2:01,  4 users,  load average: 12.04, 12.01, 11.98
> Tasks: 330 total,  13 running, 317 sleeping,   0 stopped,   0 zombie
> Cpu(s):  7.1%us, 17.1%sy,  0.0%ni, 75.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:   8015444k total,  4804412k used,  3211032k free,    54800k buffers
> Swap: 10256376k total,        0k used, 10256376k free,  1604464k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  2053 hacluste  RT   0 90492 3324 2476 R 100.0  0.0 113:40.61 crmd
>  2047 root      RT   0 81480 2108 1712 R 99.8  0.0 113:40.43 stonithd
>  2048 hacluste  RT   0 83404 5260 2992 R 99.8  0.1 113:40.90 cib
>  2050 hacluste  RT   0 85896 2388 1952 R 99.8  0.0 113:40.43 attrd
>  5018 root      20   0 8787m 345m  56m S  2.0  4.4   0:56.95 java
> 19017 root      20   0 15068 1252  796 R  2.0  0.0   0:00.01 top
>     1 root      20   0 19232 1444 1156 S  0.0  0.0   0:01.71 init
>     2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
>     3 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
>     4 root      20   0     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
>
>
> Is there a way to clean up these processes, or do I need to kill them one
> by one before respawning corosync?
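
On the literal question above, here is a rough sketch only, not a procedure
taken from the Pacemaker documentation. It assumes the leftover daemons carry
exactly the names listed in the quoted message (crmd, stonithd, lrmd, cib,
attrd) and that pkill from procps is installed. The helper name
cleanup_daemons and the DRY_RUN variable are made up for illustration. By
default it only prints the commands it would run.

```shell
#!/bin/sh
# Daemon names copied from the quoted message; adjust for your installation.
PACEMAKER_DAEMONS="crmd stonithd lrmd cib attrd"

# Hypothetical helper: signal every process whose name matches exactly.
# DRY_RUN defaults to 1, in which case nothing is killed and the
# commands are only printed.
cleanup_daemons() {
    signal="${1:-TERM}"
    for name in $PACEMAKER_DAEMONS; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "would run: pkill -$signal -x $name"
        else
            # pkill exits non-zero when nothing matched; that is not an error here.
            pkill "-$signal" -x "$name" || true
        fi
    done
}
```

Calling `cleanup_daemons KILL` shows what would be sent; setting DRY_RUN=0 and
running it as root would actually send the signal. Whether the node is in a
sane state afterwards is a separate question, and restarting the whole stack
in the proper order may well be the cleaner route.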

-- 
Dan Frincu
CCNA, RHCE