[Pacemaker] High load issues
Dominik Klein
dk at in-telegence.net
Thu Feb 4 15:09:53 UTC 2010
Hi people,
I'll take the risk of annoying you, but I really think this should not
be forgotten.
If there is high load on a node, the cluster seems to have problems
recovering from that. I'd expect the cluster to recognize that a node is
unresponsive, stonith it and start services elsewhere.
By unresponsive I mean that one can neither use the cluster's services
nor ssh into the node.
I am not sure whether this is an issue in Pacemaker (if I understand
correctly, beekhof seems to think it is not) or Corosync (likewise,
sdake seems to think it is not), or maybe a configuration/thinking error
on my side (which it might well be).
Anyway, attached you will find an hb_report which covers the startup of
the cluster nodes and then what the cluster does when there is high load
and no memory left. When I killed the load-producing processes, the
cluster cleaned things up almost immediately.
At the very least I had expected that, once crm_mon showed "FAILED"
status, failover would happen after the configured stop timeouts expired
(120s max in my case), but it did not.
What I did to produce load:
* run several "md5sum $file" jobs on 1 GB files
* run several heavy SQL statements on large tables
* saturate the NIC using netcat -l on the busy node, fed from
/dev/urandom via netcat -w on another node
* start a fork-bomb script which does "while (true); do bash $0; done;"
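In case anyone wants to reproduce this, the list above can be sketched
as the following shell functions. Nothing runs until a function is
called, and the file name, peer hostname, and port are my own
placeholders, not values from the attached report:

```shell
#!/bin/sh
# Sketch of the load generators described above.
# BIGFILE and PEER are hypothetical placeholders.

BIGFILE=${BIGFILE:-/tmp/bigfile}   # a ~1 GB file
PEER=${PEER:-othernode}            # hostname of the second node

# CPU/disk load: several parallel md5sum runs over the large file
cpu_load() {
    for i in 1 2 3 4; do
        md5sum "$BIGFILE" >/dev/null &
    done
}

# NIC saturation: listener on the busy node, sender on the peer
net_listen() { nc -l -p 9999 >/dev/null; }             # on the busy node
net_send()   { nc -w 60 "$PEER" 9999 </dev/urandom; }  # on the other node

# The fork bomb quoted above -- never run this on a machine you need
fork_bomb() { while true; do bash "$0"; done; }
```

The SQL load is omitted since it depends entirely on the local schema;
any sufficiently heavy queries against large tables will do.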
Used versions:
corosync 1.2.0
pacemaker 1.0.7
64-bit packages from clusterlabs for openSUSE 11.1
If you need more information, want me to try patches, whatever, please
let me know.
Regards
Dominik
-------------- next part --------------
A non-text attachment was scrubbed...
Name: highload.tar.bz2
Type: application/x-bzip
Size: 109175 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20100204/45f06346/attachment-0001.bin>