[Pacemaker] High load issues
Dominik Klein
dk at in-telegence.net
Thu Feb 4 15:09:53 UTC 2010
Hi people,
I'll take the risk of annoying you, but I really think this should not
be forgotten.
If there is high load on a node, the cluster seems to have problems
recovering from that. I'd expect the cluster to recognize that a node is
unresponsive, stonith it and start services elsewhere.
By unresponsive I mean that one can neither use the cluster's services
nor ssh into the node.
I am not sure whether this is an issue in Pacemaker (if I understand
correctly, beekhof seems to think it is not) or Corosync (likewise,
sdake seems to think it is not), or maybe a configuration/thinking error
on my side (which it might well be).
Anyway, attached you will find an hb_report which covers the startup of
the cluster nodes and then what the cluster does when there is high load
and no memory left. When I killed the load-producing processes, the
cluster cleaned things up almost immediately.
At the very least I had expected that, once crm_mon showed "FAILED"
status, failover would happen after the configured stop timeouts expired
(120s max in my case), but it did not.
What I did to produce load:
* run several "md5sum $file" jobs on 1 GB files
* run several heavy SQL statements on large tables
* saturate the NIC using netcat -l on the busy node, fed from
/dev/urandom via netcat -w on another node
* start a fork-bomb script which does "while (true); do bash $0; done;"
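In case anyone wants to reproduce this, the list above can be sketched
as the following shell functions. Nothing runs until a function is
called, and the file name, peer hostname, and port are my own
placeholders, not values from the attached report:

```shell
#!/bin/sh
# Sketch of the load generators described above.
# BIGFILE and PEER are hypothetical placeholders.

BIGFILE=${BIGFILE:-/tmp/bigfile}   # a ~1 GB file
PEER=${PEER:-othernode}            # hostname of the second node

# CPU/disk load: several parallel md5sum runs over the large file
cpu_load() {
    for i in 1 2 3 4; do
        md5sum "$BIGFILE" >/dev/null &
    done
}

# NIC saturation: listener on the busy node, sender on the peer
net_listen() { nc -l -p 9999 >/dev/null; }             # on the busy node
net_send()   { nc -w 60 "$PEER" 9999 </dev/urandom; }  # on the other node

# The fork bomb quoted above -- never run this on a machine you need
fork_bomb() { while true; do bash "$0"; done; }
```

The SQL load is omitted since it depends entirely on the local schema;
any sufficiently heavy queries against large tables will do.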
Used versions:
corosync 1.2.0
pacemaker 1.0.7
64-bit packages from clusterlabs for openSUSE 11.1
If you need more information, want me to try patches, whatever, please
let me know.
Regards
Dominik
-------------- next part --------------
A non-text attachment was scrubbed...
Name: highload.tar.bz2
Type: application/x-bzip
Size: 109175 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20100204/45f06346/attachment-0001.bin>