[Pacemaker] High load issues

Fri Feb 5 02:59:50 EST 2010

> But generally I believe this test case is invalid.

I might agree here that this test case does not necessarily reproduce
what happened on my production system (unfortunately I do not know for
sure what happened there; the dev who caused it only tells me he ran
some stupid SQL statement and even executed it several times in
parallel), but I do not think the test case is invalid. If there is an
OOM situation on a node and the local pacemaker therefore can't do its
job anymore (I base this statement on the various lrmd "cannot allocate
memory" logs), this is a case the cluster should be able to recover from.

What I saw while doing this test was that the bad node discovered
failures on the running IP and MySQL resources, scheduled the recovery,
but never managed to complete it.

I think it was lmb who suggested "periodic health-checks" on the
pacemaker layer. If pacemaker on $good had periodically tried to talk to
pacemaker on $bad, then it might have seen that $bad does not respond
and might have done something about it. Just my theory though.
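
To make the theory concrete, here is a minimal sketch of the bookkeeping
such a periodic health check would need. It is not Pacemaker code and
uses no Pacemaker API: the class name, the method names and the
30-second timeout are all assumptions made up for illustration, and how
probes are actually sent or what is done about a silent peer (fencing,
for instance) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class PeerHealthMonitor:
    """Hypothetical helper: remembers when each peer last answered a probe."""

    timeout: float  # seconds of silence before a peer counts as unresponsive
    last_reply: dict = field(default_factory=dict)

    def record_reply(self, peer: str, now: float) -> None:
        # Call this whenever a peer answers a health probe.
        self.last_reply[peer] = now

    def unresponsive(self, now: float) -> list:
        # Peers whose most recent reply is older than the timeout.
        return sorted(
            peer
            for peer, seen in self.last_reply.items()
            if now - seen > self.timeout
        )


# Both nodes answer at t=0; only "good" still answers at t=40.
monitor = PeerHealthMonitor(timeout=30.0)
monitor.record_reply("good", now=0.0)
monitor.record_reply("bad", now=0.0)
monitor.record_reply("good", now=40.0)
silent_peers = monitor.unresponsive(now=45.0)  # only "bad" has been silent too long
```

Passing the time in explicitly keeps the sketch deterministic; a real
check would read a monotonic clock and would also have to decide how
many missed probes justify acting on $bad.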

Opinions?

Regards
Dominik