[Pacemaker] High load issues
Dejan Muhamedagic
dejanmm at fastmail.fm
Fri Feb 5 11:20:25 UTC 2010
Hi,
On Fri, Feb 05, 2010 at 08:59:50AM +0100, Dominik Klein wrote:
> > But generally I believe this test case is invalid.
>
> I might agree that this test case does not necessarily reproduce what
> happened on my production system (unfortunately I do not know for sure
> what happened there; the dev who caused it just tells me he ran some
> stupid SQL statement, and even executed it several times in parallel),
> but I do not think the test case is invalid. If there is an OOM
> situation on a node and the local pacemaker therefore can't do its job
> anymore (I base this statement on the various lrmd "cannot allocate
> memory" logs), that is a case the cluster should be able to recover from.
Yes, I'd say the cluster should be able to deal with a node in
just about any state. This time, or so it seems, the problem was
that corosync ran as a realtime process while crmd did not, so
corosync kept answering its peers even though crmd was starved.
Perhaps corosync should watch the local processes, i.e. have
some kind of IPC heartbeat ...
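Roughly something like this, as a made-up sketch only (not actual
corosync code; the socketpair, timeout, and escalation are all just
for illustration): a watcher pings the monitored process and, if no
echo comes back within the timeout, declares it unhealthy:

/* Hypothetical local IPC heartbeat sketch: a watcher pings a
 * monitored process over a socketpair and escalates if no reply
 * arrives in time. Not corosync's API; names are made up. */
#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#define HB_TIMEOUT_MS 2000             /* how long to wait for a reply */

int main(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        perror("socketpair");
        return 1;
    }

    if (fork() == 0) {                 /* the monitored process */
        char c;
        close(sv[0]);
        while (read(sv[1], &c, 1) == 1)
            write(sv[1], &c, 1);       /* echo each ping back */
        _exit(0);
    }

    close(sv[1]);                      /* the watcher */
    for (;;) {
        char c = 'p';
        struct pollfd pfd = { .fd = sv[0], .events = POLLIN };

        write(sv[0], &c, 1);           /* send a ping */
        if (poll(&pfd, 1, HB_TIMEOUT_MS) <= 0 ||
            read(sv[0], &c, 1) != 1) {
            /* peer hung (e.g. stuck in OOM): escalate here */
            fprintf(stderr, "heartbeat lost, peer unhealthy\n");
            return 1;
        }
        sleep(1);                      /* heartbeat interval */
    }
}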
> What I saw while doing this test was that the bad node discovered
> failures on the running ip and mysql resources, scheduled the recovery,
> but never managed to recover.
>
> I think it was lmb who suggested "periodic health-checks" at the
> pacemaker layer. If pacemaker on $good had periodically tried to talk
> to pacemaker on $bad, it might have seen that $bad does not respond
> and might have done something about it. Just my theory though.
... or the higher-level heartbeats you suggest here. There is
still, however, the problem of false positives. At any rate, the
user should have a way to specify at what point a node is no
longer considered usable.
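For instance, something along these lines (again only a hypothetical
sketch; check_peer() stands in for whatever "pacemaker on $good talks
to pacemaker on $bad" would really be, and MAX_MISSES is the kind of
user-tunable knob I mean): only after several consecutive missed
replies is the node declared unusable, so a single slow reply does
not trigger recovery:

/* Hypothetical node health-check with a user-set threshold to
 * reduce false positives. None of this is real Pacemaker API. */
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define CHECK_INTERVAL_S 5   /* how often we probe the peer */
#define MAX_MISSES       3   /* user-set: misses before giving up */

/* Returns true if the peer answered in time; a stub here. */
static bool check_peer(const char *node)
{
    (void)node;
    return true;             /* replace with a real probe */
}

int main(void)
{
    int misses = 0;

    for (;;) {
        if (check_peer("bad-node")) {
            misses = 0;      /* a healthy reply resets the count */
        } else if (++misses >= MAX_MISSES) {
            /* node exceeded the user's threshold: recover/fence it */
            fprintf(stderr, "node unusable after %d misses\n", misses);
            return 1;
        }
        sleep(CHECK_INTERVAL_S);
    }
}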
Thanks,
Dejan
> Opinions?
>
> Regards
> Dominik
>