[Pacemaker] Speeding up startup after migration

Mon Apr 1 18:21:53 EDT 2013

On 2013-04-01T13:09:14, David Vossel <dvossel at redhat.com> wrote:

> > So, if I understand correctly, the new lrmd runs as many simultaneous jobs
> > as possible. Unfortunately, in some circumstances this would result in
> > high node load and timeouts. Is there a way to somehow limit that load?
> Isn't that what the batch-limit option does? Or are you saying you want a
> batch-limit type option that is node specific? Why are you concerned about
> this behavior living in the LRMD instead of at the transition processing
> level?
>
> I believe if we do any batch-limiting type behavior at the LRMD level we're
> going to run into problems with the transition timers in the crmd. The LRMD
> needs to always perform the actions it is given as soon as possible.

Seriously, folks, the LRM rewrite may turn out not to be the best
example of pacemaker's attention to detail ;-)

Yes, the previous LRM had a per-node concurrency limit. This avoided
overloading the nodes via IO, which is why it was added. (And also
smoothed out spikes in the monitoring calls should they happen to
coincide.) Default limit of parallel executions was 4 or half the number
of CPU cores, if memory serves.
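
For illustration only, a per-node limit of this kind could be sketched as
below. This is not the old LRM code: the names default_max_children and
run_limited are hypothetical, and reading "4 or half the number of CPU
cores" as "whichever is larger" is an assumption made for the example.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor


def default_max_children(cpu_count=None):
    """Assumed default: 4, or half the CPU cores if that is larger."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    return max(4, cpu_count // 2)


def run_limited(operations, limit):
    """Run all operations, but never more than `limit` at the same time.

    Returns (results in submission order, highest concurrency observed).
    """
    lock = threading.Lock()
    running = 0
    peak = 0

    def wrapped(op):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        try:
            return op()
        finally:
            with lock:
                running -= 1

    # The pool size is the concurrency limit; extra operations wait in its queue.
    with ThreadPoolExecutor(max_workers=limit) as pool:
        results = list(pool.map(wrapped, operations))
    return results, peak
```

Operations beyond the limit simply wait their turn, so a burst of starts or
coinciding monitors is spread out instead of hitting the node all at once.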

This turned out to actually improve performance (since it avoided said
spikes), and avoided timeouts. (While it is true that, given a perfect
scheduler, the total runtime of N_1..100 being kicked off all at once
should be equal to N_1..100 being kicked off serially, it's quite
likely that doing the former will mean at least a few of those 100
operations hitting their *individual* timeouts at the LRM level.)
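
A rough back-of-the-envelope model of that argument, as a sketch only. It
assumes an idealized node that completes one unit of work per second and
shares itself equally among whatever is running, identical operations, and
a timeout that is measured from the moment an operation is started. The
function name op_runtimes and the numbers (1 s of work, 20 s timeout) are
made up for the example.

```python
def op_runtimes(n_ops, work, limit):
    """Model a node shared equally among at most `limit` running operations.

    Returns (per-operation runtimes measured from each op's own start,
    total wall-clock time until the last operation finishes).
    """
    runtimes = []
    total = 0.0
    remaining = n_ops
    while remaining > 0:
        batch = min(limit, remaining)
        # Each op in the batch gets 1/batch of the node, so it takes work * batch.
        batch_time = work * batch
        runtimes.extend([batch_time] * batch)
        total += batch_time
        remaining -= batch
    return runtimes, total


TIMEOUT = 20.0  # hypothetical per-operation timeout, in seconds

unthrottled, unthrottled_total = op_runtimes(n_ops=100, work=1.0, limit=100)
throttled, throttled_total = op_runtimes(n_ops=100, work=1.0, limit=4)

timed_out_unthrottled = sum(t > TIMEOUT for t in unthrottled)
timed_out_throttled = sum(t > TIMEOUT for t in throttled)
```

In this model both runs finish after the same 100 seconds in total, but
each unthrottled operation takes 100 seconds on its own clock and blows the
20 second timeout, while with a limit of 4 each one takes 4 seconds. A real
node is messier, but the direction of the effect is the same.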

The TE doesn't have enough knowledge to enforce this, since it doesn't
know when recurring monitors get run by the LRM. The transition timers
weren't really a problem, since they had some leeway accounted for.

If we don't have this functionality right now anymore, I do believe we
need it back.

I do seem to recall that at the time, Andrew preferred it to be
implemented at the LRM level, because it avoided a more complex
transition graph logic (e.g., the batch-limit functionality on a
per-node level, and doing something smart about monitors); but my memory
is hazy on this detail.

Nowadays, since we have the migration-limit anyway, it may be
possible to do something about it cleanly in the TE, but that still
would leave the monitors unsolved ...

Regards,
    Lars

(PS: 1.1.8 really isn't turning out to be my favorite release. If I
wasn't afraid it'd be received as a rant, I'd try to write up a post-mortem
from my/our perspective to see what might be avoidable in the future.)

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde