[Pacemaker] Speeding up startup after migration

David Vossel dvossel at redhat.com
Tue Apr 2 17:02:01 EDT 2013


----- Original Message -----
> From: "Lars Marowsky-Bree" <lmb at suse.com>
> To: pacemaker at oss.clusterlabs.org
> Sent: Monday, April 1, 2013 5:21:53 PM
> Subject: Re: [Pacemaker] Speeding up startup after migration
> 
> On 2013-04-01T13:09:14, David Vossel <dvossel at redhat.com> wrote:
> 
> > > So, if I understand correctly, new lrmd runs as many simultaneous jobs
> > > as possible. Unfortunately, in some circumstances this would result in
> > > high node load and timeouts. Is there a way to somehow limit that
> > > load?
> > Isn't that what the batch-limit option does?  or are you saying you want a
> > batch limit type option that is node specific? Why are you concerned about
> > this behavior living in the LRMD instead of at the transition processing
> > level?
> > 
> > I believe if we do any batch limiting type behavior at the LRMD level we're
> > going to run into problems with the transition timers in the crmd.  The
> > LRMD needs to always perform the actions it is given as soon as possible.
> 
> Seriously, folks, the LRM rewrite may turn out not to be the best
> example of pacemaker's attention to detail ;-)
>

such is any re-write of poorly designed code ;-)  <--- I included the smiley so my jab is acceptable and not in poor taste just like yours! :D <--- I included this smiley because I think it looks funny.

> Yes, the previous LRM had a per-node concurrency limit. This avoided
> overloading the nodes via IO, which is why it was added. (And also
> smoothed out spikes in the monitoring calls should they happen to
> coincide.) Default limit of parallel executions was 4 or half the number
> of CPU cores, if memory serves.
> 
> This turned out to actually improve performance (since it avoided said
> spikes), and avoid timeouts. (While it is true that, given a perfect
> scheduler, the total runtime of N_1..100 being kicked off all at once
> should be equal to N_1..100 being kicked off serially, it's quite
> likely that doing the former will mean at least a few of those 100
> operations hitting their *individual* timeouts at the LRM level.)

I'm convinced this is useful.
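
To make the idea concrete, here is a rough sketch of a per-node concurrency cap. It is not the lrmd code (which is C); the names run_limited and max_children are made up for illustration, and the thread pool is only a stand-in for however the daemon actually spawns its children. It runs a batch of jobs with at most max_children in flight and reports the peak concurrency it observed.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_limited(jobs, max_children):
    """Run callables with at most max_children executing at once.

    Returns (results in submission order, peak number of jobs seen running).
    Hypothetical helper for illustration only, not part of Pacemaker.
    """
    lock = threading.Lock()
    running = 0
    peak = 0

    def wrapped(job):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        try:
            return job()
        finally:
            with lock:
                running -= 1

    # The pool size is the cap: extra jobs wait in the queue instead of
    # all hitting the node's CPU and IO at the same moment.
    with ThreadPoolExecutor(max_workers=max_children) as pool:
        results = list(pool.map(wrapped, jobs))
    return results, peak


if __name__ == "__main__":
    # 20 fake "operations" that each take a little while to finish.
    jobs = [lambda i=i: (time.sleep(0.01), i)[1] for i in range(20)]
    results, peak = run_limited(jobs, max_children=4)
    print(len(results), "jobs done, peak concurrency", peak)
```

Whether an operation's timeout should start when it is queued or when it actually begins executing is a separate question that this sketch does not address.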

I'll add PCMK_MAX_CHILDREN to the sysconfig documentation.  To be backwards compatible I'll have the lrmd internally interpret your LRMD_MAX_CHILDREN environment variable as well.

Sound reasonable?
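
Roughly what I have in mind for the lookup, sketched in Python rather than the real C. The assumptions here are mine and not settled: PCMK_MAX_CHILDREN wins if both variables are set, invalid or non-positive values are ignored, and the fallback default of 4 is just borrowed from the old limit Lars recalled. The function name max_children is hypothetical.

```python
import os

# Assumption: fallback taken from the old LRM limit mentioned in this
# thread, not a confirmed default for the new lrmd.
DEFAULT_MAX_CHILDREN = 4


def max_children(env=None):
    """Resolve the concurrency cap from the environment.

    Checks PCMK_MAX_CHILDREN first, then the legacy LRMD_MAX_CHILDREN.
    The precedence order is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    for name in ("PCMK_MAX_CHILDREN", "LRMD_MAX_CHILDREN"):
        raw = env.get(name)
        if raw is None:
            continue
        try:
            value = int(raw)
        except ValueError:
            continue  # ignore garbage and try the next variable
        if value > 0:
            return value
    return DEFAULT_MAX_CHILDREN


if __name__ == "__main__":
    print(max_children({"LRMD_MAX_CHILDREN": "8"}))
```

With that, an existing setup that only exports LRMD_MAX_CHILDREN keeps working unchanged.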

> 
> The TE doesn't have enough knowledge to enforce this, since it doesn't
> know if monitors get scheduled. The transition timers weren't really a
> problem, since they had some leeway accounted for.
> 
> If we don't have this functionality right now anymore, I do believe we
> need it back.
> 
> I do seem to recall that at the time, Andrew preferred it to be
> implemented at the LRM level, because it avoided a more complex
> transition graph logic (e.g., the batch-limit functionality on a
> per-node level, and doing something smart about monitors); but my memory
> is hazy on this detail.
> 
> Nowadays, since we have the migration-threshold anyway, it may be
> possible to do something about it cleanly in the TE, but that still
> would leave the monitors unsolved ...
>
> 
> Regards,
>     Lars
> 
> (PS: 1.1.8 really isn't turning out to be my favorite release. If I
> wasn't afraid it'd be received as a rant, I'd try to write up a post-mortem
> from my/our perspective to see what might be avoidable in the future.)

We should open this discussion at some point.  As long as it is constructive criticism I doubt it will be perceived as a rant.

I've mentioned to Andrew that we might need to consider doing release candidates. This would at least put some of the responsibility back on the community to verify the release with us before we officially tag it.  We definitely test our code, but it is impossible for us to test everyone's possible deployment use case.

-- Vossel

> 
> --
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
> HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 



