[Pacemaker] Speeding up startup after migration

Lars Marowsky-Bree lmb at suse.com
Wed Apr 3 08:58:01 UTC 2013


On 2013-04-02T17:02:01, David Vossel <dvossel at redhat.com> wrote:

> > Seriously, folks, the LRM rewrite may turn out not to be the best
> > example of pacemaker's attention to detail ;-)
> such is any re-write of poorly designed code ;-)  <--- I included the smiley so my jab is acceptable and not in poor taste just like yours! :D <--- I included this smiley because I think it looks funny.

Heh. Well, I admit that the above is the toned-down remainder of a rant.

I was used to every pacemaker release being an almost boring improvement
over the previous one; that set my expectations for 1.1.8, and for the
effort we thought we could get away with before shipping it. When I saw
1.1.8 shaping up, I already knew we couldn't ship it as a maintenance
update, but I was (and still am) taken by surprise by just how much
effort it took to get it back into shape. From where I stand (speaking
as the guy who probably has to deal with the largest production subset
of the pacemaker community), it was the worst pacemaker release ever.

I realize that the goal of most of the rewrites (libqb, lrmd, handling
of anonymous clones, fencing, lots of logging message changes, ...)
that went into 1.1.8 was to clean up the code to make it more
maintainable for the future. And that's a good thing. But in the
short-term, the fall-out wasn't nice. If you're on the side of the
rewrite equation that doesn't seem to feel any of the benefits but
mostly pain, it does create a certain tension ;-)

It also showed a couple of areas that apparently *aren't* well protected
by regression tests in pacemaker / the cluster stack, I guess.

I also realize that one of the problems is that, as soon as we realized
that we couldn't ship 1.1.8 as-is, we were forced to shift our effort to
selective backports (since we had to deal with issues from customers in
production, whom we couldn't upgrade). That meant that instead of
feeding back to 1.1.8 immediately, we came late to the party with
testing. But the only way I can see to avoid that is keeping the changes
in pacemaker flowing at a more constant and lower rate, giving us time
to integrate and test them. 1.1.8 blew our capacity, and is probably one
of the few pacemaker releases we skipped shipping, and the first we
skipped intentionally.

And yes, I did feel frustration; that didn't seem to be a nice thing to
do to your production deployments. (I know RHT as a company doesn't care
much, because RHT doesn't support pacemaker officially yet.)

So, basically, my frustration stems from the fact that (1.1.8 excepted,
from my PoV) pacemaker has an excellent, continuously improving release
quality, and that was what the plans and expectations were based on
;-)

> I'll add PCMK_MAX_CHILDREN to the sysconfig documentation.  To be backwards compatible I'll have the lrmd internally interpret your LRMD_MAX_CHILDREN environment variable as well.
> 
> sound reasonable?

That makes perfect sense, thank you.
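
To make sure we mean the same thing by "backwards compatible", here is a
rough sketch of the lookup order I would expect. It is Python purely for
illustration and is not the actual lrmd code; the function name, the
default of 4, and the rule that the new variable wins when both are set
are my assumptions, not something taken from the sources.

```python
import os


def max_children(env=None, default=4):
    """Pick the child limit, preferring the new variable name.

    Assumptions (not from the lrmd sources): PCMK_MAX_CHILDREN takes
    precedence over the legacy LRMD_MAX_CHILDREN, unparsable or
    non-positive values are skipped, and the default of 4 is a placeholder.
    """
    env = os.environ if env is None else env
    for name in ("PCMK_MAX_CHILDREN", "LRMD_MAX_CHILDREN"):
        value = env.get(name)
        if value is None:
            continue
        try:
            parsed = int(value)
        except ValueError:
            # Not a number: fall through to the next candidate.
            continue
        if parsed > 0:
            return parsed
    return default
```

With only the legacy LRMD_MAX_CHILDREN set, existing sysconfig files
keep working; once PCMK_MAX_CHILDREN is set too, it takes over.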


> We should open this discussion at some point.  As long as it is constructive criticism I doubt it will be perceived as a rant.

Well, emotions are likely to creep into it in one or two paragraphs.
Hopefully no swear words in public. ;-)

> I've mentioned to Andrew that we might need to consider doing release
> candidates. This would at least put some of the responsibility back on
> the community to verify the release with us before we officially tag
> it.  We definitely test our code, but it is impossible for us to test
> everyone's possible deployment use-case.

See above. We usually can do that, but 1.1.8 was too much for us to
stomach, and too much for a "smooth upgrade" from 1.1.7 in production.

And, frankly, it took us several months of testing to get where we are
now (and yes, I am *very* grateful that once we reported the issues, we
received a lot of help from you and Andrew et al); we never needed as
much time and effort to test a pacemaker release. (We seriously
considered not moving to 1.1.8 at all and continuing SLE HA 11 as
1.1.7+backports, but by then it was already too late for us to pull
back.)

And previously, pacemaker got away with making such changes because the
PE has *excellent* regression tests, and that was where the majority of
changes happened.

On the plus side, 1.1.8 was a great learning opportunity. ;-)


Regards,
    Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde




