[Pacemaker] Speeding up startup after migration

Mon Apr 1 13:28:40 EDT 2013

01.04.2013 20:09, David Vossel wrote:
> ----- Original Message -----
>> From: "Vladislav Bogdanov" <bubble at hoster-ok.com>
>> To: pacemaker at oss.clusterlabs.org
>> Sent: Monday, April 1, 2013 10:35:39 AM
>> Subject: Re: [Pacemaker] Speeding up startup after migration
>>
>> 01.04.2013 17:28, David Vossel wrote:
>>>
>>> ----- Original Message -----
>>>> From: "Vladislav Bogdanov" <bubble at hoster-ok.com>
>>>> To: pacemaker at oss.clusterlabs.org
>>>> Sent: Friday, March 29, 2013 2:03:27 AM
>>>> Subject: Re: [Pacemaker] Speeding up startup after migration
>>>>
>>>> 29.03.2013 03:31, Andrew Beekhof wrote:
>>>>> On Fri, Mar 29, 2013 at 4:12 AM, Benjamin Kiessling
>>>>> <mittagessen at l.unchti.me> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> we've got a small pacemaker cluster running which controls an
>>>>>> active/passive router. On this cluster we've got a semi-large (~30)
>>>>>> number of primitives which are grouped together. On migration it takes
>>>>>> quite a long time until each resource is brought up again because they
>>>>>> are started sequentially. Is there a way to speed up the process,
>>>>>> ideally to execute these resource agents in parallel? They are fully
>>>>>> independent so the order in which they finish is of no concern.
>>>>>
>>>>> I'm guessing you have them in a group?  "Don't do that" and they will
>>>>> fail over in parallel.
>>>>
>>>> Does the current lrmd implementation have a batch-limit like cluster-glue's
>>>> one had? I can't find where it is.
>>>
>>> The batch-limit option is still around, but has nothing to do with
>>> the lrmd. It does limit how many resources can execute in parallel, but at
>>> the transition engine level rather than the lrmd.
>>
>> Yep, I know that option; it has been there for a very long time.
>>
>> So, if I understand correctly, the new lrmd runs as many simultaneous jobs
>> as possible. Unfortunately, in some circumstances this would result in
>> high node load and timeouts. Is there a way to somehow limit that load?
> 
> Isn't that what the batch-limit option does? Or are you saying you
> want a batch-limit-type option that is node-specific? Why are you
> concerned about this behavior living in the LRMD instead of at the
> transition processing level?

There was a limit in glue's lrmd, and I think it was there for a reason.
I do not know which behavior is better; they are just different.

> 
> I believe if we do any batch-limiting behavior at the LRMD
> level we're going to run into problems with the transition timers in the crmd.

Did that change in crmd after the lrmd replacement?

> The LRMD needs to always perform the actions it is given as soon as possible.

Yes, but... heavy load on a host (e.g. because 150 CPU-intensive
operations run in parallel) may cause monitor timeouts, then
resource restarts, then stop timeouts and fencing.
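
To illustrate the kind of per-node limit being discussed, here is a
rough standalone sketch. It is not Pacemaker or lrmd code: the name
run_operations, its parameters, and the sleep that stands in for a
resource agent call are all assumptions made up for this example. It
only shows that capping the number of workers bounds how many
operations run at once, at the cost of queued operations starting later.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_operations(num_ops, limit, op_seconds=0.01):
    """Run num_ops dummy operations, at most `limit` at a time.

    Hypothetical helper for illustration only. Returns the highest
    number of operations observed running at the same moment.
    """
    lock = threading.Lock()
    running = 0
    peak = 0

    def operation():
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(op_seconds)  # stand-in for a resource agent call
        with lock:
            running -= 1

    # max_workers is the concurrency cap; extra operations wait in a queue.
    with ThreadPoolExecutor(max_workers=limit) as pool:
        futures = [pool.submit(operation) for _ in range(num_ops)]
        for future in futures:
            future.result()  # re-raise any error from the operation
    return peak


if __name__ == "__main__":
    print("peak, limit 150:", run_operations(150, 150))
    print("peak, limit 10:", run_operations(150, 10))
```

Note that operations waiting in the queue still age against whatever
timer the caller started when it submitted them, which is the
interaction with the crmd transition timers raised above.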