[Pacemaker] Speed up resource failover?
Dejan Muhamedagic
dejanmm at fastmail.fm
Wed Jan 12 17:20:34 UTC 2011
Hi,
On Wed, Jan 12, 2011 at 09:33:41AM -0700, Patrick H. wrote:
> Sent: Wed Jan 12 2011 09:25:39 GMT-0700 (Mountain Standard Time)
> From: Patrick H. <pacemaker at feystorm.net>
> To: pacemaker at oss.clusterlabs.org
> Subject: Re: [Pacemaker] Speed up resource failover?
>> Sent: Wed Jan 12 2011 01:56:31 GMT-0700 (Mountain Standard Time)
>> From: Lars Ellenberg <lars.ellenberg at linbit.com>
>> To: pacemaker at oss.clusterlabs.org
>> Subject: Re: [Pacemaker] Speed up resource failover?
>>> On Wed, Jan 12, 2011 at 09:30:41AM +0100, Robert van Leeuwen wrote:
>>>
>>>> -----Original message-----
>>>> To: pacemaker at oss.clusterlabs.org; From: Patrick H.
>>>> <pacemaker at feystorm.net>
>>>> Sent: Wed 12-01-2011 00:06
>>>> Subject: [Pacemaker] Speed up resource failover?
>>>> Attachment: inline.txt
>>>>
>>>>> As it is right now, pacemaker seems to take a long time (in
>>>>> computer terms) to fail over resources from one node to the
>>>>> other. Right now, I have 477 IPaddr2 resources evenly distributed
>>>>> between 2 nodes. When I put one node in standby, it takes
>>>>> approximately 5 minutes to move half of those from one node
>>>>> to the other. And before you ask, they're there because of SSL HTTP
>>>>> virtual hosting. I have no order rules, colocations or anything
>>>>> on those resources, so it should be able to migrate the entire list
>>>>> simultaneously, but it seems to do them sequentially. Is there
>>>>> any way to make it migrate the resources in parallel? Or at the
>>>>> very least speed it up?
>>>>>
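(Just for context: each of those VIPs is presumably a plain IPaddr2
primitive along these lines; the address, netmask and operation values
below are made up:

  primitive vip_55.63 ocf:heartbeat:IPaddr2 \
    params ip="192.168.55.63" cidr_netmask="24" \
    op monitor interval="30s" timeout="20s"

Nothing exotic, just a lot of them.)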
>>>> Patrick,
>>>>
>>>> It's probably not so much the cluster suite; it has to do with
>>>> the specific resource script. For a proper takeover of an IP you
>>>> have to do an ARP "deregister/register".
>>>> This will take a few seconds.
>>>>
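(For reference, the "register" half of that is just gratuitous ARP.
IPaddr2 uses its own send_arp helper, but the effect is roughly the
following sketch; the interface, address and count are made up:

  # bring the VIP up on the interface
  ip -f inet addr add 192.168.55.63/24 brd 192.168.55.255 dev eth0
  # announce the new MAC/IP mapping with unsolicited (gratuitous) ARPs
  arping -q -U -c 5 -I eth0 192.168.55.63

Repeating that announcement a few times per address is where the time
Robert mentions goes.)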
>> That the resource script is the bottleneck is apparently not true :-/
>> I have attached a portion of the lrmd log showing an example of this.
>> Notice that on the very first line it starts the vip_55.63 resource, and
>> then immediately on the next line it exits successfully.
>> Another point of note is that somehow, after the script has already exited,
>> lrmd logs the stderr output from it. I'm not sure if it's just delayed
>> logging or what. However, even if the script is still running, notice
>> that there is a huge time gap between 16:11:01 and 16:11:25 where it's
>> just sitting there doing nothing.
>> I even did a series of `ps` commands to watch for the processes, and
>> it starts up a bunch of them, then they all exit, and it sits
>> there for a long period before starting up more. So it is definitely
>> not the resource script slowing it down.
>>
>> Also, in the log, notice that it's only starting up a few scripts every
>> second. It should be able to fire off every single script at the exact
>> same time.
>>
>>>> As long as a resource script is busy, the cluster suite will not start the next action.
>>>> Parallel execution is not possible in the cluster suite as far as I know.
>>>> (Without being a programmer myself, I would expect it is pretty tricky to implement parallelization "code-wise" while making 100% sure the cluster does not break.)
>>>>
>>>> You could consider editing the IPaddr2 resource script so it does not wait for the arp commands.
>>>> At your own risk, of course ;-)
>>>>
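(If anyone does go down that road: "not waiting" essentially means
backgrounding the ARP announcement inside the agent, something like

  # sketch only, not the actual IPaddr2 code
  arping -q -U -c 5 -I eth0 "$OCF_RESKEY_ip" &

Check first whether your resource-agents version already exposes an
arp_bg parameter for exactly this; then no editing is needed.)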
>>>
>>> There is the cluster option "batch-limit" (in the CIB); see
>>> "Configuration Explained".
>>> And there is the lrmd setting "max-children" (it can be set in some
>>> /etc/default/ or /etc/sysconfig file and should be set by the init script).
>>> You can set it manually with lrmadmin -p max-children $some_number
>>> That should help you a bit.
>>> But don't overdo it. Raise them slowly ;-)
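(For the archives, both knobs can be changed at runtime; the numbers
below are only examples:

  # cluster-wide limit on how many actions the transition engine runs in parallel
  crm configure property batch-limit=50
  # per-node limit on how many resource agents lrmd runs concurrently
  lrmadmin -p max-children 30

The lrmadmin setting does not survive a restart, so also put it in the
/etc/default or /etc/sysconfig file your init script reads.)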
>>>
>>>
>>
>> batch-limit, it says, defaults to 30, which seems like a sane value. I
>> tried playing with max-children and upped it to 30 as well, but to
>> no effect. It does seem to be launching 30 instances of the IPaddr2
>> script at a time (as can be seen from the attached log), but the
>> problem is apparently that it's sitting there for long periods of time
>> before starting up the next batch. I would think that when one of the
>> 30 completes, it would launch another to take its place. But instead
>> it launches 30, then sits there for a while, then launches another 30.
Strange. Can you please file a bugzilla with an hb_report attached?
File it initially against the LRM component (product Linux-HA).
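Something along these lines should capture the interesting window (the
times are only an example; adjust them to your test):

  hb_report -f "2011/01/12 16:05" -t "2011/01/12 16:20" /tmp/standby-failover

and attach the resulting tarball to the bug.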
> Oh, and it's not that it's waiting for the resource to stop on the other
> node before it starts it up, either.
> Here's the lrmd log for resource vip_55.63 from the 'ha02' node (the
> node I put into standby):
> Jan 12 16:10:24 ha02 lrmd: [5180]: info: rsc:vip_55.63:1444: stop
> Jan 12 16:10:24 ha02 lrmd: [5180]: info: Managed vip_55.63:stop process
> 19063 exited with return code 0.
>
>
> And here's the lrmd log for the same resource on 'ha01':
> Jan 12 16:10:50 ha01 lrmd: [4707]: info: rsc:vip_55.63:1390: start
> Jan 12 16:10:50 ha01 lrmd: [4707]: info: Managed vip_55.63:start process
> 8826 exited with return code 0.
>
>
> Notice that it stopped it a full 36 seconds before it tried to start it
> on the other node. The times on both boxes are in sync, so it's not that
> either.
Is this the case when you wanted to fail over a single resource,
or was it part of the node standby process?
Thanks,
Dejan