[Pacemaker] cluster-delay property

Fri Oct 25 08:24:20 UTC 2013

OK, lesson learned:

Until now we have had a common network for drbd and corosync.
We will split that:
one dedicated to drbd
one dedicated to corosync.
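
A quick sanity check for the planned split, as a minimal sketch: it only verifies that the two address ranges do not overlap. The function name and the example subnets are hypothetical and not taken from our setup. Non-overlapping subnets do not by themselves prove that the traffic uses separate physical links, so this complements a look at the cabling and switch configuration rather than replacing it.

```python
import ipaddress


def networks_are_separate(drbd_cidr, corosync_cidr):
    """Return True if the two networks share no addresses.

    Both arguments are CIDR strings such as "192.168.10.0/24".
    strict=False also accepts a host address with a prefix length.
    """
    drbd_net = ipaddress.ip_network(drbd_cidr, strict=False)
    corosync_net = ipaddress.ip_network(corosync_cidr, strict=False)
    return not drbd_net.overlaps(corosync_net)
```

With the hypothetical ranges 192.168.10.0/24 for drbd and 192.168.20.0/24 for corosync the check passes; a range nested inside the other one would be reported as not separate.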

Thanks to everybody

Karl

Quoting Michael Schwartzkopff <ms at sys4.de>:

> On Thursday, 24 October 2013, 10:06:20, you wrote:
>> On 24/10/13 09:01, Michael Schwartzkopff wrote:
>> > On Thursday, 24 October 2013, 14:39:39, Karl Rößmann wrote:
>> >> Sorry, let me try to explain.
>> >>
>> >> Hi
>> >>
>> >> In your book you describe a parameter 'deadtime' which defines
>> >> the timeout to declare a node as dead. I want to extend this
>> >> value to 120s to avoid such a scenario
>> >>
>> >> But: in the SuSE documentation I cannot find 'deadtime', instead
>> >> I see a value 'cluster-delay'. My Question is: Are these two
>> >> parameters equivalent ?
>> >>
>> >> More details about the scenario: The I/O load was created by me,
>> >> because I copied a large Xen image to a logical volume of the
>> >> cLVM (using 'dd'). I did it several times before without
>> >> problems. Maybe something changed after upgrading to SLES SP3.
>> >>
>> >> One node, (it was the DC) died, the Xen resources went to the
>> >> surviving node. Fine.
>> >>
>> >> No information in the log file.
>> >>
>> >> On the surviving node I see:
>> >> Oct 23 09:30:41 ha2infra corosync[9085]:  [TOTEM ] A processor failed, forming new configuration.
>> >
>> > (...)
>> >
>> > the log says that corosync did not see the node. This is not a
>> > pacemaker problem.
>> >
>> > I speculate that this happened because one node was heavily
>> > overloaded doing the dd and did not find time to process the
>> > corosync tokens. Or perhaps the load on the network was so high
>> > that corosync packets were dropped.
>> >
>> > Anyway: This is not a pacemaker problem, it is a corosync problem.
>> >
>> > If you want to make corosync behave in a more relaxed way,
>> > please see "man corosync.conf" for the options. Look for the
>> > token option and the options that follow it. I don't know which
>> > options are available in SLES11 HAE3. corosync is under heavy
>> > improvement ;-)
>> >
>> > If you have a question for a specific option please ask here on the
>> > list.
>>
>> I agree with Michael that this is a corosync problem. I also agree
>> that this is a congestion problem. The variable you are looking for is
>> token_retransmit, if I am correct.
>>
>> I would argue that the better solution is not to adjust this value,
>> but to fix your architecture to separate corosync/pacemaker traffic
>> from the disk/dd traffic. If you increase token_retransmit, you will
>> delay how long real failures take to be detected, thus slowing down
>> recovery.
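
To put rough numbers on that trade-off, here is a small sketch. It rests on my reading of corosync.conf(5), which should be checked against the man page of the installed version: the `token` timeout (rather than token_retransmit) is the time in milliseconds after which a token loss is declared, forming a new configuration takes about 50 ms on top of that, and `consensus` defaults to 1.2 * token when it is not set. The function name and the constant are made up for illustration, and the result ignores whatever time Pacemaker needs afterwards for fencing and recovery.

```python
# Assumption: approximate reconfiguration overhead quoted in corosync.conf(5).
RECONFIG_OVERHEAD_MS = 50


def estimate_timeouts(token_ms):
    """Return (approx. failure detection time, default consensus) in ms.

    token_ms is the value that would be set as 'token' in the totem section.
    """
    if token_ms <= 0:
        raise ValueError("token_ms must be positive")
    detection_ms = token_ms + RECONFIG_OVERHEAD_MS
    # 1.2 * token, written with integers to avoid floating-point rounding.
    consensus_ms = token_ms * 6 // 5
    return detection_ms, consensus_ms
```

Under these assumptions, a token value of 120000 (the 120 s asked about earlier) would mean that a node which has really died goes unnoticed for about two minutes before any recovery starts.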
>
> Of course, fiddling around with the token_retransmit option doesn't
> solve the problem. It just cures the symptoms.
>
> Perhaps you could limit the transfer rate of dd. Search for "dd rate
> limit"; there are several solutions. rsync/csync could be one of them.
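
One way to picture such a rate limit, as a minimal sketch and not one of the tools mentioned above: copy in chunks and sleep whenever the copy runs ahead of the allowed average rate. The function name, parameters and defaults are hypothetical. The sketch throttles only how fast data is handed to the kernel; because of write caching, the rate that actually reaches the disk or the replication link can still be bursty.

```python
import time


def copy_rate_limited(src, dst, max_bytes_per_sec, chunk_size=1024 * 1024,
                      clock=time.monotonic, sleep=time.sleep):
    """Copy from src to dst (binary file objects) at a bounded average rate.

    clock and sleep are injectable so the pacing can be tested without
    waiting in real time. Returns the number of bytes copied.
    """
    if max_bytes_per_sec <= 0:
        raise ValueError("max_bytes_per_sec must be positive")
    start = clock()
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        copied += len(chunk)
        # Earliest elapsed time at which this many bytes are allowed.
        earliest = copied / max_bytes_per_sec
        elapsed = clock() - start
        if elapsed < earliest:
            sleep(earliest - elapsed)
    return copied
```

It would be called with the image opened in "rb" mode, the target opened in "wb" mode, and a limit such as 20 * 1024 * 1024 bytes per second; a suitable limit depends on what the shared link and the disks can sustain.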
>
> Also you could think about improving your disk I/O sub-system.
>
> But you had better find out what the bottleneck in your system is and
> how to resolve it.
>
> Kind regards,
>
> Michael Schwartzkopff
>
> --
> [*] sys4 AG
>
> http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044
> Franziskanerstraße 15, 81669 München
>
> Registered office: München, District Court München: HRB 199263
> Executive board: Patrick Ben Koetter, Axel von der Ohe, Marc Schiffbauer
> Chairman of the supervisory board: Florian Kirstein

-- 
Karl Rößmann				Tel. +49-711-689-1657
Max-Planck-Institut FKF       		Fax. +49-711-689-1632
Postfach 800 665
70506 Stuttgart				email K.Roessmann at fkf.mpg.de