[Pacemaker] Need to relax corosync due to backup of VM through snapshot

Thu Nov 21 08:26:58 EST 2013

On Thu, Nov 21, 2013 at 9:09 AM, Lars Marowsky-Bree wrote:
> On 2013-11-20T16:58:01, Gianluca Cecchi <gianluca.cecchi at gmail.com> wrote:
>
>> Based on the docs I thought that the timeout should be
>>
>> token x token_retransmits_before_loss_const
>
> No, the comments in the corosync.conf.example and man corosync.conf
> should be pretty clear, I hope. Can you recommend which phrasing we
> should improve?

I have not understood the exact relationship between token and
token_retransmits_before_loss_const:
when one comes into play and when the other does.
So perhaps the second one could be documented in more detail,
or with some web links.
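
A rough sketch of how the two settings seem to relate, as far as I can
tell from corosync.conf(5): token is the whole timeout before a token
loss is declared, and token_retransmits_before_loss_const only decides
how many retransmits are attempted inside that window. So the constant
divides the token value rather than multiplying it. The exact formula
below (including the 0.2 term) is my assumption about how corosync
derives the retransmit interval, and the function name is only
illustrative.

```python
def token_retransmit_ms(token_ms, retransmits_before_loss=4):
    """Approximate interval between token retransmits, in milliseconds.

    Assumption: corosync derives the interval as
    token / (token_retransmits_before_loss_const + 0.2), truncated to an
    integer. Check this against your corosync version before relying on it.
    """
    if token_ms <= 0 or retransmits_before_loss <= 0:
        raise ValueError("both values must be positive")
    return int(token_ms / (retransmits_before_loss + 0.2))


if __name__ == "__main__":
    # (token, token_retransmits_before_loss_const) pairs: the documented
    # defaults, the values recalled from older threads, and my test value
    # combined with the default retransmit count.
    for token_ms, retransmits in [(1000, 4), (20000, 10), (120000, 4)]:
        interval = token_retransmit_ms(token_ms, retransmits)
        print(f"token={token_ms} retransmits={retransmits} "
              f"-> retransmit every ~{interval} ms")
```

With the defaults (token 1000, 4 retransmits) this gives 238 ms, which
matches the default token_retransmit value given in the man page. In
every case the node is declared lost after token milliseconds, not after
token x retransmits.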

>
>> So my current test config is:
>>   # diff corosync.conf corosync.conf.pre181113
>> 24,25c24
>> < #token: 5000
>> < token: 120000
>
> A 120s node timeout? That is really, really long. Why is the backup tool
> interfering with the scheduling of high priority processes so much? That
> sounds like the real bug.

In fact I inherited the analysis for a previous production cluster, and I'm
setting up a test environment to demonstrate that one realistic
outcome could well be that a cluster is not the right solution,
because the underlying infrastructure is not stable enough.
I'm not given much visibility into the VMware and SAN details,
but I'm pressing to get them.
I sometimes saw disk latencies going up to 8000 milliseconds... ;-(
So another possible outcome could be to build a more reliable
infrastructure before going with a cluster.
I'm deliberately setting high values to see what happens, and will lower
them step by step.
BTW: I remember some past threads where others had problems
with NetBackup (or similar backup software) using snapshots, and where
setting higher values solved the sporadic problems (possibly 20000 for
token and 10 for retransmits, but I couldn't find them...)

>
>> Any comment?
>> Any different strategies successfully used in similar environments
>> where high latencies get in place at snapshot deletion when
>> consolidate phase of disks is executed?
>
> A setup where a VM apparently can freeze for almost 120s is not suitable
> for HA.
>

I see from previous logs that DRBD sometimes disconnects and reconnects
only after 30-40 seconds with default timeouts...

Thanks for your inputs.

Gianluca