[ClusterLabs] reducing peer node death detection time

Tue Aug 4 02:11:03 UTC 2015

> On 25 Jun 2015, at 8:17 am, Nekrasov, Alexander <alexander.nekrasov at emc.com> wrote:
> 
> Hello,
>  
> The problem I’m facing: reducing the time between a node panic and the call to STONITH on the peer node in a two-node cluster. The documentation points to the token value in corosync.conf:
>  
> totem {
>         version: 2
>         secauth: off
>         threads: 0
>         token:   1000
>         token_retransmits_before_loss_const: 1
>         join:    1000
>         consensus:      10000
>         interface {
>                 ringnumber: 0
>                 bindnetaddr: 128.221.255.100
>                 mcastaddr: 226.94.1.1
>                 mcastport: 5405
>         }
> }
>  
> With token set to 1 s, it takes around 5 seconds from the actual node death to the STONITH call on the surviving node. Reducing it further, down to 100 ms, doesn’t seem to have any effect. Is there a way to reduce this delay further?

There are other aspects to this:

- if the dead node was the DC, we’ll have to elect a new one
- the DC then needs to run the policy engine to determine the new ideal state of the cluster and calculate how to achieve it
- stonithd must query its peers to see who can fence the failed node
- only then can stonithd actually start talking to the fencing agent
- the agent will probably query the device’s view of the node’s state before turning it off

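So there’s actually a fair bit to be done between detecting the failure and the node being fenced, and the token timeout is only one part of it.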
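
To illustrate why lowering token stops helping at some point, the stages listed above can be added up as a rough latency budget. This is only a sketch: the function name, the stage names and every duration are hypothetical placeholders chosen for illustration, not measurements from corosync or pacemaker.

```python
# Rough latency budget for the stages listed above.
# Every duration here is an assumed placeholder, not a measured value.

def fencing_delay_ms(token_ms, stages_ms, dead_node_was_dc=True):
    """Return the token timeout plus the cost of each later stage.

    The DC election cost is only counted when the dead node was the DC.
    """
    total = token_ms
    for name, cost in stages_ms.items():
        if name == "dc_election" and not dead_node_was_dc:
            continue
        total += cost
    return total

# Hypothetical per-stage costs in milliseconds (illustration only).
STAGES_MS = {
    "dc_election": 1000,
    "policy_engine_run": 500,
    "stonithd_peer_query": 1000,
    "fencing_agent_startup": 500,
    "agent_status_check": 1000,
}

for token_ms in (1000, 100):
    total = fencing_delay_ms(token_ms, STAGES_MS)
    print(f"token={token_ms}ms -> roughly {total}ms until the node is fenced")
```

With these made-up numbers, cutting token from 1000 to 100 only moves the total from 5000 ms to 4100 ms, because the fixed stages dominate. Timing each stage from the logs on the surviving node would show which of them is actually worth tuning.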

>  
> Thanks,
> Alexander
>  
> corosync-debuginfo-1.4.7-0.19.6.8087.0.PTF.916981
> libcorosync4-1.4.7-0.19.6.8087.0.PTF.916981
> corosync-1.4.7-0.19.6.8087.0.PTF.916981
>  
> pacemaker-debuginfo-1.1.11-0.7.53.7419.2.PTF.883076
> libpacemaker3-1.1.11-0.7.53.7419.2.PTF.883076
> pacemaker-1.1.11-0.7.53.7419.2.PTF.883076