[ClusterLabs] reducing peer node death detection time
Andrew Beekhof
andrew at beekhof.net
Mon Aug 3 22:11:03 EDT 2015
> On 25 Jun 2015, at 8:17 am, Nekrasov, Alexander <alexander.nekrasov at emc.com> wrote:
>
> Hello,
>
> The problem I’m facing: reducing the time between a node panic and the call to STONITH on the peer node in a two-node cluster. The documentation points to the token value in corosync.conf:
>
> totem {
>     version: 2
>     secauth: off
>     threads: 0
>     token: 1000
>     token_retransmits_before_loss_const: 1
>     join: 1000
>     consensus: 10000
>     interface {
>         ringnumber: 0
>         bindnetaddr: 128.221.255.100
>         mcastaddr: 226.94.1.1
>         mcastport: 5405
>     }
> }
>
> Setting token to 1s results in around 5 seconds from real node death to the STONITH call on the surviving node. Reducing it further, down to 100ms, doesn’t seem to have any effect. Is there a way to further reduce this delay?
There are other aspects to this:
- if the dead node was the DC, we’ll have to elect a new one
- the DC then needs to run the policy engine to determine the new ideal state of the cluster and calculate how to achieve it
- stonithd must query its peers to see who can fence the failed node
- only then can stonithd actually start talking to the fencing agent
- the agent will probably query the device’s view of the node’s state before turning it off

So there’s actually a fair bit to be done after the token timeout expires.
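
To put rough numbers on that, here is a minimal back-of-the-envelope sketch. It uses only the figures from the original post (about 5 seconds end to end with token set to 1000 ms) and assumes, purely for illustration, that the steps run one after another and that the token timeout contributes its full value to the total. The function names are hypothetical, and the result is an estimate, not a measurement.

```python
def non_token_overhead_ms(observed_total_ms: int, token_ms: int) -> int:
    """Time left over for everything except the token timeout
    (DC election, policy engine run, stonithd peer query, fencing agent).
    Assumes the phases are strictly sequential, which is a simplification."""
    return observed_total_ms - token_ms


def projected_total_ms(overhead_ms: int, new_token_ms: int) -> int:
    """Projected death-to-STONITH time if only the token value changes."""
    return overhead_ms + new_token_ms


# Figures from the original post: ~5 s total with token: 1000
overhead = non_token_overhead_ms(5000, 1000)
print(overhead)                           # 4000
print(projected_total_ms(overhead, 100))  # 4100
```

Under those assumptions, lowering token from 1000 to 100 can remove at most about 900 ms, so most of the delay would have to come out of the other steps listed above.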
>
> Thanks,
> Alexander
>
> corosync-debuginfo-1.4.7-0.19.6.8087.0.PTF.916981
> libcorosync4-1.4.7-0.19.6.8087.0.PTF.916981
> corosync-1.4.7-0.19.6.8087.0.PTF.916981
>
> pacemaker-debuginfo-1.1.11-0.7.53.7419.2.PTF.883076
> libpacemaker3-1.1.11-0.7.53.7419.2.PTF.883076
> pacemaker-1.1.11-0.7.53.7419.2.PTF.883076