[Pacemaker] token lost - need clarification

Marco Felettigh marco at nucleus.it
Wed Dec 18 05:23:11 EST 2013


On Tue, 17 Dec 2013 09:28:51 +0100
Michael Schwartzkopff <ms at sys4.de> wrote:

> On Tuesday, 17 December 2013, 09:17:31, marco at nucleus.it wrote:
> > Hi all,
> > I set up a 2-node cluster with a crossover cable between the two
> > nodes, without STONITH; I know this is not the best way, but this is
> > the scenario I needed at that time.
> > 
> > I know the releases are old:
> > corosync-1.2.7-1.2
> > libcorosync-1.2.7-1.2
> > pacemaker-1.0.10-1.4
> > libpacemaker3-1.0.10-1.4
> > 
> > Everything was OK for some days/months, but a few days ago something
> > went wrong between the two nodes without any network interruption
> > (no messages related to the ethernet modules, no errors in the
> > network statistics, and no notifications from the Nagios ping
> > checks).
> > 
> > From what I can understand from the attached logs:
> > Token Timeout (10000 ms) retransmit timeout (980 ms)
> > token hold (774 ms) retransmits before loss (10 retrans)
> > 
> > 
> > the 2 nodes lost a token and tried to resolve the situation, but
> > node1 thinks node2 is up:
> > 
> > Dec  7 05:01:41 node1 pengine: [1138]: info:
> > determine_online_status: Node node2 is online
> > Dec  7 05:01:41 node1 pengine: [1138]: info:
> > determine_online_status: Node node1 is online
> > 
> > and only later marks it as lost:
> > 
> > Dec  7 05:01:54 node1 corosync[1128]:   [pcmk  ] info:
> > ais_mark_unseen_peer_dead: Node node2 was not seen in the previous
> > transition
> > Dec  7 05:01:54 node1 corosync[1128]:   [pcmk  ] info:
> > update_member: Node 33559980/node2 is now: lost
> > 
> > while node2 thinks node1 is gone:
> > 
> > Dec  7 05:01:34 node2 corosync[6356]:   [pcmk  ] info:
> > ais_mark_unseen_peer_dead: Node node1 was not seen in the previous
> > transition
> > Dec  7 05:01:34 node2 corosync[6356]:   [pcmk  ] info:
> > update_member: Node 16782764/node1 is now: lost
> > 
> > Then they go into split brain.
> > Any suggestion as to why node1 still saw node2 at first, while
> > node2 immediately declared node1 lost?
> 
> This depends on who initiates the round. Both nodes recognized the
> failure within 20 seconds. This is OK, especially if you allow 10
> seconds for a token timeout.
> 
> Kind regards,
> 
> Michael Schwartzkopff
> 

OK, that is fine, but it is very strange that, without any network loss
between the nodes, they could not resend the token and later
re-establish quorum :( .
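
For reference, the retransmit and hold values in the quoted log line
appear to be derived from the token timeout rather than set separately.
A minimal sketch, assuming corosync 1.x computes unset timers from
token and token_retransmits_before_loss_const as in the formulas below
(my reading, not confirmed in this thread) and a kernel tick rate of
100 Hz; the function name is made up for illustration:

```python
def derived_totem_timers(token_ms, retransmits_before_loss, hz=100):
    """Return (retransmit timeout, hold timeout) in ms.

    Assumed derivation, not taken from this thread:
      retransmit = token / (retransmits_before_loss + 0.2)
      hold       = retransmit * 0.8 - 1000 / hz
    Both results are truncated to whole milliseconds.
    """
    retransmit_ms = int(token_ms / (retransmits_before_loss + 0.2))
    hold_ms = int(retransmit_ms * 0.8 - 1000 // hz)
    return retransmit_ms, hold_ms


# Values from the log: token timeout 10000 ms, 10 retransmits before loss.
retransmit_ms, hold_ms = derived_totem_timers(10000, 10)
print(retransmit_ms, hold_ms)
```

With the logged token timeout of 10000 ms and 10 retransmits this gives
980 ms and 774 ms, which matches the log, so only the token timeout and
the retransmit count would need to be tuned by hand.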

Marco



