[Pacemaker] SLES 11 SP3 boothd behaviour
Sutherland, Rob
RSutherland at BroadViewNet.com
Mon Aug 25 21:43:34 CEST 2014
Hello all,
We're in the process of implementing geo-redundancy on SLES 11 SP3 (booth version 0.1.0). We are seeing behavior in which site 2 in a geo-cluster decides that the ticket has expired long before its actual expiry. Here's an example timeline:
1 - All sites (site 1, site 2 and arbitrator) agree on ticket owner and expiry. i.e. site 2 has the ticket with a 60-second expiry:
Aug 25 10:07:10 linux-4i31 booth-arbitrator: [22526]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975690' was executed
Aug 25 10:07:10 bb5Btas0 booth-site: [27782]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975690' was executed
Aug 25 10:07:10 bb5Atas1 booth-site: [7826]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975690' was executed
2 - After 48 seconds (80% into lease), all three nodes are still in agreement:
Site 2:
Aug 25 10:07:58 bb5Btas0 booth-site: [27782]: info: command: 'crm_ticket -t geo-ticket -S owner -v 2' was executed
Aug 25 10:07:58 bb5Btas0 booth-site: [27782]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975738' was executed
The arbitrator:
Aug 25 10:07:58 linux-4i31 crm_ticket[23836]: notice: crm_log_args: Invoked: crm_ticket -t geo-ticket -S owner -v 2
Aug 25 10:07:58 linux-4i31 booth-arbitrator: [22526]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975738' was executed
Site 1:
Aug 25 10:07:58 bb5Atas1 booth-site: [7826]: info: command: 'crm_ticket -t geo-ticket -S owner -v 2' was executed
Aug 25 10:07:58 bb5Atas1 booth-site: [7826]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975738' was executed
3 - Site 2 decides that the ticket has expired at the expiry time set in step 1, even though the lease was renewed in step 2:
Aug 25 10:08:10 bb5Btas0 booth-site: [27782]: debug: lease expires ...
4 - At 10:08:58 (the expiry time set in step 2), both site 1 and the arbitrator expire the lease and pick a new master.
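As a sanity check on the timeline above, the two "expires" values from the logs can be converted back to log time. This is only a minimal sketch: the UTC-4 offset is an assumption inferred by comparing the epoch values with the syslog timestamps, and expiry_to_log_time is a hypothetical helper name, not part of booth or Pacemaker.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the syslog timestamps are local time at UTC-4. This offset is
# inferred from the log excerpts, not stated anywhere in them.
LOG_TZ = timezone(timedelta(hours=-4))


def expiry_to_log_time(epoch_seconds):
    """Convert a crm_ticket 'expires' epoch value to an HH:MM:SS string."""
    return datetime.fromtimestamp(epoch_seconds, tz=LOG_TZ).strftime("%H:%M:%S")


first_expiry = 1408975690    # value logged at 10:07:10 (step 1)
renewed_expiry = 1408975738  # value logged at 10:07:58 (step 2)

print("step 1 expiry:", expiry_to_log_time(first_expiry))
print("step 2 expiry:", expiry_to_log_time(renewed_expiry))
print("difference in seconds:", renewed_expiry - first_expiry)
```

If the offset assumption holds, the step 1 value corresponds to 10:08:10 and the renewed value to 10:08:58, 48 seconds apart. That matches site 2 expiring the lease at the old expiry while site 1 and the arbitrator honor the renewed one.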
I presume that there was some missed communication between site 2 and the rest of the geo-cluster. There is nothing in the logs to help debug this, though. Any hints on debugging this?
BTW: we only ever see this on site 2, never on site 1. This is consistent across several labs. Is there a bias towards site 1?
Thanks in advance,
Rob