[Pacemaker] SLES 11 SP3 boothd behaviour
Sutherland, Rob
RSutherland at BroadViewNet.com
Wed Aug 27 15:31:21 CEST 2014
-----Original Message-----
From: Dejan Muhamedagic [mailto:dejanmm at fastmail.fm]
Sent: Wednesday, August 27, 2014 5:04 AM
To: pacemaker at oss.clusterlabs.org
Subject: Re: [Pacemaker] SLES 11 SP3 boothd behaviour
Hi,
On Mon, Aug 25, 2014 at 07:43:34PM +0000, Sutherland, Rob wrote:
> Hello all,
>
> We're in the process of implementing geo-redundancy on SLES 11 SP3 (booth version 0.1.0). We are seeing behavior in which site 2 in a geo-cluster decides that the ticket has expired long before the actual expiry. Here's an example timeline:
>
> 1 - All sites (site 1, site 2 and arbitrator) agree on ticket owner and expiry. i.e. site 2 has the ticket with a 60-second expiry:
> Aug 25 10:07:10 linux-4i31 booth-arbitrator: [22526]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975690' was executed
> Aug 25 10:07:10 bb5Btas0 booth-site: [27782]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975690' was executed
> Aug 25 10:07:10 bb5Atas1 booth-site: [7826]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975690' was executed
>
> 2 - After 48 seconds (80% into lease), all three nodes are still in agreement:
> Site 2:
> Aug 25 10:07:58 bb5Btas0 booth-site: [27782]: info: command: 'crm_ticket -t geo-ticket -S owner -v 2' was executed
> Aug 25 10:07:58 bb5Btas0 booth-site: [27782]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975738' was executed
>
> The arbitrator:
> Aug 25 10:07:58 linux-4i31 crm_ticket[23836]: notice: crm_log_args: Invoked: crm_ticket -t geo-ticket -S owner -v 2
> Aug 25 10:07:58 linux-4i31 booth-arbitrator: [22526]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975738' was executed
>
> Site 1:
> Aug 25 10:07:58 bb5Atas1 booth-site: [7826]: info: command: 'crm_ticket -t geo-ticket -S owner -v 2' was executed
> Aug 25 10:07:58 bb5Atas1 booth-site: [7826]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975738' was executed
>
> 3 - Site 2 decides that the ticket has expired (at the expiry time set in step 1):
> Aug 25 10:08:10 bb5Btas0 booth-site: [27782]: debug: lease expires ...
Strange. Hard to say what happened. It's as if the previous timer somehow survived and triggered the lease expiration.
[Rob] Yes, that was my thought, too.
> 4 - At 10:08:58, both site 1 and the arbitrator expire the lease and pick a new master.
>
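The timeline arithmetic can be checked directly from the epoch values passed to crm_ticket in the logs above. The sketch below is only an illustration: the UTC-4 offset is an assumption inferred from comparing the log stamps with the epoch values (it is not stated anywhere in the thread), and the helper name `local` is made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the log hosts run at UTC-4. This is inferred from the
# logged times and epoch values, not stated in the thread.
LOG_TZ = timezone(timedelta(hours=-4))

FIRST_EXPIRY = 1408975690    # value set at 10:07:10 (step 1)
RENEWED_EXPIRY = 1408975738  # value set at 10:07:58 (step 2)

def local(epoch):
    """Render an epoch value as HH:MM:SS in the assumed log time zone."""
    return datetime.fromtimestamp(epoch, LOG_TZ).strftime("%H:%M:%S")

first_local = local(FIRST_EXPIRY)
renewed_local = local(RENEWED_EXPIRY)
renewal_shift = RENEWED_EXPIRY - FIRST_EXPIRY  # seconds added by the renewal

print(first_local, renewed_local, renewal_shift)
# prints: 10:08:10 10:08:58 48
```

Under that assumption, the expiry logged by site 2 in step 3 (10:08:10) matches the value from step 1, while the expiry in step 4 (10:08:58) matches the renewed value from step 2.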
> I presume that there was some missed communication between site 2 and the rest of the geo-cluster. There is nothing in the logs to help debug this, though. Any hints on debugging this?
Looks like you already turned debug on. Wasn't there any more debug output? You could also try to watch the wire and see if the hosts can communicate.
[Rob] We are already analyzing some of that.
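One possible way to look for the same pattern in larger logs is to compare each "lease expires" debug message against the most recent expiry value the same host recorded. This is a rough sketch, not part of booth: the function names are hypothetical, and because syslog stamps carry neither a year nor a time zone, `YEAR` and `LOG_TZ` are assumptions that would need adjusting per site.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumptions (not from the thread): the year and the UTC offset of the hosts.
YEAR = 2014
LOG_TZ = timezone(timedelta(hours=-4))

STAMP = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) (\S+) ")
EXPIRES = re.compile(r"-S expires -v (\d+)")

def to_epoch(stamp):
    """Convert a syslog stamp such as 'Aug 25 10:07:58' to epoch seconds."""
    parsed = datetime.strptime(f"{YEAR} {stamp}", "%Y %b %d %H:%M:%S")
    return int(parsed.replace(tzinfo=LOG_TZ).timestamp())

def find_premature_expiries(lines):
    """Return (host, seconds_early) for every 'lease expires' message that
    was logged before the latest expiry recorded by the same host."""
    latest = {}    # host -> most recent expiry value seen in its log
    findings = []
    for line in lines:
        head = STAMP.match(line)
        if not head:
            continue
        stamp, host = head.groups()
        setting = EXPIRES.search(line)
        if setting:
            latest[host] = int(setting.group(1))
        elif "lease expires" in line and host in latest:
            early = latest[host] - to_epoch(stamp)
            if early > 0:
                findings.append((host, early))
    return findings

# Site 2 log lines taken from the report above.
sample = [
    "Aug 25 10:07:10 bb5Btas0 booth-site: [27782]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975690' was executed",
    "Aug 25 10:07:58 bb5Btas0 booth-site: [27782]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975738' was executed",
    "Aug 25 10:08:10 bb5Btas0 booth-site: [27782]: debug: lease expires ...",
]
print(find_premature_expiries(sample))
```

Run against the site 2 lines from the report, this flags the expiry at 10:08:10 as 48 seconds early relative to the renewed lease.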
> BTW: we only ever see this on a site 2 - never a site 1. This is consistent across several labs. Is there a bias towards site 1?
No, I don't think there is.
[Rob] It's odd, then, that we only ever see fail-overs from site 2 to site 1. We'll have to diversify our testing.
If you have a support contract, I'd suggest opening a support ticket with SUSE.
[Rob] Already done. It's unfortunate that SUSE only ships booth version 0.1.0.
Thanks,
Dejan
> Thanks in advance,
>
> Rob
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org