[Pacemaker] Serious issue with booth for site failover
Yves Trudeau
y.trudeau at videotron.ca
Sat Jan 19 17:19:46 UTC 2013
Hi,
Forget this, everything is fine. An iptables rule was missing in my
failure test.
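
For anyone repeating this test, here is a minimal sketch of a dry-run helper
that prints a symmetric rule set, assuming the goal is to cut the node off
from every peer subnet in both directions. The function name isolation_rules
is made up for illustration; the subnets come from the config quoted below.
One reading of the quoted command is that its last OUTPUT rule targets
10.3.2.0/24 where 10.3.3.0/24 was intended, but that is an assumption, not
something the logs confirm.

```shell
#!/bin/sh
# Print (dry run) the iptables commands that isolate this node from each peer
# subnet in both directions. Nothing is executed here; review the output and
# pipe it to "sh" as root to apply it.
isolation_rules() {
    for net in "$@"; do
        echo "iptables -I INPUT -s $net -j DROP"
        echo "iptables -I OUTPUT -d $net -j DROP"
    done
}

# On 10.3.2.10 the peers are the other site and the arbitrator.
isolation_rules 10.3.1.0/24 10.3.3.0/24
```

Generating the INPUT and OUTPUT rules from one list of subnets makes it
harder to end up with one direction still open.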
Regards,
Yves
On 2013-01-18 13:24, Yves Trudeau wrote:
> Hi,
> after learning about the Paxos protocol, I realize the problem is not with
> the arbitrator but with the surviving node. Here is its debug output:
>
> booth-site[2552]: 2013/01/18_11:26:36 debug: preposer prepare ...
> booth-site[2552]: 2013/01/18_11:26:36 debug: enter lease_prepare
> booth-site[2552]: 2013/01/18_11:26:36 debug: exit lease_prepare
> booth-site[2552]: 2013/01/18_11:26:36 debug: acceptor promise ...
> booth-site[2552]: 2013/01/18_11:26:36 debug: enter lease_promise
> booth-site[2552]: 2013/01/18_11:26:36 debug: enter start_lease_promise
> booth-site[2552]: 2013/01/18_11:26:36 debug: has not been leased
> booth-site[2552]: 2013/01/18_11:26:36 debug: exit start_lease_promise
> booth-site[2552]: 2013/01/18_11:26:36 debug: exit lease_promise
> booth-site[2552]: 2013/01/18_11:26:36 debug: proposer propose ...
> booth-site[2552]: 2013/01/18_11:26:36 debug: enter lease_is_prepared
> booth-site[2552]: 2013/01/18_11:26:36 debug: enter start_lease_is_prepared
> booth-site[2552]: 2013/01/18_11:26:36 debug: not leased
> booth-site[2552]: 2013/01/18_11:26:36 debug: exit lease_is_prepared
> booth-site[2552]: 2013/01/18_11:26:48 debug: lease_retry ...
> booth-site[2552]: 2013/01/18_11:26:48 debug: preposer prepare ...
> booth-site[2552]: 2013/01/18_11:26:48 debug: enter lease_prepare
> booth-site[2552]: 2013/01/18_11:26:48 debug: exit lease_prepare
> booth-site[2552]: 2013/01/18_11:26:48 debug: acceptor promise ...
> booth-site[2552]: 2013/01/18_11:26:48 debug: enter lease_promise
> booth-site[2552]: 2013/01/18_11:26:48 debug: enter start_lease_promise
> booth-site[2552]: 2013/01/18_11:26:48 debug: has not been leased
> booth-site[2552]: 2013/01/18_11:26:48 debug: exit start_lease_promise
> booth-site[2552]: 2013/01/18_11:26:48 debug: exit lease_promise
> booth-site[2552]: 2013/01/18_11:26:48 debug: proposer propose ...
> booth-site[2552]: 2013/01/18_11:26:48 debug: enter lease_is_prepared
> booth-site[2552]: 2013/01/18_11:26:48 debug: enter start_lease_is_prepared
> booth-site[2552]: 2013/01/18_11:26:48 debug: not leased
> booth-site[2552]: 2013/01/18_11:26:48 debug: exit lease_is_prepared
>
> Also, I don't know if it makes a difference, but the test VMs are 32-bit.
>
> Regards,
>
> Yves
>
> On 2013-01-18 11:49, Yves Trudeau wrote:
>> Hi,
>> working on a geo-redundant setup, I uncovered a problem with booth.
>> In order to simplify, I did an experiment with only booth, no
>> pacemaker. The behavior is the same with pacemaker.
>>
>> Version used
>> ------------
>>
>> git log
>> commit 55ab027233407fd44850f0c4905b085205d55f64
>> Author: Xia Li <xli at suse.com>
>> Date: Thu Jan 10 13:48:20 2013 +0800
>>
>> Config file
>> -----------
>>
>> transport="UDP"
>> port="6666"
>> arbitrator="10.3.3.1"
>> site="10.3.1.10"
>> site="10.3.2.10"
>> ticket="ticketMaster;120"
>>
>> * same on all nodes
>>
>> Invocations
>> -----------
>>
>> root@10.3.3.1:~# /usr/sbin/boothd arbitrator -D
>>
>> root@10.3.1.10:~# /usr/sbin/boothd site -D
>>
>> root@10.3.2.10:~# /usr/sbin/boothd site -D
>>
>> Initial state
>> -------------
>>
>> root@10.3.3.1:~# booth client list
>> ticket: ticketMaster, owner: None, expires: INF
>>
>> * same on all 3 nodes
>>
>> Granting the ticket
>> -------------------
>>
>> root@10.3.3.1:~# booth client grant -t ticketMaster -s 10.3.2.10
>> cluster[25103]: 2013/01/18_11:16:35 info: grant command sent, result
>> will be returned asynchronously, you can get the result from the log
>> files
>>
>> Status after grant
>> ------------------
>>
>> root@10.3.3.1:~# booth client list
>> ticket: ticketMaster, owner: 10.3.2.10, expires: 2013/01/18 11:20:11
>>
>> * same on all 3 nodes, so far so good
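
Comparing the three outputs by eye gets tedious over repeated runs. Below is
a minimal sketch of a helper that pulls the owner out of a line in the format
shown above and checks that all collected lines agree. The function names
ticket_owner and all_owned_by are hypothetical, and the parsing assumes the
output format of this particular booth version.

```shell
#!/bin/sh
# Extract the owner field from "booth client list" output in the format
# shown in this thread: "ticket: NAME, owner: OWNER, expires: WHEN".
ticket_owner() {
    sed -n 's/^ticket: [^,]*, owner: \([^,]*\), expires: .*$/\1/p'
}

# Succeed only if every ticket line read on stdin reports the expected owner.
# An empty input counts as a failure.
all_owned_by() {
    expected=$1
    owners=$(ticket_owner | sort -u)
    [ "$owners" = "$expected" ]
}

echo "ticket: ticketMaster, owner: 10.3.2.10, expires: 2013/01/18 11:20:11" | ticket_owner
```

The lines from the three nodes can be gathered in whatever way suits the
setup and piped into all_owned_by 10.3.2.10; how they are collected is left
out on purpose.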
>>
>> Simulating a network outage on 10.3.2.10
>> ----------------------------------------
>>
>> root@10.3.2.10:~# iptables -I INPUT -s 10.3.1.0/24 -j DROP; \
>>     iptables -I INPUT -s 10.3.3.0/24 -j DROP; \
>>     iptables -I OUTPUT -d 10.3.1.0/24 -j DROP; \
>>     iptables -I OUTPUT -d 10.3.2.0/24 -j DROP
>>
>> After the outage, here are the last lines of the arbitrator's log:
>>
>> booth-arbitrator[25055]: 2013/01/18_11:26:47 debug: exit start_lease_promise
>> booth-arbitrator[25055]: 2013/01/18_11:26:47 debug: exit lease_promise
>> booth-arbitrator[25055]: 2013/01/18_11:26:47 debug: acceptor promise ...
>> booth-arbitrator[25055]: 2013/01/18_11:26:47 debug: ballot number: 4, highest promised: 5
>> booth-arbitrator[25055]: 2013/01/18_11:28:11 debug: lease expires ...
>> booth-arbitrator[25055]: 2013/01/18_11:28:11 info: command: 'crm_ticket -t ticketMaster -S owner -v -1' was executed
>> Error signing on to the CIB service: connection failed
>> booth-arbitrator[25055]: 2013/01/18_11:28:11 info: command: 'crm_ticket -t ticketMaster -S expires -v 0' was executed
>> Error signing on to the CIB service: connection failed
>> booth-arbitrator[25055]: 2013/01/18_11:28:11 info: command: 'crm_ticket -t ticketMaster -S ballot -v 2' was executed
>> Error signing on to the CIB service: connection failed
>> booth-arbitrator[25055]: 2013/01/18_11:28:11 info: command: 'crm_ticket -t ticketMaster -r --force' was executed
>> Error signing on to the CIB service: connection failed
>> booth-arbitrator[25055]: 2013/01/18_11:28:11 debug: only proposer can do this
>> booth-arbitrator[25055]: 2013/01/18_11:28:23 debug: lease_retry ...
>> booth-arbitrator[25055]: 2013/01/18_11:28:23 debug: only proposer can do this
>>
>> and of course:
>>
>> root@10.3.1.10:~# booth client list
>> ticket: ticketMaster, owner: None, expires: INF
>>
>> The debug message "only proposer can do this" comes from the
>> paxos_round_request function in paxos.c, which has this condition:
>>
>> if (!(pi->ps->role[myid] & PROPOSER)) {
>>         log_debug("only proposer can do this");
>>         return -EOPNOTSUPP;
>> }
>>
>> So my guess is that the PROPOSER bit is not set correctly in the structure
>> in the lease_expires and lease_retry functions. I am not very familiar
>> with that code base, but I'll try to figure out the issue and submit a
>> patch on GitHub.
>>
>> Regards,
>>
>> Yves
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>