[Pacemaker] STONITHed node cannot rejoin cluster for over 1000 elections

Fri Oct 12 10:53:13 EDT 2012

Hi Andrew, 

Thanks, I'll compile Pacemaker 1.1.8 and Corosync 1.4.4. Can I leave cluster-glue and resource-agents at the default versions provided with Ubuntu 12.04 (1.0.8 and 3.9.2 respectively), or do I need to upgrade them as well? 

Andrew 

----- Original Message -----

From: "Andrew Beekhof" <andrew at beekhof.net> 
To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org> 
Sent: Thursday, October 11, 2012 8:08:13 PM 
Subject: Re: [Pacemaker] STONITHed node cannot rejoin cluster for over 1000 elections 

On Fri, Oct 12, 2012 at 7:12 AM, Andrew Martin <amartin at xes-inc.com> wrote: 
> Hello, 
> 
> 
> I am running a 3-node Corosync+Pacemaker cluster with 2 "real" nodes running resources (storage0 and storage1) and a quorum node (storagequorum) in standby mode. All of the nodes run Ubuntu 12.04 server amd64. There are two corosync rings: 
> rrp_mode: active 
> 
> interface { 
> # the common LAN 
> ringnumber: 0 
> bindnetaddr: 10.10.0.0 
> mcastaddr: 226.94.1.1 
> mcastport: 5405 
> } 
> 
> interface { 
> # the STONITH network 
> ringnumber: 1 
> bindnetaddr: 192.168.7.0 
> mcastaddr: 226.94.1.2 
> mcastport: 5407 
> } 
> 
> DRBD is configured to use /usr/lib/drbd/crm-fence-peer.sh to fence the peer node. 
> 
> There are 3 active interfaces on storage[01]: the common LAN, the STONITH network, and the DRBD replication link. The storagequorum node only has the common LAN and STONITH networks. When looking through the logs, note that the IP addresses for each node are assigned as follows: 
> 
> storage0: xxx.xxx.xxx.148 
> storage1: xxx.xxx.xxx.149 
> storagequorum: xxx.xxx.xxx.24 
> 
> Storage0 and storage1 also had a secondary link to the common LAN which has now been disabled (xxx.xxx.xxx.162 and xxx.xxx.xxx.163 respectively). You still may see it show up in the log, e.g. 
> Oct 5 22:17:39 storagequorum crmd: [7873]: info: crm_update_peer: Node storage1: id=587281418 state=lost addr=r(0) ip(10.10.1.163) r(1) ip(192.168.7.149) votes=1 born=1828352 seen=1828368 proc=00000000000000000000000000111312 (new) 
> 
> Here is the CIB configuration: 
> http://pastebin.com/6TPkWtbt 
> 
> As you can see, the drbd-fence-by-handler-ms_drbd_drives primitive keeps getting added into the configuration but doesn't seem to get removed. 
> 
> 
> I recently tried running a failover test by performing "crm resource migrate g_store" when the resources were running on storage1. The ocf:heartbeat:exportfs resources failed to stop due to wait_for_leasetime_on_stop being true (I am going to set this to false now because I don't need NFSv4 support). Recognizing this problem, the cluster correctly STONITHed storage1 and migrated the resources to storage0. However, once storage1 finished rebooting, it was unable to join the cluster (crm_mon shows it as [offline]). I have uploaded the syslog from the DC (storagequorum) from this time period here: http://sources.xes-inc.com/downloads/storagequorum.syslog.log . Initially after the STONITH it seems like storage1 rejoins the cluster successfully: 
> Oct 5 22:17:39 storagequorum cib: [7869]: info: crm_update_peer: Node storage1: id=352400394 state=member (new) addr=r(0) ip(10.10.1.149) r(1) ip(192.168.7.149) (new) votes=1 born=1828384 seen=1828384 proc=00000000000000000000000000111312 
> 
> However, later it becomes apparent that it cannot join: 
> Oct 5 22:17:58 storagequorum crmd: [7873]: notice: do_election_count_vote: Election 15 (current: 15, owner: storagequorum): Processed no-vote from storage1 (Peer is not part of our cluster) 
> .... 
> Oct 6 03:49:58 storagequorum crmd: [18566]: notice: do_election_count_vote: Election 989 (current: 1, owner: storage1): Processed vote from storage1 (Peer is not part of our cluster) 
> 
> Around 1000 election cycles occur before storage1 is brought back into the cluster. What is the cause of this and how can I modify my cluster configuration to have nodes rejoin right away? 

It's not a configuration issue; you're hitting one or more bugs. 

You seem to be using 1.1.6, can I suggest an upgrade to 1.1.8? I 
recall fixing related issues in the last month or so. 
Also consider an updated corosync, there were some related fixes there too. 
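
For anyone checking their own syslog for the same symptom, the following is a minimal Python sketch that counts how many election votes from each peer were processed with the reason "Peer is not part of our cluster". The regular expression is inferred only from the two do_election_count_vote lines quoted above, not from Pacemaker's source, so other versions may log a different format. The function name summarize_elections is hypothetical and is not part of Pacemaker.

```python
import re
from collections import Counter

# Pattern inferred from the two sample crmd lines in this thread (assumption:
# other Pacemaker versions may format these messages differently).
ELECTION_RE = re.compile(
    r"do_election_count_vote: Election (?P<election>\d+) "
    r"\(current: (?P<current>\d+), owner: (?P<owner>[^)]+)\): "
    r"Processed (?P<kind>no-vote|vote) from (?P<peer>\S+) "
    r"\((?P<reason>[^)]*)\)"
)

NOT_MEMBER = "Peer is not part of our cluster"


def summarize_elections(lines):
    """Return (rejected, highest) for an iterable of syslog lines.

    rejected: Counter mapping peer name to the number of election messages
              from that peer that were processed with the NOT_MEMBER reason.
    highest:  dict mapping peer name to the highest election number seen.
    """
    rejected = Counter()
    highest = {}
    for line in lines:
        match = ELECTION_RE.search(line)
        if match is None:
            continue  # not an election line, skip it
        peer = match.group("peer")
        number = int(match.group("election"))
        highest[peer] = max(highest.get(peer, 0), number)
        if match.group("reason") == NOT_MEMBER:
            rejected[peer] += 1
    return rejected, highest
```

Passing an open log file (for example, a local copy of the DC's syslog) to summarize_elections gives a per-peer count of rejected votes. Comparing that count before and after an upgrade shows whether the node is still being treated as a non-member.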

> 
> 
> Thanks, 
> 
> Andrew Martin 
> 
> 
> _______________________________________________ 
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org 
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker 
> 
> Project Home: http://www.clusterlabs.org 
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> Bugs: http://bugs.clusterlabs.org 
