[Pacemaker] STONITHed node cannot rejoin cluster for over 1000 elections

Tue Oct 16 15:46:56 CEST 2012

Andrew, 

The pacemaker package built by Ubuntu requires the following dependencies (besides corosync, resource-agents, and cluster-glue): 
*libccs3 libcib1 *libcman3 libcrmcluster1 libcrmcommon2 *libesmtp6 *libfence4 libpe-rules2 libpe-status3 libpengine3 libstonithd1 libtransitioner1 

It appears that compiling pacemaker from source provides all of these libraries except those marked with an asterisk above. I can install libesmtp6 and libcman3 without a problem, but libfence4 (and its dependency libccs3) require libconfdb4 and libcoroipcc4, both of which are now part of the corosync 1.4.4 build that I compiled. Do I also need to build libccs3 and libfence4, or are these libraries deprecated in pacemaker 1.1.8? I don't see them listed at https://github.com/ClusterLabs/pacemaker/blob/master/README.markdown.

Similarly, the Ubuntu corosync package requires the following: libcfg4 libconfdb4 libcoroipcc4 libcoroipcs4 libcpg4 libevs4 liblogsys4 libpload4 libquorum4 libsam4 libtotem-pg4 libvotequorum4. However, all of these appear to be built as part of corosync when it is compiled from source.
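
To make the asterisk convention in the pacemaker list above explicit, here is a minimal Python sketch that splits that list into the libraries the source build appeared to produce and those it did not. The list is copied from this message. The names split_marked and UBUNTU_PACEMAKER_DEPS are only illustrative, and the script does not inspect the system in any way; it just restates the comparison.

```python
# Dependency list exactly as written in this message. A leading "*" marks a
# library that the pacemaker source build did not appear to produce.
UBUNTU_PACEMAKER_DEPS = (
    "*libccs3 libcib1 *libcman3 libcrmcluster1 libcrmcommon2 *libesmtp6 "
    "*libfence4 libpe-rules2 libpe-status3 libpengine3 libstonithd1 "
    "libtransitioner1"
)


def split_marked(dep_list):
    """Split an asterisk-marked list into (built_from_source, not_built)."""
    built, not_built = [], []
    for name in dep_list.split():
        if name.startswith("*"):
            # Strip the marker so only the library name is kept.
            not_built.append(name[1:])
        else:
            built.append(name)
    return built, not_built


built, not_built = split_marked(UBUNTU_PACEMAKER_DEPS)
print("built from source:    ", " ".join(built))
print("not built from source:", " ".join(not_built))
```

The second list is the set I am asking about: libesmtp6 and libcman3 install cleanly, while libccs3 and libfence4 are the two in question.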

Thanks, 

Andrew 

----- Original Message -----

From: "Andrew Beekhof" <andrew at beekhof.net> 
To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org> 
Sent: Monday, October 15, 2012 5:31:51 AM 
Subject: Re: [Pacemaker] STONITHed node cannot rejoin cluster for over 1000 elections 

On Sat, Oct 13, 2012 at 1:53 AM, Andrew Martin <amartin at xes-inc.com> wrote: 
> Hi Andrew, 
> 
> Thanks, I'll compile Pacemaker 1.1.8 and Corosync 1.4.4. Can I leave 
> cluster-glue and resource-agents at the default versions provided with 
> Ubuntu 12.04 (1.0.8 and 3.9.2 respectively), or do I need to upgrade them as 
> well? 

Should be fine. You will need to obtain a recent libqb build though. 

> 
> Andrew 
> 
> ________________________________ 
> From: "Andrew Beekhof" <andrew at beekhof.net> 
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org> 
> Sent: Thursday, October 11, 2012 8:08:13 PM 
> Subject: Re: [Pacemaker] STONITHed node cannot rejoin cluster for over 1000 
> elections 
> 
> 
> On Fri, Oct 12, 2012 at 7:12 AM, Andrew Martin <amartin at xes-inc.com> wrote: 
>> Hello, 
>> 
>> 
>> I am running a 3-node Corosync+Pacemaker cluster with 2 "real" nodes 
>> running resources (storage0 and storage1) and a quorum node (storagequorum) 
>> in standby mode. All of the nodes run Ubuntu 12.04 server amd64. There are 
>> two corosync rings: 
>> rrp_mode: active 
>> 
>> interface {
>>     # the common LAN
>>     ringnumber: 0
>>     bindnetaddr: 10.10.0.0
>>     mcastaddr: 226.94.1.1
>>     mcastport: 5405
>> }
>>
>> interface {
>>     # the STONITH network
>>     ringnumber: 1
>>     bindnetaddr: 192.168.7.0
>>     mcastaddr: 226.94.1.2
>>     mcastport: 5407
>> }
>> 
>> DRBD is configured to use /usr/lib/drbd/crm-fence-peer.sh to fence the 
>> peer node. 
>> 
>> There are 3 active interfaces on storage[01]: the common LAN, the STONITH 
>> network, and the DRBD replication link. The storagequorum node only has the 
>> common LAN and STONITH networks. When looking through the logs, note that 
>> the IP addresses for each node are assigned as follows: 
>> 
>> storage0: xxx.xxx.xxx.148 
>> storage1: xxx.xxx.xxx.149 
>> storagequorum: xxx.xxx.xxx.24 
>> 
>> Storage0 and storage1 also had a secondary link to the common LAN which 
>> has now been disabled (xxx.xxx.xxx.162 and xxx.xxx.xxx.163 respectively). 
>> You may still see these addresses show up in the log, e.g.:
>> Oct 5 22:17:39 storagequorum crmd: [7873]: info: crm_update_peer: Node 
>> storage1: id=587281418 state=lost addr=r(0) ip(10.10.1.163) r(1) 
>> ip(192.168.7.149) votes=1 born=1828352 seen=1828368 
>> proc=00000000000000000000000000111312 (new) 
>> 
>> Here is the CIB configuration: 
>> http://pastebin.com/6TPkWtbt 
>> 
>> As you can see, the drbd-fence-by-handler-ms_drbd_drives primitive keeps 
>> getting added into the configuration but doesn't seem to get removed. 
>> 
>> 
>> I recently tried running a failover test by performing "crm resource 
>> migrate g_store" when the resources were running on storage1. The 
>> ocf:heartbeat:exportfs resources failed to stop due to 
>> wait_for_leasetime_on_stop being true (I am going to set this to false now 
>> because I don't need NFSv4 support). Recognizing this problem, the cluster 
>> correctly STONITHed storage1 and migrated the resources to storage0. 
>> However, once storage1 finished rebooting, it was unable to join the cluster 
>> (crm_mon shows it as [offline]). I have uploaded the syslog from the DC 
>> (storagequorum) from this time period here: 
>> http://sources.xes-inc.com/downloads/storagequorum.syslog.log . Initially 
>> after the STONITH it seems like storage1 rejoins the cluster successfully: 
>> Oct 5 22:17:39 storagequorum cib: [7869]: info: crm_update_peer: Node 
>> storage1: id=352400394 state=member (new) addr=r(0) ip(10.10.1.149) r(1) 
>> ip(192.168.7.149) (new) votes=1 born=1828384 seen=1828384 
>> proc=00000000000000000000000000111312 
>> 
>> However, later it becomes apparent that it cannot join: 
>> Oct 5 22:17:58 storagequorum crmd: [7873]: notice: 
>> do_election_count_vote: Election 15 (current: 15, owner: storagequorum): 
>> Processed no-vote from storage1 (Peer is not part of our cluster) 
>> .... 
>> Oct 6 03:49:58 storagequorum crmd: [18566]: notice: 
>> do_election_count_vote: Election 989 (current: 1, owner: storage1): 
>> Processed vote from storage1 (Peer is not part of our cluster) 
>> 
>> Around 1000 election cycles occur before storage1 is brought back into the 
>> cluster. What is the cause of this and how can I modify my cluster 
>> configuration to have nodes rejoin right away? 
> 
> It's not a configuration issue; you're hitting one or more bugs.
> 
> You seem to be using 1.1.6. Can I suggest an upgrade to 1.1.8? I
> recall fixing related issues in the last month or so.
> Also consider an updated corosync; there were some related fixes there too.
> 
>> 
>> 
>> Thanks, 
>> 
>> Andrew Martin 
>> 
>> 
>> _______________________________________________ 
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org 
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker 
>> 
>> Project Home: http://www.clusterlabs.org 
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>> Bugs: http://bugs.clusterlabs.org 
> 

