<html><head><style type='text/css'>p { margin: 0; }</style></head><body><div style='font-family: Times New Roman; font-size: 12pt; color: #000000'>Andrew,<br><br>The pacemaker package built by Ubuntu requires the following dependencies (besides corosync, resource-agents, and cluster-glue):<br>*libccs3 libcib1 *libcman3 libcrmcluster1 libcrmcommon2 *libesmtp6 *libfence4 libpe-rules2 libpe-status3 libpengine3 libstonithd1 libtransitioner1 <br><br>It appears that compiling pacemaker from source includes all of these dependencies except those marked with an asterisk above. I can install libesmtp6 and libcman3 without a problem, but libfence4 (and its dependency libccs3) require libconfdb4 and libcoroipcc4, which are now both part of the corosync 1.4.4 package that I compiled. Do I also need to build libccs3 and libfence4, or are these libraries deprecated in pacemaker 1.1.8 (I don't see them listed on https://github.com/ClusterLabs/pacemaker/blob/master/README.markdown)?<br><br>Similarly, the Ubuntu corosync package requires the following: libcfg4 libconfdb4 libcoroipcc4 libcoroipcs4 libcpg4 libevs4 liblogsys4 libpload4 libquorum4 libsam4 libtotem-pg4 libvotequorum4. However, all of these appear to be built into corosync when compiled from source.<br><br>Thanks,<br><br>Andrew<br><br><hr id="zwchr"><div style="color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;"><b>From: </b>"Andrew Beekhof" &lt;andrew@beekhof.net&gt;<br><b>To: </b>"The Pacemaker cluster resource manager" &lt;pacemaker@oss.clusterlabs.org&gt;<br><b>Sent: </b>Monday, October 15, 2012 5:31:51 AM<br><b>Subject: </b>Re: [Pacemaker] STONITHed node cannot rejoin cluster for over 1000 elections<br><br>On Sat, Oct 13, 2012 at 1:53 AM, Andrew Martin &lt;amartin@xes-inc.com&gt; wrote:<br>&gt; Hi Andrew,<br>&gt;<br>&gt; Thanks, I'll compile Pacemaker 1.1.8 and Corosync 1.4.4. 
Can I leave<br>&gt; cluster-glue and resource-agents at the default versions provided with<br>&gt; Ubuntu 12.04 (1.0.8 and 3.9.2 respectively), or do I need to upgrade them as<br>&gt; well?<br><br>Should be fine. You will need to obtain a recent libqb build though.<br><br>&gt;<br>&gt; Andrew<br>&gt;<br>&gt; ________________________________<br>&gt; From: "Andrew Beekhof" &lt;andrew@beekhof.net&gt;<br>&gt; To: "The Pacemaker cluster resource manager" &lt;pacemaker@oss.clusterlabs.org&gt;<br>&gt; Sent: Thursday, October 11, 2012 8:08:13 PM<br>&gt; Subject: Re: [Pacemaker] STONITHed node cannot rejoin cluster for over 1000<br>&gt; elections<br>&gt;<br>&gt;<br>&gt; On Fri, Oct 12, 2012 at 7:12 AM, Andrew Martin &lt;amartin@xes-inc.com&gt; wrote:<br>&gt;&gt; Hello,<br>&gt;&gt;<br>&gt;&gt;<br>&gt;&gt; I am running a 3-node Corosync+Pacemaker cluster with 2 "real" nodes<br>&gt;&gt; running resources (storage0 and storage1) and a quorum node (storagequorum)<br>&gt;&gt; in standby mode. All of the nodes run Ubuntu 12.04 server amd64. 
There are<br>&gt;&gt; two corosync rings:<br>&gt;&gt; &nbsp; &nbsp; &nbsp; &nbsp; rrp_mode: active<br>&gt;&gt;<br>&gt;&gt; &nbsp; &nbsp; &nbsp; &nbsp; interface {<br>&gt;&gt; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # the common LAN<br>&gt;&gt; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ringnumber: 0<br>&gt;&gt; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; bindnetaddr: 10.10.0.0<br>&gt;&gt; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; mcastaddr: 226.94.1.1<br>&gt;&gt; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; mcastport: 5405<br>&gt;&gt; &nbsp; &nbsp; &nbsp; &nbsp; }<br>&gt;&gt;<br>&gt;&gt; &nbsp; &nbsp; &nbsp; &nbsp; interface {<br>&gt;&gt; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # the STONITH network<br>&gt;&gt; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ringnumber: 1<br>&gt;&gt; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; bindnetaddr: 192.168.7.0<br>&gt;&gt; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; mcastaddr: 226.94.1.2<br>&gt;&gt; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; mcastport: 5407<br>&gt;&gt; &nbsp; &nbsp; &nbsp; &nbsp; }<br>&gt;&gt;<br>&gt;&gt; DRBD is configured to use /usr/lib/drbd/crm-fence-peer.sh to fence the<br>&gt;&gt; peer node.<br>&gt;&gt;<br>&gt;&gt; There are 3 active interfaces on storage[01]: the common LAN, the STONITH<br>&gt;&gt; network, and the DRBD replication link. The storagequorum node only has the<br>&gt;&gt; common LAN and STONITH networks. 
When looking through the logs, note that<br>&gt;&gt; the IP addresses for each node are assigned as follows:<br>&gt;&gt;<br>&gt;&gt; storage0: xxx.xxx.xxx.148<br>&gt;&gt; storage1: xxx.xxx.xxx.149<br>&gt;&gt; storagequorum: xxx.xxx.xxx.24<br>&gt;&gt;<br>&gt;&gt; Storage0 and storage1 also had a secondary link to the common LAN which<br>&gt;&gt; has now been disabled (xxx.xxx.xxx.162 and xxx.xxx.xxx.163 respectively).<br>&gt;&gt; You still may see it show up in the log, e.g.<br>&gt;&gt; Oct &nbsp;5 22:17:39 storagequorum crmd: [7873]: info: crm_update_peer: Node<br>&gt;&gt; storage1: id=587281418 state=lost addr=r(0) ip(10.10.1.163) r(1)<br>&gt;&gt; ip(192.168.7.149) &nbsp;votes=1 born=1828352 seen=1828368<br>&gt;&gt; proc=00000000000000000000000000111312 (new)<br>&gt;&gt;<br>&gt;&gt; Here is the CIB configuration:<br>&gt;&gt; http://pastebin.com/6TPkWtbt<br>&gt;&gt;<br>&gt;&gt; As you can see, the drbd-fence-by-handler-ms_drbd_drives primitive keeps<br>&gt;&gt; getting added into the configuration but doesn't seem to get removed.<br>&gt;&gt;<br>&gt;&gt;<br>&gt;&gt; I recently tried running a failover test by performing "crm resource<br>&gt;&gt; migrate g_store" when the resources were running on storage1. The<br>&gt;&gt; ocf:heartbeat:exportfs resources failed to stop due to<br>&gt;&gt; wait_for_leasetime_on_stop being true (I am going to set this to false now<br>&gt;&gt; because I don't need NFSv4 support). Recognizing this problem, the cluster<br>&gt;&gt; correctly STONITHed storage1 and migrated the resources to storage0.<br>&gt;&gt; However, once storage1 finished rebooting, it was unable to join the cluster<br>&gt;&gt; (crm_mon shows it as [offline]). I have uploaded the syslog from the DC<br>&gt;&gt; (storagequorum) from this time period here:<br>&gt;&gt; http://sources.xes-inc.com/downloads/storagequorum.syslog.log . 
Initially<br>&gt;&gt; after the STONITH it seems like storage1 rejoins the cluster successfully:<br>&gt;&gt; Oct &nbsp;5 22:17:39 storagequorum cib: [7869]: info: crm_update_peer: Node<br>&gt;&gt; storage1: id=352400394 state=member (new) addr=r(0) ip(10.10.1.149) r(1)<br>&gt;&gt; ip(192.168.7.149) &nbsp;(new) votes=1 born=1828384 seen=1828384<br>&gt;&gt; proc=00000000000000000000000000111312<br>&gt;&gt;<br>&gt;&gt; However, later it becomes apparent that it cannot join:<br>&gt;&gt; Oct &nbsp;5 22:17:58 storagequorum crmd: [7873]: notice:<br>&gt;&gt; do_election_count_vote: Election 15 (current: 15, owner: storagequorum):<br>&gt;&gt; Processed no-vote from storage1 (Peer is not part of our cluster)<br>&gt;&gt; ....<br>&gt;&gt; Oct &nbsp;6 03:49:58 storagequorum crmd: [18566]: notice:<br>&gt;&gt; do_election_count_vote: Election 989 (current: 1, owner: storage1):<br>&gt;&gt; Processed vote from storage1 (Peer is not part of our cluster)<br>&gt;&gt;<br>&gt;&gt; Around 1000 election cycles occur before storage1 is brought back into the<br>&gt;&gt; cluster. What is the cause of this and how can I modify my cluster<br>&gt;&gt; configuration to have nodes rejoin right away?<br>&gt;<br>&gt; It's not a configuration issue, you're hitting one or more bugs.<br>&gt;<br>&gt; You seem to be using 1.1.6, can I suggest an upgrade to 1.1.8? 
&nbsp;I<br>&gt; recall fixing related issues in the last month or so.<br>&gt; Also consider an updated corosync, there were some related fixes there too.<br>&gt;<br>&gt;&gt;<br>&gt;&gt;<br>&gt;&gt; Thanks,<br>&gt;&gt;<br>&gt;&gt; Andrew Martin<br>&gt;&gt;<br>&gt;&gt;<br>&gt;&gt; _______________________________________________<br>&gt;&gt; Pacemaker mailing list: Pacemaker@oss.clusterlabs.org<br>&gt;&gt; http://oss.clusterlabs.org/mailman/listinfo/pacemaker<br>&gt;&gt;<br>&gt;&gt; Project Home: http://www.clusterlabs.org<br>&gt;&gt; Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf<br>&gt;&gt; Bugs: http://bugs.clusterlabs.org<br></div><br></div></body></html>