[Pacemaker] STONITHed node cannot rejoin cluster for over 1000 elections

Thu Oct 11 16:12:02 EDT 2012

Hello, 

I am running a 3-node Corosync+Pacemaker cluster with 2 "real" nodes running resources (storage0 and storage1) and a quorum node (storagequorum) in standby mode. All of the nodes run Ubuntu 12.04 server amd64. There are two corosync rings:
        rrp_mode: active

        interface {
                # the common LAN
                ringnumber: 0
                bindnetaddr: 10.10.0.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }

        interface {
                # the STONITH network
                ringnumber: 1
                bindnetaddr: 192.168.7.0
                mcastaddr: 226.94.1.2
                mcastport: 5407
        }

DRBD is configured to use /usr/lib/drbd/crm-fence-peer.sh to fence the peer node.

There are 3 active interfaces on storage[01]: the common LAN, the STONITH network, and the DRBD replication link. The storagequorum node only has the common LAN and STONITH networks. When looking through the logs, note that the IP addresses for each node are assigned as follows:

storage0: xxx.xxx.xxx.148
storage1: xxx.xxx.xxx.149
storagequorum: xxx.xxx.xxx.24

Storage0 and storage1 also had a secondary link to the common LAN, which has now been disabled (xxx.xxx.xxx.162 and xxx.xxx.xxx.163 respectively). You may still see these addresses show up in the log, e.g.:
Oct  5 22:17:39 storagequorum crmd: [7873]: info: crm_update_peer: Node storage1: id=587281418 state=lost addr=r(0) ip(10.10.1.163) r(1) ip(192.168.7.149)  votes=1 born=1828352 seen=1828368 proc=00000000000000000000000000111312 (new)

Here is the CIB configuration: 
http://pastebin.com/6TPkWtbt

As you can see, the drbd-fence-by-handler-ms_drbd_drives constraint keeps getting added to the configuration but does not seem to get removed.
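
A minimal sketch for listing such leftover entries from a CIB dump, in case it helps anyone checking the same thing. It assumes the entries are stored as rsc_location elements whose id starts with "drbd-fence-by-handler-", which is how crm-fence-peer.sh records them as far as I know. The SAMPLE_CIB fragment is a hypothetical illustration, not a copy of my actual configuration.

```python
import xml.etree.ElementTree as ET

# Assumed id prefix, taken from the entry name mentioned above.
PREFIX = "drbd-fence-by-handler-"

def leftover_fence_constraints(cib_xml):
    """Return (id, rsc) pairs for rsc_location elements with the fence prefix."""
    root = ET.fromstring(cib_xml)
    return [
        (el.get("id"), el.get("rsc"))
        for el in root.iter("rsc_location")
        if el.get("id", "").startswith(PREFIX)
    ]

# Hypothetical, heavily trimmed CIB fragment for illustration only.
SAMPLE_CIB = """
<cib>
  <configuration>
    <constraints>
      <rsc_location id="drbd-fence-by-handler-ms_drbd_drives" rsc="ms_drbd_drives"/>
      <rsc_location id="some-other-constraint" rsc="g_store"/>
    </constraints>
  </configuration>
</cib>
"""

if __name__ == "__main__":
    for constraint_id, rsc in leftover_fence_constraints(SAMPLE_CIB):
        print(constraint_id, rsc)
```

On a live cluster the XML could be taken from the output of "cibadmin -Q" instead of the sample string.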

I recently ran a failover test by executing "crm resource migrate g_store" while the resources were running on storage1. The ocf:heartbeat:exportfs resources failed to stop because wait_for_leasetime_on_stop was true (I am going to set this to false now because I don't need NFSv4 support). Recognizing this problem, the cluster correctly STONITHed storage1 and migrated the resources to storage0. However, once storage1 finished rebooting, it was unable to join the cluster (crm_mon shows it as [offline]).

I have uploaded the syslog from the DC (storagequorum) for this time period here: http://sources.xes-inc.com/downloads/storagequorum.syslog.log

Initially after the STONITH, storage1 appears to rejoin the cluster successfully:
Oct  5 22:17:39 storagequorum cib: [7869]: info: crm_update_peer: Node storage1: id=352400394 state=member (new) addr=r(0) ip(10.10.1.149) r(1) ip(192.168.7.149)  (new) votes=1 born=1828384 seen=1828384 proc=00000000000000000000000000111312

However, later it becomes apparent that it cannot join:
Oct  5 22:17:58 storagequorum crmd: [7873]: notice: do_election_count_vote: Election 15 (current: 15, owner: storagequorum): Processed no-vote from storage1 (Peer is not part of our cluster)
....
Oct  6 03:49:58 storagequorum crmd: [18566]: notice: do_election_count_vote: Election 989 (current: 1, owner: storage1): Processed vote from storage1 (Peer is not part of our cluster)
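
For anyone going through the uploaded syslog, here is a rough sketch of how the rejected election messages can be counted. The regular expression is inferred only from the two crmd lines quoted above, so other message variants in the log may not match. The function names are my own and are not part of any Pacemaker tool.

```python
import re

# Field layout inferred from the two sample lines only (an assumption).
ELECTION_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+crmd:.*"
    r"do_election_count_vote: Election (?P<election>\d+) "
    r"\(current: (?P<current>\d+), owner: (?P<owner>[^)]+)\): "
    r"Processed (?P<kind>no-vote|vote) from (?P<peer>\S+) \((?P<reason>[^)]*)\)"
)

REJECTED = "Peer is not part of our cluster"

def parse_elections(lines):
    """Return one dict per election line found in an iterable of log lines."""
    events = []
    for line in lines:
        match = ELECTION_RE.search(line)
        if match:
            event = match.groupdict()
            event["election"] = int(event["election"])
            event["current"] = int(event["current"])
            events.append(event)
    return events

def summarize(events, peer):
    """Summarize election messages from one peer that were rejected."""
    rejected = [e for e in events if e["peer"] == peer and e["reason"] == REJECTED]
    if not rejected:
        return None
    return {
        "count": len(rejected),
        "first": rejected[0]["stamp"],
        "last": rejected[-1]["stamp"],
        "max_election": max(e["election"] for e in rejected),
    }

# The two log lines quoted above, used as sample input.
SAMPLE = [
    "Oct  5 22:17:58 storagequorum crmd: [7873]: notice: do_election_count_vote: "
    "Election 15 (current: 15, owner: storagequorum): Processed no-vote from "
    "storage1 (Peer is not part of our cluster)",
    "Oct  6 03:49:58 storagequorum crmd: [18566]: notice: do_election_count_vote: "
    "Election 989 (current: 1, owner: storage1): Processed vote from "
    "storage1 (Peer is not part of our cluster)",
]

if __name__ == "__main__":
    print(summarize(parse_elections(SAMPLE), "storage1"))
```

To run it against the full log, pass an open file handle for storagequorum.syslog.log to parse_elections() in place of SAMPLE.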

Around 1000 election cycles occur (over roughly five and a half hours, judging by the timestamps above) before storage1 is brought back into the cluster. What is the cause of this, and how can I modify my cluster configuration so that nodes rejoin right away?

Thanks,

Andrew Martin