[Pacemaker] STONITHed node cannot rejoin cluster for over 1000 elections
Andrew Beekhof
andrew at beekhof.net
Mon Oct 15 10:31:51 UTC 2012
On Sat, Oct 13, 2012 at 1:53 AM, Andrew Martin <amartin at xes-inc.com> wrote:
> Hi Andrew,
>
> Thanks, I'll compile Pacemaker 1.1.8 and Corosync 1.4.4. Can I leave
> cluster-glue and resource-agents at the default versions provided with
> Ubuntu 12.04 (1.0.8 and 3.9.2 respectively), or do I need to upgrade them as
> well?
Should be fine. You will need to obtain a recent libqb build though.
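
Roughly, the build order would be as follows (the repo locations and the
1.1.8 tag name here are from memory, so double-check them before relying
on this):

  # build and install libqb first - Pacemaker 1.1.8 links against it
  git clone https://github.com/ClusterLabs/libqb.git
  cd libqb && ./autogen.sh && ./configure && make && sudo make install

  # then build Pacemaker 1.1.8 itself
  git clone https://github.com/ClusterLabs/pacemaker.git
  cd pacemaker && git checkout Pacemaker-1.1.8
  ./autogen.sh && ./configure && make && sudo make install

  # refresh the linker cache so the daemons pick up the new libqb
  sudo ldconfig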
>
> Andrew
>
> ________________________________
> From: "Andrew Beekhof" <andrew at beekhof.net>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Thursday, October 11, 2012 8:08:13 PM
> Subject: Re: [Pacemaker] STONITHed node cannot rejoin cluster for over 1000
> elections
>
>
> On Fri, Oct 12, 2012 at 7:12 AM, Andrew Martin <amartin at xes-inc.com> wrote:
>> Hello,
>>
>>
>> I am running a 3-node Corosync+Pacemaker cluster with 2 "real" nodes
>> running resources (storage0 and storage1) and a quorum node (storagequorum)
>> in standby mode. All of the nodes run Ubuntu 12.04 server amd64. There are
>> two corosync rings:
>> rrp_mode: active
>>
>> interface {
>>         # the common LAN
>>         ringnumber: 0
>>         bindnetaddr: 10.10.0.0
>>         mcastaddr: 226.94.1.1
>>         mcastport: 5405
>> }
>>
>> interface {
>>         # the STONITH network
>>         ringnumber: 1
>>         bindnetaddr: 192.168.7.0
>>         mcastaddr: 226.94.1.2
>>         mcastport: 5407
>> }
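>>
>> (For reference, ring health can be checked per node with corosync-cfgtool
>> and the membership with crm_mon; these are generic checks rather than
>> output captured from these hosts:)
>>
>> # both rings should report "no faults" on every node
>> corosync-cfgtool -s
>> # overall membership and node state as Pacemaker sees it
>> crm_mon -1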
>>
>> DRBD is configured to use /usr/lib/drbd/crm-fence-peer.sh to fence the
>> peer node.
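>>
>> (The usual drbd.conf pairing for that handler looks roughly like the
>> following; whether it matches this setup exactly is an assumption, but the
>> after-resync-target script is what is supposed to remove the fencing
>> constraint again once the peer has caught up:)
>>
>> disk {
>>         fencing resource-only;
>> }
>> handlers {
>>         fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>>         after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>> }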
>>
>> There are 3 active interfaces on storage[01]: the common LAN, the STONITH
>> network, and the DRBD replication link. The storagequorum node only has the
>> common LAN and STONITH networks. When looking through the logs, note that
>> the IP addresses for each node are assigned as follows:
>>
>> storage0: xxx.xxx.xxx.148
>> storage1: xxx.xxx.xxx.149
>> storagequorum: xxx.xxx.xxx.24
>>
>> storage0 and storage1 also had a secondary link to the common LAN, which
>> has now been disabled (xxx.xxx.xxx.162 and xxx.xxx.xxx.163 respectively).
>> You may still see it show up in the logs, e.g.
>> Oct 5 22:17:39 storagequorum crmd: [7873]: info: crm_update_peer: Node
>> storage1: id=587281418 state=lost addr=r(0) ip(10.10.1.163) r(1)
>> ip(192.168.7.149) votes=1 born=1828352 seen=1828368
>> proc=00000000000000000000000000111312 (new)
>>
>> Here is the CIB configuration:
>> http://pastebin.com/6TPkWtbt
>>
>> As you can see, the drbd-fence-by-handler-ms_drbd_drives location constraint
>> keeps getting added to the configuration but doesn't seem to get removed.
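>>
>> (If it is simply left over after a resync, something along these lines
>> should clear it; deleting the constraint by id with crm is the generic
>> approach, and the id below is the one from the CIB above:)
>>
>> # list any leftover DRBD fencing constraints
>> crm configure show | grep drbd-fence-by-handler
>> # remove the stale constraint once DRBD has finished resyncing
>> crm configure delete drbd-fence-by-handler-ms_drbd_drives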
>>
>>
>> I recently tried running a failover test by performing "crm resource
>> migrate g_store" when the resources were running on storage1. The
>> ocf:heartbeat:exportfs resources failed to stop due to
>> wait_for_leasetime_on_stop being true (I am going to set this to false now
>> because I don't need NFSv4 support). Recognizing this problem, the cluster
>> correctly STONITHed storage1 and migrated the resources to storage0.
>> However, once storage1 finished rebooting, it was unable to join the cluster
>> (crm_mon shows it as [offline]). I have uploaded the syslog from the DC
>> (storagequorum) from this time period here:
>> http://sources.xes-inc.com/downloads/storagequorum.syslog.log . Initially
>> after the STONITH it seems like storage1 rejoins the cluster successfully:
>> Oct 5 22:17:39 storagequorum cib: [7869]: info: crm_update_peer: Node
>> storage1: id=352400394 state=member (new) addr=r(0) ip(10.10.1.149) r(1)
>> ip(192.168.7.149) (new) votes=1 born=1828384 seen=1828384
>> proc=00000000000000000000000000111312
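>>
>> (Regarding the wait_for_leasetime_on_stop change mentioned above, the plan
>> is to flip it with crm along these lines; p_exportfs is a placeholder for
>> the actual exportfs primitive name in the pastebin config:)
>>
>> # stop waiting out the NFSv4 lease time on stop
>> crm resource param p_exportfs set wait_for_leasetime_on_stop false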
>>
>> However, later it becomes apparent that it cannot join:
>> Oct 5 22:17:58 storagequorum crmd: [7873]: notice:
>> do_election_count_vote: Election 15 (current: 15, owner: storagequorum):
>> Processed no-vote from storage1 (Peer is not part of our cluster)
>> ....
>> Oct 6 03:49:58 storagequorum crmd: [18566]: notice:
>> do_election_count_vote: Election 989 (current: 1, owner: storage1):
>> Processed vote from storage1 (Peer is not part of our cluster)
>>
>> Around 1000 election cycles occur before storage1 is brought back into the
>> cluster. What is the cause of this and how can I modify my cluster
>> configuration to have nodes rejoin right away?
>
> It's not a configuration issue; you're hitting one or more bugs.
>
> You seem to be using 1.1.6; can I suggest an upgrade to 1.1.8? I
> recall fixing related issues in the last month or so.
> Also consider an updated corosync; there were some related fixes there too.
>
>>
>>
>> Thanks,
>>
>> Andrew Martin
>>
>>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>