[Pacemaker] Corosync won't recover when a node fails
David Lang
david at lang.hm
Wed Sep 25 10:08:12 UTC 2013
The cluster is trying to reach quorum (a majority of the nodes talking to each
other), and that is never going to happen with only one node left, so you have
to disable the quorum requirement for a two-node cluster.
Try putting
<cman two_node="1" expected_votes="1" transport="udpu"/>
in your cluster.conf.
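For reference, a minimal cluster.conf sketch with that line in place (the
cluster name, node names, and nodeids below are placeholders, not taken from
your setup; adjust them to match your environment):

<?xml version="1.0"?>
<cluster name="mycluster" config_version="1">
    <!-- two_node/expected_votes let the surviving node keep quorum
         when its peer dies -->
    <cman two_node="1" expected_votes="1" transport="udpu"/>
    <clusternodes>
        <clusternode name="node1" nodeid="1"/>
        <clusternode name="node2" nodeid="2"/>
    </clusternodes>
</cluster>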
David Lang
On Tue, 24 Sep 2013, David Parker wrote:
> Date: Tue, 24 Sep 2013 11:48:59 -0400
> From: David Parker <dparker at utica.edu>
> Reply-To: The Pacemaker cluster resource manager
> <pacemaker at oss.clusterlabs.org>
> To: The Pacemaker cluster resource manager <pacemaker at oss.clusterlabs.org>
> Subject: Re: [Pacemaker] Corosync won't recover when a node fails
>
> I forgot to mention, OS is Debian Wheezy 64-bit, Corosync and Pacemaker
> installed from packages via apt-get, and there are no local firewall rules
> in place:
>
> # iptables -L
> Chain INPUT (policy ACCEPT)
> target prot opt source destination
>
> Chain FORWARD (policy ACCEPT)
> target prot opt source destination
>
> Chain OUTPUT (policy ACCEPT)
> target prot opt source destination
>
>
> On Tue, Sep 24, 2013 at 11:41 AM, David Parker <dparker at utica.edu> wrote:
>
>> Hello,
>>
>> I have a 2-node cluster using Corosync and Pacemaker, where the nodes are
>> actually two VirtualBox VMs on the same physical machine. I have some
>> resources set up in Pacemaker, and everything works fine if I move them in
>> a controlled way with the "crm_resource -r <resource> --move --node <node>"
>> command.
>>
>> However, when I hard-fail one of the nodes via the "poweroff" command in
>> VirtualBox, which "pulls the plug" on the VM, the resources do not move,
>> and I see the following output in the log on the remaining node:
>>
>> Sep 24 11:20:30 corosync [TOTEM ] The token was lost in the OPERATIONAL
>> state.
>> Sep 24 11:20:30 corosync [TOTEM ] A processor failed, forming new
>> configuration.
>> Sep 24 11:20:30 corosync [TOTEM ] entering GATHER state from 2.
>> Sep 24 11:20:31 test-vm-2 lrmd: [2503]: debug: rsc:drbd_r0:0 monitor[31]
>> (pid 8495)
>> drbd[8495]: 2013/09/24_11:20:31 WARNING: This resource agent is
>> deprecated and may be removed in a future release. See the man page for
>> details. To suppress this warning, set the "ignore_deprecation" resource
>> parameter to true.
>> drbd[8495]: 2013/09/24_11:20:31 WARNING: This resource agent is
>> deprecated and may be removed in a future release. See the man page for
>> details. To suppress this warning, set the "ignore_deprecation" resource
>> parameter to true.
>> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Calling drbdadm -c
>> /etc/drbd.conf role r0
>> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Exit code 0
>> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Command output:
>> Secondary/Primary
>> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Calling drbdadm -c
>> /etc/drbd.conf cstate r0
>> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Exit code 0
>> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Command output: Connected
>> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0 status: Secondary/Primary
>> Secondary Primary Connected
>> Sep 24 11:20:31 test-vm-2 lrmd: [2503]: info: operation monitor[31] on
>> drbd_r0:0 for client 2506: pid 8495 exited with return code 0
>> Sep 24 11:20:32 corosync [TOTEM ] entering GATHER state from 0.
>> Sep 24 11:20:34 corosync [TOTEM ] The consensus timeout expired.
>> Sep 24 11:20:34 corosync [TOTEM ] entering GATHER state from 3.
>> Sep 24 11:20:36 corosync [TOTEM ] The consensus timeout expired.
>> Sep 24 11:20:36 corosync [TOTEM ] entering GATHER state from 3.
>> Sep 24 11:20:38 corosync [TOTEM ] The consensus timeout expired.
>> Sep 24 11:20:38 corosync [TOTEM ] entering GATHER state from 3.
>> Sep 24 11:20:40 corosync [TOTEM ] The consensus timeout expired.
>> Sep 24 11:20:40 corosync [TOTEM ] entering GATHER state from 3.
>> Sep 24 11:20:40 corosync [TOTEM ] Totem is unable to form a cluster
>> because of an operating system or network fault. The most common cause of
>> this message is that the local firewall is configured improperly.
>> Sep 24 11:20:43 corosync [TOTEM ] The consensus timeout expired.
>> Sep 24 11:20:43 corosync [TOTEM ] entering GATHER state from 3.
>> Sep 24 11:20:43 corosync [TOTEM ] Totem is unable to form a cluster
>> because of an operating system or network fault. The most common cause of
>> this message is that the local firewall is configured improperly.
>> Sep 24 11:20:45 corosync [TOTEM ] The consensus timeout expired.
>> Sep 24 11:20:45 corosync [TOTEM ] entering GATHER state from 3.
>> Sep 24 11:20:45 corosync [TOTEM ] Totem is unable to form a cluster
>> because of an operating system or network fault. The most common cause of
>> this message is that the local firewall is configured improperly.
>> Sep 24 11:20:47 corosync [TOTEM ] The consensus timeout expired.
>>
>> Those last 3 messages just repeat over and over, the cluster never
>> recovers, and the resources never move. "crm_mon" reports that the
>> resources are still running on the dead node, and shows no indication that
>> anything has gone wrong.
>>
>> Does anyone know what the issue could be? My expectation was that the
>> remaining node would become the sole member of the cluster, take over the
>> resources, and everything would keep running.
>>
>> For reference, my corosync.conf file is below:
>>
>> compatibility: whitetank
>>
>> totem {
>>     version: 2
>>     secauth: off
>>     interface {
>>         member {
>>             memberaddr: 192.168.25.201
>>         }
>>         member {
>>             memberaddr: 192.168.25.202
>>         }
>>         ringnumber: 0
>>         bindnetaddr: 192.168.25.0
>>         mcastport: 5405
>>     }
>>     transport: udpu
>> }
>>
>> logging {
>>     fileline: off
>>     to_logfile: yes
>>     to_syslog: yes
>>     debug: on
>>     logfile: /var/log/cluster/corosync.log
>>     timestamp: on
>>     logger_subsys {
>>         subsys: AMF
>>         debug: on
>>     }
>> }
>>
>>
>> Thanks!
>> Dave
>>
>> --
>> Dave Parker
>> Systems Administrator
>> Utica College
>> Integrated Information Technology Services
>> (315) 792-3229
>> Registered Linux User #408177
>>
>