[Pacemaker] Corosync won't recover when a node fails

David Lang david at lang.hm
Wed Sep 25 10:08:12 UTC 2013


the cluster is trying to reach quorum (a majority of the nodes talking to 
each other), and that is never going to happen with only one node left, so you 
have to disable that requirement.

try putting
<cman two_node="1" expected_votes="1" transport="udpu"/>
in your cluster.conf
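
if you're not using cman at all (your corosync.conf below suggests plain 
corosync with the pacemaker plugin), the rough equivalent -- untested here -- 
is to tell pacemaker itself to keep running without quorum, assuming the crm 
shell is available:

crm configure property no-quorum-policy=ignore

ignoring quorum on a two-node cluster makes working fencing/STONITH all the 
more important, so make sure that is in place as well.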

David Lang

  On Tue, 24 Sep 2013, David Parker wrote:

> Date: Tue, 24 Sep 2013 11:48:59 -0400
> From: David Parker <dparker at utica.edu>
> Reply-To: The Pacemaker cluster resource manager
>     <pacemaker at oss.clusterlabs.org>
> To: The Pacemaker cluster resource manager <pacemaker at oss.clusterlabs.org>
> Subject: Re: [Pacemaker] Corosync won't recover when a node fails
> 
> I forgot to mention, OS is Debian Wheezy 64-bit, Corosync and Pacemaker
> installed from packages via apt-get, and there are no local firewall rules
> in place:
>
> # iptables -L
> Chain INPUT (policy ACCEPT)
> target     prot opt source               destination
>
> Chain FORWARD (policy ACCEPT)
> target     prot opt source               destination
>
> Chain OUTPUT (policy ACCEPT)
> target     prot opt source               destination
>
>
> On Tue, Sep 24, 2013 at 11:41 AM, David Parker <dparker at utica.edu> wrote:
>
>> Hello,
>>
>> I have a 2-node cluster using Corosync and Pacemaker, where the nodes are
>> actually two VirtualBox VMs on the same physical machine.  I have some
>> resources set up in Pacemaker, and everything works fine if I move them in
>> a controlled way with the "crm_resource -r <resource> --move --node <node>"
>> command.
>>
>> However, when I hard-fail one of the nodes via the "poweroff" command in
>> VirtualBox, which "pulls the plug" on the VM, the resources do not move,
>> and I see the following output in the log on the remaining node:
>>
>> Sep 24 11:20:30 corosync [TOTEM ] The token was lost in the OPERATIONAL
>> state.
>> Sep 24 11:20:30 corosync [TOTEM ] A processor failed, forming new
>> configuration.
>> Sep 24 11:20:30 corosync [TOTEM ] entering GATHER state from 2.
>> Sep 24 11:20:31 test-vm-2 lrmd: [2503]: debug: rsc:drbd_r0:0 monitor[31]
>> (pid 8495)
>> drbd[8495]:     2013/09/24_11:20:31 WARNING: This resource agent is
>> deprecated and may be removed in a future release. See the man page for
>> details. To suppress this warning, set the "ignore_deprecation" resource
>> parameter to true.
>> drbd[8495]:     2013/09/24_11:20:31 WARNING: This resource agent is
>> deprecated and may be removed in a future release. See the man page for
>> details. To suppress this warning, set the "ignore_deprecation" resource
>> parameter to true.
>> drbd[8495]:     2013/09/24_11:20:31 DEBUG: r0: Calling drbdadm -c
>> /etc/drbd.conf role r0
>> drbd[8495]:     2013/09/24_11:20:31 DEBUG: r0: Exit code 0
>> drbd[8495]:     2013/09/24_11:20:31 DEBUG: r0: Command output:
>> Secondary/Primary
>> drbd[8495]:     2013/09/24_11:20:31 DEBUG: r0: Calling drbdadm -c
>> /etc/drbd.conf cstate r0
>> drbd[8495]:     2013/09/24_11:20:31 DEBUG: r0: Exit code 0
>> drbd[8495]:     2013/09/24_11:20:31 DEBUG: r0: Command output: Connected
>> drbd[8495]:     2013/09/24_11:20:31 DEBUG: r0 status: Secondary/Primary
>> Secondary Primary Connected
>> Sep 24 11:20:31 test-vm-2 lrmd: [2503]: info: operation monitor[31] on
>> drbd_r0:0 for client 2506: pid 8495 exited with return code 0
>> Sep 24 11:20:32 corosync [TOTEM ] entering GATHER state from 0.
>> Sep 24 11:20:34 corosync [TOTEM ] The consensus timeout expired.
>> Sep 24 11:20:34 corosync [TOTEM ] entering GATHER state from 3.
>> Sep 24 11:20:36 corosync [TOTEM ] The consensus timeout expired.
>> Sep 24 11:20:36 corosync [TOTEM ] entering GATHER state from 3.
>> Sep 24 11:20:38 corosync [TOTEM ] The consensus timeout expired.
>> Sep 24 11:20:38 corosync [TOTEM ] entering GATHER state from 3.
>> Sep 24 11:20:40 corosync [TOTEM ] The consensus timeout expired.
>> Sep 24 11:20:40 corosync [TOTEM ] entering GATHER state from 3.
>> Sep 24 11:20:40 corosync [TOTEM ] Totem is unable to form a cluster
>> because of an operating system or network fault. The most common cause of
>> this message is that the local firewall is configured improperly.
>> Sep 24 11:20:43 corosync [TOTEM ] The consensus timeout expired.
>> Sep 24 11:20:43 corosync [TOTEM ] entering GATHER state from 3.
>> Sep 24 11:20:43 corosync [TOTEM ] Totem is unable to form a cluster
>> because of an operating system or network fault. The most common cause of
>> this message is that the local firewall is configured improperly.
>> Sep 24 11:20:45 corosync [TOTEM ] The consensus timeout expired.
>> Sep 24 11:20:45 corosync [TOTEM ] entering GATHER state from 3.
>> Sep 24 11:20:45 corosync [TOTEM ] Totem is unable to form a cluster
>> because of an operating system or network fault. The most common cause of
>> this message is that the local firewall is configured improperly.
>> Sep 24 11:20:47 corosync [TOTEM ] The consensus timeout expired.
>>
>> Those last 3 messages just repeat over and over, the cluster never
>> recovers, and the resources never move.  "crm_mon" reports that the
>> resources are still running on the dead node, and shows no indication that
>> anything has gone wrong.
>>
>> Does anyone know what the issue could be?  My expectation was that the
>> remaining node would become the sole member of the cluster, take over the
>> resources, and everything would keep running.
>>
>> For reference, my corosync.conf file is below:
>>
>> compatibility: whitetank
>>
>> totem {
>>         version: 2
>>         secauth: off
>>         interface {
>>                 member {
>>                         memberaddr: 192.168.25.201
>>                 }
>>                 member {
>>                         memberaddr: 192.168.25.202
>>                  }
>>                 ringnumber: 0
>>                 bindnetaddr: 192.168.25.0
>>                 mcastport: 5405
>>         }
>>         transport: udpu
>> }
>>
>> logging {
>>         fileline: off
>>         to_logfile: yes
>>         to_syslog: yes
>>         debug: on
>>         logfile: /var/log/cluster/corosync.log
>>         timestamp: on
>>         logger_subsys {
>>                 subsys: AMF
>>                 debug: on
>>         }
>> }
>>
>>
>> Thanks!
>> Dave
>>
>> --
>> Dave Parker
>> Systems Administrator
>> Utica College
>> Integrated Information Technology Services
>> (315) 792-3229
>> Registered Linux User #408177
>>
>
>
>
>
_______________________________________________
Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
