[Pacemaker] Corosync won't recover when a node fails
Larry Brigman
larry.brigman at gmail.com
Tue Oct 1 19:16:22 UTC 2013
pcs is another package you will need to install.
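
As far as I know, the crm shell is in the same boat: it was split out of
Pacemaker after 1.1.7 into the separate crmsh project, so a source build of
current Pacemaker ships neither crm nor pcs. A rough sketch for pulling in
pcs (the GitHub location reflects where the project lives these days; check
its README for the actual build and install steps):

    git clone https://github.com/ClusterLabs/pcs.git
    cd pcs     # then follow the README/INSTALL instructions in the checkout
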
On Oct 1, 2013 9:04 AM, "David Parker" <dparker at utica.edu> wrote:
> Hello,
>
> Sorry for the delay in my reply. I've been doing a lot of
> experimentation, but so far I've had no luck.
>
> Thanks for the suggestion, but it seems I'm not able to use CMAN. I'm
> running Debian Wheezy with Corosync and Pacemaker installed via apt-get.
> When I installed CMAN and set up a cluster.conf file, Pacemaker refused to
> start and said that CMAN was not supported. When CMAN is not installed,
> Pacemaker starts up fine, but I see these lines in the log:
>
> Sep 30 23:36:29 test-vm-1 crmd: [6941]: ERROR: init_quorum_connection: The
> Corosync quorum API is not supported in this build
> Sep 30 23:36:29 test-vm-1 pacemakerd: [6932]: ERROR: pcmk_child_exit:
> Child process crmd exited (pid=6941, rc=100)
> Sep 30 23:36:29 test-vm-1 pacemakerd: [6932]: WARN: pcmk_child_exit:
> Pacemaker child process crmd no longer wishes to be respawned. Shutting
> ourselves down.
>
> So, then I checked to see which plugins are supported:
>
> # pacemakerd -F
> Pacemaker 1.1.7 (Build: ee0730e13d124c3d58f00016c3376a1de5323cff)
> Supporting: generated-manpages agent-manpages ncurses heartbeat
> corosync-plugin snmp libesmtp
>
> Am I correct in believing that this Pacemaker package has been compiled
> without support for any quorum API? If so, does anyone know if there is a
> Debian package which has the correct support?
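>
> A rough sketch of how to check what the packaged stack actually ships
> (the library names here are from memory, so treat them as a guess):
>
> dpkg -l | grep -E 'pacemaker|corosync'
> ls /usr/lib/libquorum* /usr/lib/*/libquorum* \
>    /usr/lib/libvotequorum* /usr/lib/*/libvotequorum* 2>/dev/null
>
> That at least shows which corosync quorum libraries, if any, the packaged
> corosync provides.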
>
> I also tried compiling LibQB, Corosync and Pacemaker from source via git,
> following the instructions documented here:
>
> http://clusterlabs.org/wiki/SourceInstall
>
> I was hopeful that this would work, because as I understand it, Corosync
> 2.x no longer uses CMAN. Everything compiled and started fine, but the
> compiled version of Pacemaker did not include either the 'crm' or 'pcs'
> commands. Do I need to install something else in order to get one of these?
>
> Any and all help is greatly appreciated!
>
> Thanks,
> Dave
>
>
> On Wed, Sep 25, 2013 at 6:08 AM, David Lang <david at lang.hm> wrote:
>
>> The cluster is trying to reach quorum (a majority of the nodes talking
>> to each other), and that is never going to happen with only one node,
>> so you have to disable this behavior.
>>
>> try putting
>> <cman two_node="1" expected_votes="1" transport="udpu"/>
>> in your cluster.conf
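>>
>> On the Pacemaker side, the usual two-node workaround is to tell it to keep
>> running resources even without quorum. A sketch using the standard crm
>> shell properties (leave stonith enabled if you actually have fencing
>> configured):
>>
>> crm configure property no-quorum-policy=ignore
>> crm configure property stonith-enabled=false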
>>
>> David Lang
>>
>> On Tue, 24 Sep 2013, David Parker wrote:
>>
>>> Date: Tue, 24 Sep 2013 11:48:59 -0400
>>> From: David Parker <dparker at utica.edu>
>>> Reply-To: The Pacemaker cluster resource manager
>>> <pacemaker at oss.clusterlabs.org>
>>> To: The Pacemaker cluster resource manager
>>> <pacemaker at oss.clusterlabs.org>
>>> Subject: Re: [Pacemaker] Corosync won't recover when a node fails
>>>
>>>
>>> I forgot to mention: the OS is Debian Wheezy 64-bit, Corosync and
>>> Pacemaker are installed from packages via apt-get, and there are no
>>> local firewall rules in place:
>>>
>>> # iptables -L
>>> Chain INPUT (policy ACCEPT)
>>> target prot opt source destination
>>>
>>> Chain FORWARD (policy ACCEPT)
>>> target prot opt source destination
>>>
>>> Chain OUTPUT (policy ACCEPT)
>>> target prot opt source destination
>>>
>>>
>>> On Tue, Sep 24, 2013 at 11:41 AM, David Parker <dparker at utica.edu>
>>> wrote:
>>>
>>> Hello,
>>>>
>>>> I have a 2-node cluster using Corosync and Pacemaker, where the nodes
>>>> are actually two VirtualBox VMs on the same physical machine. I have
>>>> some resources set up in Pacemaker, and everything works fine if I move
>>>> them in a controlled way with the "crm_resource -r <resource> --move
>>>> --node <node>" command.
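>>>>
>>>> A side note on that: a manual --move normally leaves a location
>>>> constraint behind, so something along these lines is needed afterwards
>>>> to let the resource move back freely (the exact option name varies
>>>> between releases; see crm_resource --help):
>>>>
>>>> crm_resource -r <resource> --un-move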
>>>>
>>>> However, when I hard-fail one of the nodes via the "poweroff" command
>>>> in VirtualBox, which "pulls the plug" on the VM, the resources do not
>>>> move, and I see the following output in the log on the remaining node:
>>>>
>>>> Sep 24 11:20:30 corosync [TOTEM ] The token was lost in the OPERATIONAL
>>>> state.
>>>> Sep 24 11:20:30 corosync [TOTEM ] A processor failed, forming new
>>>> configuration.
>>>> Sep 24 11:20:30 corosync [TOTEM ] entering GATHER state from 2.
>>>> Sep 24 11:20:31 test-vm-2 lrmd: [2503]: debug: rsc:drbd_r0:0 monitor[31]
>>>> (pid 8495)
>>>> drbd[8495]: 2013/09/24_11:20:31 WARNING: This resource agent is
>>>> deprecated and may be removed in a future release. See the man page for
>>>> details. To suppress this warning, set the "ignore_deprecation" resource
>>>> parameter to true.
>>>> drbd[8495]: 2013/09/24_11:20:31 WARNING: This resource agent is
>>>> deprecated and may be removed in a future release. See the man page for
>>>> details. To suppress this warning, set the "ignore_deprecation" resource
>>>> parameter to true.
>>>> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Calling drbdadm -c
>>>> /etc/drbd.conf role r0
>>>> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Exit code 0
>>>> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Command output:
>>>> Secondary/Primary
>>>> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Calling drbdadm -c
>>>> /etc/drbd.conf cstate r0
>>>> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Exit code 0
>>>> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Command output: Connected
>>>> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0 status: Secondary/Primary
>>>> Secondary Primary Connected
>>>> Sep 24 11:20:31 test-vm-2 lrmd: [2503]: info: operation monitor[31] on
>>>> drbd_r0:0 for client 2506: pid 8495 exited with return code 0
>>>> Sep 24 11:20:32 corosync [TOTEM ] entering GATHER state from 0.
>>>> Sep 24 11:20:34 corosync [TOTEM ] The consensus timeout expired.
>>>> Sep 24 11:20:34 corosync [TOTEM ] entering GATHER state from 3.
>>>> Sep 24 11:20:36 corosync [TOTEM ] The consensus timeout expired.
>>>> Sep 24 11:20:36 corosync [TOTEM ] entering GATHER state from 3.
>>>> Sep 24 11:20:38 corosync [TOTEM ] The consensus timeout expired.
>>>> Sep 24 11:20:38 corosync [TOTEM ] entering GATHER state from 3.
>>>> Sep 24 11:20:40 corosync [TOTEM ] The consensus timeout expired.
>>>> Sep 24 11:20:40 corosync [TOTEM ] entering GATHER state from 3.
>>>> Sep 24 11:20:40 corosync [TOTEM ] Totem is unable to form a cluster
>>>> because of an operating system or network fault. The most common cause of
>>>> this message is that the local firewall is configured improperly.
>>>> Sep 24 11:20:43 corosync [TOTEM ] The consensus timeout expired.
>>>> Sep 24 11:20:43 corosync [TOTEM ] entering GATHER state from 3.
>>>> Sep 24 11:20:43 corosync [TOTEM ] Totem is unable to form a cluster
>>>> because of an operating system or network fault. The most common cause of
>>>> this message is that the local firewall is configured improperly.
>>>> Sep 24 11:20:45 corosync [TOTEM ] The consensus timeout expired.
>>>> Sep 24 11:20:45 corosync [TOTEM ] entering GATHER state from 3.
>>>> Sep 24 11:20:45 corosync [TOTEM ] Totem is unable to form a cluster
>>>> because of an operating system or network fault. The most common cause of
>>>> this message is that the local firewall is configured improperly.
>>>> Sep 24 11:20:47 corosync [TOTEM ] The consensus timeout expired.
>>>>
>>>> Those last 3 messages just repeat over and over, the cluster never
>>>> recovers, and the resources never move. "crm_mon" reports that the
>>>> resources are still running on the dead node, and shows no indication
>>>> that anything has gone wrong.
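>>>>
>>>> A rough sketch of the commands for comparing what corosync and Pacemaker
>>>> each think the membership is (the objctl dump is large, hence the grep;
>>>> exact key names may differ on this corosync version):
>>>>
>>>> corosync-cfgtool -s                # ring status as corosync sees it
>>>> corosync-objctl | grep -i member   # runtime membership entries
>>>> crm_mon -1                         # one-shot view of Pacemaker's opinion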
>>>>
>>>> Does anyone know what the issue could be? My expectation was that the
>>>> remaining node would become the sole member of the cluster, take over
>>>> the resources, and everything would keep running.
>>>>
>>>> For reference, my corosync.conf file is below:
>>>>
>>>> compatibility: whitetank
>>>>
>>>> totem {
>>>>     version: 2
>>>>     secauth: off
>>>>     interface {
>>>>         member {
>>>>             memberaddr: 192.168.25.201
>>>>         }
>>>>         member {
>>>>             memberaddr: 192.168.25.202
>>>>         }
>>>>         ringnumber: 0
>>>>         bindnetaddr: 192.168.25.0
>>>>         mcastport: 5405
>>>>     }
>>>>     transport: udpu
>>>> }
>>>>
>>>> logging {
>>>>     fileline: off
>>>>     to_logfile: yes
>>>>     to_syslog: yes
>>>>     debug: on
>>>>     logfile: /var/log/cluster/corosync.log
>>>>     timestamp: on
>>>>     logger_subsys {
>>>>         subsys: AMF
>>>>         debug: on
>>>>     }
>>>> }
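>>>>
>>>> And in case I do end up on the self-built corosync 2.x stack instead,
>>>> my understanding is that the same cluster would be described with a
>>>> nodelist plus the votequorum provider rather than the member entries
>>>> above. A sketch only, untested:
>>>>
>>>> totem {
>>>>     version: 2
>>>>     secauth: off
>>>>     transport: udpu
>>>>     interface {
>>>>         ringnumber: 0
>>>>         bindnetaddr: 192.168.25.0
>>>>         mcastport: 5405
>>>>     }
>>>> }
>>>>
>>>> nodelist {
>>>>     node {
>>>>         ring0_addr: 192.168.25.201
>>>>         nodeid: 1
>>>>     }
>>>>     node {
>>>>         ring0_addr: 192.168.25.202
>>>>         nodeid: 2
>>>>     }
>>>> }
>>>>
>>>> quorum {
>>>>     provider: corosync_votequorum
>>>>     two_node: 1
>>>> }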
>>>>
>>>>
>>>> Thanks!
>>>> Dave
>>>>
>>>> --
>>>> Dave Parker
>>>> Systems Administrator
>>>> Utica College
>>>> Integrated Information Technology Services
>>>> (315) 792-3229
>>>> Registered Linux User #408177
>>>>
>>>>
>>>
>>>
>>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>>
>
>
> --
> Dave Parker
> Systems Administrator
> Utica College
> Integrated Information Technology Services
> (315) 792-3229
> Registered Linux User #408177
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>