[Pacemaker] Corosync won't recover when a node fails
Andreas Kurz
andreas at hastexo.com
Thu Oct 3 21:06:46 UTC 2013
On 2013-10-03 22:12, David Parker wrote:
> Thanks, Andrew. The goal was to use either Pacemaker and Corosync 1.x
> from the Debian packages, or both compiled from source. So, with
> the compiled version, I was hoping to avoid CMAN. However, it seems the
> packaged version of Pacemaker doesn't support CMAN anyway, so it's moot.
>
> I rebuilt my VMs from scratch, re-installed Pacemaker and Corosync from
> the Debian packages, but I'm still having an odd problem. Here is the
> config portion of my CIB:
>
> <crm_config>
>   <cluster_property_set id="cib-bootstrap-options">
>     <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff"/>
>     <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="openais"/>
>     <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="2"/>
>     <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
>     <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
>   </cluster_property_set>
> </crm_config>
>
> I set no-quorum-policy=ignore based on the documentation example for a
> 2-node cluster. But when Pacemaker starts up on the first node, the
> DRBD resource is in slave mode and none of the other resources are
> started (they depend on DRBD being master), and I see these lines in the
> log:
>
> Oct 03 15:29:18 test-vm-1 pengine: [3742]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Oct 03 15:29:18 test-vm-1 pengine: [3742]: notice: LogActions: Start nfs_fs (test-vm-1 - blocked)
> Oct 03 15:29:18 test-vm-1 pengine: [3742]: notice: LogActions: Start nfs_ip (test-vm-1 - blocked)
> Oct 03 15:29:18 test-vm-1 pengine: [3742]: notice: LogActions: Start nfs (test-vm-1 - blocked)
> Oct 03 15:29:18 test-vm-1 pengine: [3742]: notice: LogActions: Start drbd_r0:0 (test-vm-1)
>
> I'm assuming the NFS resources show "blocked" because the resource they
> depend on is not in the correct state.
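>
> (For context, the dependency chain is the usual DRBD pattern: a
> master/slave resource plus colocation and order constraints tying the
> NFS resources to the Master role. A minimal sketch in crm shell syntax,
> with hypothetical names since the full configuration isn't shown here:
>
>     ms ms_drbd_r0 drbd_r0 \
>         meta master-max="1" clone-max="2" notify="true"
>     colocation nfs_fs_on_master inf: nfs_fs ms_drbd_r0:Master
>     order nfs_fs_after_promote inf: ms_drbd_r0:promote nfs_fs:start
>
> If the promote action never gets scheduled, everything tied to the
> Master role stays blocked, which would match the log above.)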
>
> Even when the second node (test-vm-2) comes online, the state of these
> resources does not change. I can shut down and re-start Pacemaker over
> and over again on test-vm-2, but nothing changes. However... and this
> is where it gets weird... if I shut down Pacemaker on test-vm-1, then
> all of the resources immediately fail over to test-vm-2 and start
> correctly. And I see these lines in the log:
>
> Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: stage6: Scheduling Node test-vm-1 for shutdown
> Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: LogActions: Start nfs_fs (test-vm-2)
> Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: LogActions: Start nfs_ip (test-vm-2)
> Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: LogActions: Start nfs (test-vm-2)
> Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: LogActions: Stop drbd_r0:0 (test-vm-1)
> Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: LogActions: Promote drbd_r0:1 (Slave -> Master test-vm-2)
>
> After that, I can generally move the resources back and forth, and even
> fail them over by hard-failing a node, without any problems. The real
> problem, though, is that this behavior isn't consistent. Every once in
> a while, I'll hard-fail a node and the other one will go into this
> "stuck" state where Pacemaker knows it lost a node, but DRBD will stay
> in slave mode and the other resources will never start. It seems to
> happen quite randomly. Then, even if I restart Pacemaker on both nodes,
> or reboot them altogether, I run into the startup issue mentioned
> previously.
>
> Any ideas?
Yes, share your complete resource configuration ;-)
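The output of "crm configure show" (or the raw XML via "cibadmin -Q") is
the easiest thing to paste. For a stuck-slave DRBD setup like this, the
interesting parts are the master/slave resource definition and its
colocation and order constraints.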
Regards,
Andreas
>
> Thanks,
> Dave
>
>
>
> On Wed, Oct 2, 2013 at 1:01 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
>
>
> On 02/10/2013, at 5:24 AM, David Parker <dparker at utica.edu> wrote:
>
> > Thanks, I did a little Googling and found the git repository for pcs.
>
> pcs won't help you rebuild pacemaker with cman support (or corosync
> 2.x support) turned on though.
>
>
> > Is there any way to make a two-node cluster work with the stock
> > Debian packages, though? It seems odd that this would be impossible.
>
> It really depends on how the Debian maintainers built Pacemaker.
> By the sounds of it, it only supports the Pacemaker plugin mode for
> Corosync 1.x.
>
> >
> >
> > On Tue, Oct 1, 2013 at 3:16 PM, Larry Brigman <larry.brigman at gmail.com> wrote:
> > pcs is another package you will need to install.
> >
> > On Oct 1, 2013 9:04 AM, "David Parker" <dparker at utica.edu> wrote:
> > Hello,
> >
> > Sorry for the delay in my reply. I've been doing a lot of
> > experimentation, but so far I've had no luck.
> >
> > Thanks for the suggestion, but it seems I'm not able to use CMAN.
> > I'm running Debian Wheezy with Corosync and Pacemaker installed via
> > apt-get. When I installed CMAN and set up a cluster.conf file,
> > Pacemaker refused to start and said that CMAN was not supported.
> > When CMAN is not installed, Pacemaker starts up fine, but I see
> > these lines in the log:
> >
> > Sep 30 23:36:29 test-vm-1 crmd: [6941]: ERROR: init_quorum_connection: The Corosync quorum API is not supported in this build
> > Sep 30 23:36:29 test-vm-1 pacemakerd: [6932]: ERROR: pcmk_child_exit: Child process crmd exited (pid=6941, rc=100)
> > Sep 30 23:36:29 test-vm-1 pacemakerd: [6932]: WARN: pcmk_child_exit: Pacemaker child process crmd no longer wishes to be respawned. Shutting ourselves down.
> >
> > So, then I checked to see which plugins are supported:
> >
> > # pacemakerd -F
> > Pacemaker 1.1.7 (Build: ee0730e13d124c3d58f00016c3376a1de5323cff)
> > Supporting: generated-manpages agent-manpages ncurses heartbeat corosync-plugin snmp libesmtp
> >
> > Am I correct in believing that this Pacemaker package has been
> > compiled without support for any quorum API? If so, does anyone
> > know if there is a Debian package which has the correct support?
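> >
> > (Reading that feature list: "corosync-plugin" is the only cluster
> > stack shown. As far as I can tell, a build with native quorum support
> > would also list "corosync-native" and/or "cman" here.)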
> >
> > I also tried compiling libqb, Corosync and Pacemaker from source
> > via git, following the instructions documented here:
> >
> > http://clusterlabs.org/wiki/SourceInstall
> >
> > I was hopeful that this would work, because as I understand it,
> > Corosync 2.x no longer needs CMAN. Everything compiled and started
> > fine, but the compiled version of Pacemaker did not include either
> > the 'crm' or 'pcs' commands. Do I need to install something else in
> > order to get one of these?
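> >
> > (From what I can tell, the crm shell and pcs are separate projects
> > and are not built as part of Pacemaker itself. A sketch of one way to
> > get pcs, assuming the git repository I found is the right upstream:
> >
> >     git clone https://github.com/ClusterLabs/pcs.git
> >
> > and then follow its README, since the install steps vary by version.
> > The crm shell would similarly come from its own upstream (crmsh) or a
> > distribution package.)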
> >
> > Any and all help is greatly appreciated!
> >
> > Thanks,
> > Dave
> >
> >
> > On Wed, Sep 25, 2013 at 6:08 AM, David Lang <david at lang.hm> wrote:
> > the cluster is trying to reach quorum (a majority of the nodes
> > talking to each other), and that is never going to happen with only
> > one node, so you have to disable this.
> >
> > try putting
> > <cman two_node="1" expected_votes="1" transport="udpu"/>
> > in your cluster.conf
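> >
> > For context, a minimal cluster.conf sketch showing where that line
> > sits, with a hypothetical cluster name (the node names are the ones
> > from this thread):
> >
> >     <?xml version="1.0"?>
> >     <cluster name="testcluster" config_version="1">
> >       <cman two_node="1" expected_votes="1" transport="udpu"/>
> >       <clusternodes>
> >         <clusternode name="test-vm-1" nodeid="1"/>
> >         <clusternode name="test-vm-2" nodeid="2"/>
> >       </clusternodes>
> >     </cluster>
> >
> > two_node="1" drops the quorum requirement to a single vote, so one
> > surviving node keeps the cluster quorate.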
> >
> > David Lang
> >
> > On Tue, 24 Sep 2013, David Parker wrote:
> >
> > Date: Tue, 24 Sep 2013 11:48:59 -0400
> > From: David Parker <dparker at utica.edu>
> > Reply-To: The Pacemaker cluster resource manager
> > <pacemaker at oss.clusterlabs.org>
> > To: The Pacemaker cluster resource manager
> > <pacemaker at oss.clusterlabs.org>
> > Subject: Re: [Pacemaker] Corosync won't recover when a node fails
> >
> >
> > I forgot to mention, OS is Debian Wheezy 64-bit, Corosync and
> > Pacemaker installed from packages via apt-get, and there are no
> > local firewall rules in place:
> >
> > # iptables -L
> > Chain INPUT (policy ACCEPT)
> > target prot opt source destination
> >
> > Chain FORWARD (policy ACCEPT)
> > target prot opt source destination
> >
> > Chain OUTPUT (policy ACCEPT)
> > target prot opt source destination
> >
> >
> > On Tue, Sep 24, 2013 at 11:41 AM, David Parker <dparker at utica.edu> wrote:
> >
> > Hello,
> >
> > I have a 2-node cluster using Corosync and Pacemaker, where the nodes
> > are actually two VirtualBox VMs on the same physical machine. I have
> > some resources set up in Pacemaker, and everything works fine if I
> > move them in a controlled way with the "crm_resource -r <resource>
> > --move --node <node>" command.
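> >
> > (As I understand it, "crm_resource --move" works by inserting a
> > location constraint that pins the resource to the target node, so it
> > should be cleared again after testing, e.g.:
> >
> >     crm_resource -r <resource> --un-move
> >
> > or "--clear" in newer releases; otherwise the resource stays pinned.)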
> >
> > However, when I hard-fail one of the nodes via the "poweroff" command
> > in VirtualBox, which "pulls the plug" on the VM, the resources do not
> > move, and I see the following output in the log on the remaining node:
> >
> > Sep 24 11:20:30 corosync [TOTEM ] The token was lost in the OPERATIONAL state.
> > Sep 24 11:20:30 corosync [TOTEM ] A processor failed, forming new configuration.
> > Sep 24 11:20:30 corosync [TOTEM ] entering GATHER state from 2.
> > Sep 24 11:20:31 test-vm-2 lrmd: [2503]: debug: rsc:drbd_r0:0 monitor[31] (pid 8495)
> > drbd[8495]: 2013/09/24_11:20:31 WARNING: This resource agent is deprecated and may be removed in a future release. See the man page for details. To suppress this warning, set the "ignore_deprecation" resource parameter to true.
> > drbd[8495]: 2013/09/24_11:20:31 WARNING: This resource agent is deprecated and may be removed in a future release. See the man page for details. To suppress this warning, set the "ignore_deprecation" resource parameter to true.
> > drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Calling drbdadm -c /etc/drbd.conf role r0
> > drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Exit code 0
> > drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Command output: Secondary/Primary
> > drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Calling drbdadm -c /etc/drbd.conf cstate r0
> > drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Exit code 0
> > drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Command output: Connected
> > drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0 status: Secondary/Primary Secondary Primary Connected
> > Sep 24 11:20:31 test-vm-2 lrmd: [2503]: info: operation monitor[31] on drbd_r0:0 for client 2506: pid 8495 exited with return code 0
> > Sep 24 11:20:32 corosync [TOTEM ] entering GATHER state from 0.
> > Sep 24 11:20:34 corosync [TOTEM ] The consensus timeout expired.
> > Sep 24 11:20:34 corosync [TOTEM ] entering GATHER state from 3.
> > Sep 24 11:20:36 corosync [TOTEM ] The consensus timeout expired.
> > Sep 24 11:20:36 corosync [TOTEM ] entering GATHER state from 3.
> > Sep 24 11:20:38 corosync [TOTEM ] The consensus timeout expired.
> > Sep 24 11:20:38 corosync [TOTEM ] entering GATHER state from 3.
> > Sep 24 11:20:40 corosync [TOTEM ] The consensus timeout expired.
> > Sep 24 11:20:40 corosync [TOTEM ] entering GATHER state from 3.
> > Sep 24 11:20:40 corosync [TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
> > Sep 24 11:20:43 corosync [TOTEM ] The consensus timeout expired.
> > Sep 24 11:20:43 corosync [TOTEM ] entering GATHER state from 3.
> > Sep 24 11:20:43 corosync [TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
> > Sep 24 11:20:45 corosync [TOTEM ] The consensus timeout expired.
> > Sep 24 11:20:45 corosync [TOTEM ] entering GATHER state from 3.
> > Sep 24 11:20:45 corosync [TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
> > Sep 24 11:20:47 corosync [TOTEM ] The consensus timeout expired.
> >
> > Those last 3 messages just repeat over and over, the cluster never
> > recovers, and the resources never move. "crm_mon" reports that the
> > resources are still running on the dead node, and shows no indication
> > that anything has gone wrong.
> >
> > Does anyone know what the issue could be? My expectation was that the
> > remaining node would become the sole member of the cluster, take over
> > the resources, and everything would keep running.
> >
> > For reference, my corosync.conf file is below:
> >
> > compatibility: whitetank
> >
> > totem {
> >     version: 2
> >     secauth: off
> >     interface {
> >         member {
> >             memberaddr: 192.168.25.201
> >         }
> >         member {
> >             memberaddr: 192.168.25.202
> >         }
> >         ringnumber: 0
> >         bindnetaddr: 192.168.25.0
> >         mcastport: 5405
> >     }
> >     transport: udpu
> > }
> >
> > logging {
> >     fileline: off
> >     to_logfile: yes
> >     to_syslog: yes
> >     debug: on
> >     logfile: /var/log/cluster/corosync.log
> >     timestamp: on
> >     logger_subsys {
> >         subsys: AMF
> >         debug: on
> >     }
> > }
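> >
> > (One note: with Corosync 1.x the Pacemaker plugin also has to be
> > loaded via a service block; it may live in a separate file under
> > /etc/corosync/service.d/ rather than in corosync.conf itself, which
> > would be why it doesn't appear above:
> >
> >     service {
> >         name: pacemaker
> >         ver: 0
> >     }
> >
> > Without it, Corosync starts fine but Pacemaker can't attach to the
> > membership layer.)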
> >
> >
> > Thanks!
> > Dave
> >
> > --
> > Dave Parker
> > Systems Administrator
> > Utica College
> > Integrated Information Technology Services
> > (315) 792-3229
> > Registered Linux User #408177
> >
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
>
>
> --
> Dave Parker
> Systems Administrator
> Utica College
> Integrated Information Technology Services
> (315) 792-3229
> Registered Linux User #408177
>
>
>
--
Need help with Pacemaker?
http://www.hastexo.com/now