[Pacemaker] Pacemaker remote nodes, naming, and attributes

David Vossel dvossel at redhat.com
Mon Jul 15 18:16:11 EDT 2013


----- Original Message -----
> From: "Lindsay Todd" <rltodd.ml1 at gmail.com>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Thursday, July 11, 2013 12:09:52 PM
> Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes
> 
> So I've switched my cluster to be asymmetric...
> 
> The remote nodes don't appear (although the VMs start, and pacemaker_remote
> runs on them). This is a problem.
> 
> Temporarily switch to symmetric without restarting everything, and the remote
> nodes appear in the cluster status. (But I'd have to do more reconfiguration
> here.) Seems like the "remote node" resource won't start on the VM without
> some imaginary location constraint to help it?

Ha, yep, that's exactly what is going on. I can fix that; I'll post a patch soon.
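
For anyone following along: an "asymmetric" (opt-in, symmetric-cluster=false) cluster is one where resources run nowhere unless a location constraint explicitly allows them. A minimal sketch of that setup with pcs, using resource and node names from the status output quoted below:

    # make the cluster opt-in: nothing runs without an enabling constraint
    pcs property set symmetric-cluster=false
    # explicitly allow the VM resource to run on the hypervisor nodes
    pcs constraint location vm-db02 prefers cvmh01=100 cvmh02=100

The bug confirmed above is that the implicit remote-node connection resource was effectively held to the same opt-in requirement, even though there is no real constraint a user could write for it.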

> 
> Switch back to asymmetric, and the remote nodes are "offline" or "UNCLEAN",
> and pengine is dumping core. I didn't really expect it to work well, and

Thanks for the backtrace. Can you by any chance provide the pengine files that caused the crash? Everything within an hour either side of Thu Jul 11 12:53:00 2013 would probably do it.
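
For reference, the pengine files being asked for are the policy-engine input snapshots the DC archives for each transition; on most 1.1.x builds they live somewhere like /var/lib/pacemaker/pengine (the exact path can vary by package):

    # list the most recent PE input files around the crash time
    ls -lt /var/lib/pacemaker/pengine/pe-input-*.bz2 | head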

> such drastic changes should probably be done in maintenance mode, at least,

Well, it should be "safe" to do that sort of change without having to worry about segfaults and things of that nature.

-- Vossel

> but in case it is of help in diagnosing the original problem: "pcs status"
> starts like this:
> 
> # pcs status
> Last updated: Thu Jul 11 12:53:00 2013
> Last change: Thu Jul 11 12:43:18 2013 via crmd on cvmh02
> Stack: cman
> Current DC: cvmh03 - partition with quorum
> Version: 1.1.10-3.el6.ccni-bead5ad
> 11 Nodes configured, unknown expected votes
> 69 Resources configured.
> 
> 
> Node db02:vm-db02: UNCLEAN (offline)
> Node ldap01: UNCLEAN (offline)
> Node ldap02: UNCLEAN (offline)
> Node swbuildsl6: UNCLEAN (offline)
> Online: [ cvmh01 cvmh02 cvmh03 cvmh04 db02 ]
> OFFLINE: [ ldap01:vm-ldap01 ldap02:vm-ldap02 swbuildsl6:vm-swbuildsl6 ]
> 
> 
> Traceback:
> 
> # gdb /usr/libexec/pacemaker/pengine /var/lib/heartbeat/cores/core.22873
> GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6_4.1)
> Copyright (C) 2010 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /usr/libexec/pacemaker/pengine...Reading symbols from
> /usr/lib/debug/usr/libexec/pacemaker/pengine.debug...done.
> done.
> [New Thread 22873]
> ...
> 
> (gdb) where
> #0 0x0000003f5b00d6ec in sort_rsc_process_order (a=<value optimized out>,
> b=<value optimized out>, data=<value optimized out>) at allocate.c:1043
> #1 0x0000003b67a36979 in ?? () from /lib64/libglib-2.0.so.0
> #2 0x0000003b67a3691d in ?? () from /lib64/libglib-2.0.so.0
> #3 0x0000003b67a3691d in ?? () from /lib64/libglib-2.0.so.0
> #4 0x0000003b67a3691d in ?? () from /lib64/libglib-2.0.so.0
> #5 0x0000003b67a3692e in ?? () from /lib64/libglib-2.0.so.0
> #6 0x0000003f5b01267a in stage5 (data_set=0x7fff67a6b260) at allocate.c:1149
> #7 0x0000003f5b009b71 in do_calculations (data_set=0x7fff67a6b260,
> xml_input=<value optimized out>, now=<value optimized out>)
> at pengine.c:252
> #8 0x0000003f5b00a7b2 in process_pe_message (msg=0xbeb710, xml_data=0xbec0b0,
> sender=0xbe2f10) at pengine.c:126
> #9 0x000000000040142f in pe_ipc_dispatch (qbc=<value optimized out>,
> data=<value optimized out>, size=29292) at main.c:79
> #10 0x0000003b6ae0e874 in ?? () from /usr/lib64/libqb.so.0
> #11 0x0000003b6ae0ebc4 in qb_ipcs_dispatch_connection_request ()
> from /usr/lib64/libqb.so.0
> #12 0x0000003f5982b0a0 in gio_read_socket (gio=<value optimized out>,
> condition=G_IO_IN, data=0xbe8540) at mainloop.c:453
> #13 0x0000003b67a38f0e in g_main_context_dispatch ()
> from /lib64/libglib-2.0.so.0
> #14 0x0000003b67a3c938 in ?? () from /lib64/libglib-2.0.so.0
> #15 0x0000003b67a3cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
> #16 0x0000000000401738 in main (argc=1, argv=0x7fff67a6b858) at main.c:182
> 
> 
> After a reboot, I had to remove the nodes ldap01, ldap02, swbuildsl6. My
> cluster is again working in asymmetric mode, except the remote nodes are not
> appearing online:
> 
> # pcs status
> Last updated: Thu Jul 11 13:08:22 2013
> Last change: Thu Jul 11 13:01:39 2013 via crm_node on cvmh01
> Stack: cman
> Current DC: cvmh02 - partition with quorum
> Version: 1.1.10-3.el6.ccni-bead5ad
> 8 Nodes configured, unknown expected votes
> 54 Resources configured.
> 
> 
> Online: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> OFFLINE: [ db02:vm-db02 ldap01:vm-ldap01 ldap02:vm-ldap02
> swbuildsl6:vm-swbuildsl6 ]
> 
> Full list of resources:
> 
> fence-cvmh01 (stonith:fence_ipmilan): Started cvmh04
> fence-cvmh02 (stonith:fence_ipmilan): Started cvmh03
> fence-cvmh03 (stonith:fence_ipmilan): Started cvmh04
> fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01
> Clone Set: c-fs-libvirt-VM-xcm [fs-libvirt-VM-xcm]
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> Clone Set: c-p-libvirtd [p-libvirtd]
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> Clone Set: c-fs-bind-libvirt-VM-cvmh [fs-bind-libvirt-VM-cvmh]
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> Clone Set: c-watch-ib0 [p-watch-ib0]
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> Clone Set: c-fs-gpfs [p-fs-gpfs]
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> vm-compute-test (ocf::ccni:xcatVirtualDomain): Started cvmh01
> vm-swbuildsl6 (ocf::ccni:xcatVirtualDomain): Started cvmh01
> vm-db02 (ocf::ccni:xcatVirtualDomain): Started cvmh02
> vm-ldap01 (ocf::ccni:xcatVirtualDomain): Started cvmh03
> vm-ldap02 (ocf::ccni:xcatVirtualDomain): Started cvmh04
> DummyOnVM (ocf::pacemaker:Dummy): Stopped
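
For reference, stale node entries like the ones mentioned above can be purged with crm_node (the "Last change ... via crm_node" line in this status suggests that is how it was done); a minimal sketch, using one of the node names from the thread:

    # purge a stale node from the CIB and the cluster node caches
    crm_node --force --remove ldap01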
> 
> 
> 
> 
> 
> On Wed, Jul 10, 2013 at 6:43 PM, Lindsay Todd <rltodd.ml1 at gmail.com> wrote:
> 
> 
> 
> Yes, it avoids the crashes. Thanks! But I am still seeing spurious VM
> migrations/shutdowns when I stop/start a VM running pacemaker_remote (similar
> to my last update, only no core is dumped while fencing, nor indeed does any
> fencing happen, even though I've now verified that fence_node works again).
> 
> 
> On Wed, Jul 10, 2013 at 2:12 PM, David Vossel <dvossel at redhat.com> wrote:
> 
> 
> 
> ----- Original Message -----
> > From: "Lindsay Todd" < rltodd.ml1 at gmail.com >
> > To: "The Pacemaker cluster resource manager" <
> > pacemaker at oss.clusterlabs.org >
> > Sent: Wednesday, July 10, 2013 12:11:00 PM
> > Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes
> > 
> > Hmm, I'll still submit the bug report, but it seems like crmd is dumping
> > core
> > while attempting to fence a node. If I use fence_node to fence a real
> > cluster node, that also causes crmd to dump core. But apart from that, I
> > don't really see why pacemaker is trying to fence anything.
> 
> This should solve the crashes you are seeing.
> 
> https://github.com/ClusterLabs/pacemaker/commit/97dd3b05db867c4674fa4780802bba54c63bd06d
> 
> -- Vossel
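
For anyone who wants to test a fix like this before it ships: GitHub serves any commit as a mailbox-format patch if you append .patch to the commit URL, so one way to apply it to a source checkout is, roughly:

    cd pacemaker
    curl -L https://github.com/ClusterLabs/pacemaker/commit/97dd3b05db867c4674fa4780802bba54c63bd06d.patch | git am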
> 
> > 
> > 
> > On Wed, Jul 10, 2013 at 12:42 PM, Lindsay Todd <rltodd.ml1 at gmail.com>
> > wrote:
> > 
> > 
> > 
> > Thanks! But there is still a problem.
> > 
> > I am now working from the master branch and building RPMs (well, I have to
> > also rebuild from the srpm to change the build number, since the RPMs built
> > directly are always 1.1.10-1). The patch is in the git log, and indeed
> > things are better ... But I still see the spurious VMs shutting down. What
> > is much improved is that they do get restarted, and basically I end up in
> > the state I want to be in. I can almost live with this, but I was going to
> > start changing my cluster config to be asymmetric when I noticed that, in
> > the midst of the spurious transitions, crmd is dumping core.
> > 
> > So I'll append another crm_report to bug 5164, as well as a gdb traceback.
> > 
> > 
> > On Fri, Jul 5, 2013 at 5:06 PM, David Vossel <dvossel at redhat.com> wrote:
> > 
> > 
> > 
> > ----- Original Message -----
> > > From: "David Vossel" < dvossel at redhat.com >
> > > To: "The Pacemaker cluster resource manager" <
> > > pacemaker at oss.clusterlabs.org >
> > > Sent: Wednesday, July 3, 2013 4:20:37 PM
> > > Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes
> > > 
> > > ----- Original Message -----
> > > > From: "Lindsay Todd" < rltodd.ml1 at gmail.com >
> > > > To: "The Pacemaker cluster resource manager"
> > > > < pacemaker at oss.clusterlabs.org >
> > > > Sent: Wednesday, July 3, 2013 2:12:05 PM
> > > > Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes
> > > > 
> > > > Well, I'm not getting failures right now simply with attributes, but I can
> > > > induce a failure by stopping vm-db02 (it puts db02 into an unclean state,
> > > > and attempts to migrate the unrelated vm-compute-test). I've collected the
> > > > commands from my latest interactions, a crm_report, and a gdb traceback
> > > > from the core file that crmd dumped, into bug 5164.
> > > 
> > > 
> > > Thanks, hopefully I can start investigating this Friday
> > > 
> > > -- Vossel
> > 
> > Yeah, this is a bad one. Adding node attributes with crm_attribute for the
> > remote-node did some unexpected things to the crmd component. Somehow the
> > remote-node was getting entered into the cluster node cache, which made it
> > look like we had both a cluster node and a remote node with the same
> > name... not good.
> > 
> > I think I got that part worked out. Try this patch.
> > 
> > https://github.com/ClusterLabs/pacemaker/commit/67dfff76d632f1796c9ded8fd367aa49258c8c32
> > 
> > Rather than trying to patch RCs, it might be worth trying out the master
> > branch on GitHub (which already has this patch). If you aren't already, use
> > RPMs to make your life easier. Running 'make rpm' in the source directory
> > will generate them for you.
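
That is, roughly (the repository URL is inferred from the commit links in this thread):

    git clone https://github.com/ClusterLabs/pacemaker.git
    cd pacemaker
    make rpm    # builds RPMs from the current checkout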
> > 
> > There was another bug fixed recently in pacemaker_remote involving the
> > directory created for resource agents to store their temporary data (stuff
> > like pid files). I believe the fix was not introduced until 1.1.10rc6.
> > 
> > -- Vossel
> > 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 



