[Pacemaker] Pacemaker remote nodes, naming, and attributes

Thu Jul 11 17:09:52 UTC 2013

So I've switched my cluster to be asymmetric...

The remote nodes don't appear (although the VMs start, and pacemaker_remote
runs on them).  This is a problem.

If I temporarily switch to symmetric without restarting everything, the
remote nodes appear in the cluster status.  (But I'd have to do more
reconfiguration here.)  It seems the "remote node" resource won't start
on the VM without some imaginary location constraint to help it?

Switch back to asymmetric, and the remote nodes are "offline" or "UNCLEAN",
and pengine is dumping core.  I didn't really expect it to work well, and
such drastic changes should probably be done in maintenance mode, at least,
but in case it is of help in diagnosing the original problem:  "pcs status"
starts like this:

# pcs status
Last updated: Thu Jul 11 12:53:00 2013
Last change: Thu Jul 11 12:43:18 2013 via crmd on cvmh02
Stack: cman
Current DC: cvmh03 - partition with quorum
Version: 1.1.10-3.el6.ccni-bead5ad
11 Nodes configured, unknown expected votes
69 Resources configured.

Node db02:vm-db02: UNCLEAN (offline)
Node ldap01: UNCLEAN (offline)
Node ldap02: UNCLEAN (offline)
Node swbuildsl6: UNCLEAN (offline)
Online: [ cvmh01 cvmh02 cvmh03 cvmh04 db02 ]
OFFLINE: [ ldap01:vm-ldap01 ldap02:vm-ldap02 swbuildsl6:vm-swbuildsl6 ]
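
In case it helps anyone compare node states across these switches, here is a
rough Python sketch that groups node names by the state shown in the output
above.  It assumes only the layout seen in this one status dump ("Node
<name>: <state>" lines plus bracketed Online/OFFLINE lists), and
parse_node_states is a name made up for illustration, not part of pcs or
pacemaker.

```python
import re


def parse_node_states(status_text):
    """Group node names by the state shown in `pcs status`-style text.

    Hypothetical helper: the layout is assumed from the status dump quoted
    in this message, not taken from any pcs or pacemaker specification.
    """
    states = {}
    # Per-node lines, e.g. "Node ldap01: UNCLEAN (offline)".
    for name, state in re.findall(r"^Node (\S+): (.+)$", status_text, re.M):
        states.setdefault(state.strip(), []).append(name)
    # Bracketed lists, e.g. "Online: [ a b ]"; re.S lets a list wrap lines.
    for state, names in re.findall(
        r"^(Online|OFFLINE): \[(.*?)\]", status_text, re.M | re.S
    ):
        states.setdefault(state, []).extend(names.split())
    return states


# Node section copied from the status output above.
SAMPLE = """\
Node db02:vm-db02: UNCLEAN (offline)
Node ldap01: UNCLEAN (offline)
Node ldap02: UNCLEAN (offline)
Node swbuildsl6: UNCLEAN (offline)
Online: [ cvmh01 cvmh02 cvmh03 cvmh04 db02 ]
OFFLINE: [ ldap01:vm-ldap01 ldap02:vm-ldap02 swbuildsl6:vm-swbuildsl6 ]
"""

if __name__ == "__main__":
    for state, names in parse_node_states(SAMPLE).items():
        print(state, names)
```

Run against the sample, this makes it easy to see that db02 is listed as
Online while db02:vm-db02 is UNCLEAN at the same time.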

Traceback:

# gdb /usr/libexec/pacemaker/pengine /var/lib/heartbeat/cores/core.22873
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6_4.1)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/libexec/pacemaker/pengine...Reading symbols from /usr/lib/debug/usr/libexec/pacemaker/pengine.debug...done.
done.
[New Thread 22873]
...

(gdb) where
#0  0x0000003f5b00d6ec in sort_rsc_process_order (a=<value optimized out>,
    b=<value optimized out>, data=<value optimized out>) at allocate.c:1043
#1  0x0000003b67a36979 in ?? () from /lib64/libglib-2.0.so.0
#2  0x0000003b67a3691d in ?? () from /lib64/libglib-2.0.so.0
#3  0x0000003b67a3691d in ?? () from /lib64/libglib-2.0.so.0
#4  0x0000003b67a3691d in ?? () from /lib64/libglib-2.0.so.0
#5  0x0000003b67a3692e in ?? () from /lib64/libglib-2.0.so.0
#6  0x0000003f5b01267a in stage5 (data_set=0x7fff67a6b260) at allocate.c:1149
#7  0x0000003f5b009b71 in do_calculations (data_set=0x7fff67a6b260,
    xml_input=<value optimized out>, now=<value optimized out>)
    at pengine.c:252
#8  0x0000003f5b00a7b2 in process_pe_message (msg=0xbeb710, xml_data=0xbec0b0,
    sender=0xbe2f10) at pengine.c:126
#9  0x000000000040142f in pe_ipc_dispatch (qbc=<value optimized out>,
    data=<value optimized out>, size=29292) at main.c:79
#10 0x0000003b6ae0e874 in ?? () from /usr/lib64/libqb.so.0
#11 0x0000003b6ae0ebc4 in qb_ipcs_dispatch_connection_request ()
   from /usr/lib64/libqb.so.0
#12 0x0000003f5982b0a0 in gio_read_socket (gio=<value optimized out>,
    condition=G_IO_IN, data=0xbe8540) at mainloop.c:453
#13 0x0000003b67a38f0e in g_main_context_dispatch ()
   from /lib64/libglib-2.0.so.0
#14 0x0000003b67a3c938 in ?? () from /lib64/libglib-2.0.so.0
#15 0x0000003b67a3cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
#16 0x0000000000401738 in main (argc=1, argv=0x7fff67a6b858) at main.c:182

After a reboot, I had to remove the nodes ldap01, ldap02, and swbuildsl6.  My
cluster is again working in asymmetric mode, except that the remote nodes are
not appearing online:

# pcs status
Last updated: Thu Jul 11 13:08:22 2013
Last change: Thu Jul 11 13:01:39 2013 via crm_node on cvmh01
Stack: cman
Current DC: cvmh02 - partition with quorum
Version: 1.1.10-3.el6.ccni-bead5ad
8 Nodes configured, unknown expected votes
54 Resources configured.

Online: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
OFFLINE: [ db02:vm-db02 ldap01:vm-ldap01 ldap02:vm-ldap02 swbuildsl6:vm-swbuildsl6 ]

Full list of resources:

 fence-cvmh01   (stonith:fence_ipmilan):        Started cvmh04
 fence-cvmh02   (stonith:fence_ipmilan):        Started cvmh03
 fence-cvmh03   (stonith:fence_ipmilan):        Started cvmh04
 fence-cvmh04   (stonith:fence_ipmilan):        Started cvmh01
 Clone Set: c-fs-libvirt-VM-xcm [fs-libvirt-VM-xcm]
     Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
 Clone Set: c-p-libvirtd [p-libvirtd]
     Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
 Clone Set: c-fs-bind-libvirt-VM-cvmh [fs-bind-libvirt-VM-cvmh]
     Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
 Clone Set: c-watch-ib0 [p-watch-ib0]
     Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
 Clone Set: c-fs-gpfs [p-fs-gpfs]
     Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
 vm-compute-test        (ocf::ccni:xcatVirtualDomain):  Started cvmh01
 vm-swbuildsl6  (ocf::ccni:xcatVirtualDomain):  Started cvmh01
 vm-db02        (ocf::ccni:xcatVirtualDomain):  Started cvmh02
 vm-ldap01      (ocf::ccni:xcatVirtualDomain):  Started cvmh03
 vm-ldap02      (ocf::ccni:xcatVirtualDomain):  Started cvmh04
 DummyOnVM      (ocf::pacemaker:Dummy): Stopped

On Wed, Jul 10, 2013 at 6:43 PM, Lindsay Todd <rltodd.ml1 at gmail.com> wrote:

> Yes, it avoids the crashes.  Thanks!  But I am still seeing spurious VM
> migrations/shutdowns when I stop/start a VM with a remote pacemaker
> (similar to my last update, only no core is dumped while fencing, nor indeed
> does any fencing happen, even though I've now verified that fence_node
> works again).
>
>
> On Wed, Jul 10, 2013 at 2:12 PM, David Vossel <dvossel at redhat.com> wrote:
>
>> ----- Original Message -----
>> > From: "Lindsay Todd" <rltodd.ml1 at gmail.com>
>> > To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
>> > Sent: Wednesday, July 10, 2013 12:11:00 PM
>> > Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes
>> >
>> > Hmm, I'll still submit the bug report, but it seems like crmd is dumping
>> > core while attempting to fence a node. If I use fence_node to fence a real
>> > cluster node, that also causes crmd to dump core. But apart from that, I
>> > don't really see why pacemaker is trying to fence anything.
>>
>> This should solve the crashes you are seeing.
>>
>>
>> https://github.com/ClusterLabs/pacemaker/commit/97dd3b05db867c4674fa4780802bba54c63bd06d
>>
>> -- Vossel
>>
>> >
>> >
>> > On Wed, Jul 10, 2013 at 12:42 PM, Lindsay Todd < rltodd.ml1 at gmail.com >
>> > wrote:
>> >
>> >
>> >
>> > Thanks! But there is still a problem.
>> >
>> > I am now working from the master branch and building RPMs (well, I also
>> > have to rebuild from the srpm to change the build number, since the RPMs
>> > built directly are always 1.1.10-1). The patch is in the git log, and indeed
>> > things are better ... But I still see VMs shutting down spuriously. What
>> > is much improved is that they do get restarted, and basically I end up in
>> > the state I want to be. Can almost live with this, and I was going to start
>> > changing my cluster config to be asymmetric when I noticed that, in the
>> > midst of the spurious transitions, crmd is dumping core.
>> >
>> > So I'll append another crm_report to bug 5164, as well as a gdb traceback.
>> >
>> >
>> > On Fri, Jul 5, 2013 at 5:06 PM, David Vossel < dvossel at redhat.com >
>> > wrote:
>> >
>> >
>> >
>> > ----- Original Message -----
>> > > From: "David Vossel" < dvossel at redhat.com >
>> > > To: "The Pacemaker cluster resource manager" <
>> > > pacemaker at oss.clusterlabs.org >
>> > > Sent: Wednesday, July 3, 2013 4:20:37 PM
>> > > Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes
>> > >
>> > > ----- Original Message -----
>> > > > From: "Lindsay Todd" < rltodd.ml1 at gmail.com >
>> > > > To: "The Pacemaker cluster resource manager"
>> > > > < pacemaker at oss.clusterlabs.org >
>> > > > Sent: Wednesday, July 3, 2013 2:12:05 PM
>> > > > Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes
>> > > >
>> > > > Well, I'm not getting failures right now simply with attributes, but I
>> > > > can induce a failure by stopping the vm-db02 (it puts db02 into an
>> > > > unclean state, and attempts to migrate the unrelated vm-compute-test).
>> > > > I've collected the commands from my latest interactions, a crm_report,
>> > > > and a gdb traceback from the core file that crmd dumped, into bug 5164.
>> > >
>> > >
>> > > Thanks, hopefully I can start investigating this Friday
>> > >
>> > > -- Vossel
>> >
>> > Yeah, this is a bad one. Adding the node attributes using crm_attribute for
>> > the remote-node did some unexpected things to the crmd component. Somehow
>> > the remote-node was getting entered into the cluster node cache... which
>> > made it look like we had both a cluster-node and remote-node named the same
>> > thing... not good.
>> >
>> > I think I got that part worked out. Try this patch.
>> >
>> >
>> > https://github.com/ClusterLabs/pacemaker/commit/67dfff76d632f1796c9ded8fd367aa49258c8c32
>> >
>> > Rather than trying to patch RCs, it might be worth trying out the master
>> > branch on github (which already has this patch). If you aren't already, use
>> > rpms to make your life easier. Running 'make rpm' in the source directory
>> > will generate them for you.
>> >
>> > There was another bug fixed recently in pacemaker_remote involving the
>> > directory created for resource agents to store their temporary data (stuff
>> > like pid files). I believe the fix was not introduced until 1.1.10rc6.
>> >
>> > -- Vossel
>> >
>> >
>> > _______________________________________________
>> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> >
>> > Project Home: http://www.clusterlabs.org
>> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> > Bugs: http://bugs.clusterlabs.org
>> >
>> >
>> >
>> >
>>
>>
>
>