[Pacemaker] Asymmetric cluster, clones, and location constraints

Wed Oct 23 19:38:17 UTC 2013

David,

The InfiniBand network takes a nondeterministic amount of time to actually
finish initializing, so we use ethmonitor to watch it.  The OS is supposed
to bring it up at boot time, but it moves on through the boot sequence
without waiting for it, so in self-defense we watch it with Pacemaker.  I
guess I could restructure this to use a resource that brings up IB (with a
really long timeout) and use ordering to wait for that to complete, but it
seems that ethmonitor would be more adaptive to short-term IB network
issues.  Since ethmonitor works by setting an attribute (the RA running
means it is watching the network, not that the network is up), I've used
location constraints instead of ordering constraints.
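
As a rough sketch of how one might confirm what ethmonitor has published,
the function below queries the transient ethmonitor-ib0 attribute on each
named node with crm_attribute.  The helper name check_ib_attr is made up
for illustration, and it assumes the attribute lives in the status section
(lifetime "reboot"), which is an assumption about how the RA stores it.

```shell
# Sketch only: print the transient "ethmonitor-ib0" attribute for each node
# named on the command line. check_ib_attr is a hypothetical helper name.
check_ib_attr() {
    for node in "$@"; do
        printf '%s: ' "$node"
        # --lifetime reboot selects transient (status-section) attributes,
        # which is where ethmonitor is assumed to publish its value.
        crm_attribute --node "$node" --name ethmonitor-ib0 \
                      --lifetime reboot --query ||
            echo "(not set or query failed)"
    done
}
```

Usage would be something like "check_ib_attr cvmh01 cvmh02 cvmh03 cvmh04";
a value of 1 on a VM host is what the location rules are waiting for.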

So I have completely restarted my cluster.  Right now the physical nodes
see each other, and the fencing agents are running.  The first things that
should start are the ethmonitor resource agents on the VM hosts (the
c-watch-ib0 clones of the p-watch-ib0 primitive).  They used to start on
their own, but now they do not.  The CIB snapshot can be seen at
http://pastebin.com/TccTHQPS  (slightly edited to hide passwords in the
fencing agents).
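
For anyone wanting to examine a saved CIB offline, a minimal sketch along
these lines may help.  It assumes the snapshot has been saved locally as a
file, and it filters the score output of crm_simulate for one resource
name.  The helper name show_scores and the file name in the usage line are
placeholders, not anything from my configuration.

```shell
# Sketch only: show the allocation scores the policy engine computes from a
# saved CIB, filtered to one resource. show_scores is a hypothetical helper.
show_scores() {
    cib_file="$1"    # e.g. a file produced by "cibadmin --query"
    resource="$2"    # e.g. c-watch-ib0
    crm_simulate --xml-file "$cib_file" --show-scores |
        grep -F -- "$resource"
}
```

Usage would be something like "show_scores cibsnapshot.cib c-watch-ib0".
A score of -INFINITY for a node means the resource is not allowed to run
there, which should help narrow down which constraint is blocking it.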

/Lindsay

On Wed, Oct 23, 2013 at 11:20 AM, David Vossel <dvossel at redhat.com> wrote:

> ----- Original Message -----
> > From: "Lindsay Todd" <rltodd.ml1 at gmail.com>
> > To: "The Pacemaker cluster resource manager" <
> Pacemaker at oss.clusterlabs.org>
> > Sent: Tuesday, October 22, 2013 4:19:11 PM
> > Subject: [Pacemaker] Asymmetric cluster, clones, and location constraints
> >
> > I am getting rather unexpected behavior when I combine clones, location
> > constraints, and remote nodes in an asymmetric cluster. My cluster is
> > configured to be asymmetric, distinguishing between vmhosts and various
> > sorts of remote nodes. Currently I am running upstream version b6d42ed.
> > I am simplifying my description to avoid confusion, hoping in so doing I
> > don't miss any salient points...
> >
> > My physical cluster nodes, also the VM hosts, have the attribute
> > "nodetype=vmhost". They also have Infiniband interfaces, which take some
> > time to come up. I don't want my shared file system (which needs IB), or
> > libvirtd (which needs the file system), to come up before IB... So I have
> > this in my configuration:
> >
> >
> >
> >
> > primitive p-watch-ib0 ocf:heartbeat:ethmonitor \
> > params \
> > interface="ib0" \
> > op monitor timeout="100s" interval="10s"
> > clone c-watch-ib0 p-watch-ib0 \
> > meta interleave="true"
> > #
> > location loc-watch-ib-only-vmhosts c-watch-ib0 \
> > rule 0: nodetype eq "vmhost"
> >
> > Something broke between upstream versions 0a2570a and c68919f -- the
> > c-watch-ib0 clone never starts. I've found that if I run "crm_resource
> > --force-start -r p-watch-ib0" when IB is running, the ethmonitor-ib0
> > attribute is not set like it used to be. Oh well, I can set it manually.
> > So let's.
>
> A re-write of the attrd component was introduced around that time period.
> This should have been resolved at this point in the b6d42ed build.
>
> > We use GPFS for a shared file system, so I have an agent to start it and
> > wait for a file system to mount. It should only run on VM hosts, and only
> > when IB is running. So I have this:
>
> So the IB resource is setting some attribute that enables the fs to run?
> Why can't an ordering constraint be used here between IB and FS?
>
> >
> >
> >
> >
> > primitive p-fs-gpfs ocf:ccni:gpfs \
> > params \
> > fspath="/gpfs/lb/utility" \
> > op monitor timeout="20s" interval="30s" \
> > op start timeout="180s" \
> > op stop timeout="120s"
> > clone c-fs-gpfs p-fs-gpfs \
> > meta interleave="true"
> > location loc-fs-gpfs-needs-ib0 c-fs-gpfs \
> > rule -inf: not_defined "ethmonitor-ib0" or "ethmonitor-ib0" eq 0
> > location loc-fs-gpfs-on-vmhosts c-fs-gpfs \
> > rule 0: nodetype eq "vmhost"
> >
> > That all used to start nicely. Now even if I set the ethmonitor-ib0
> > attribute, it doesn't. However, I can use "crm_resource --force-start -r
> > p-fs-gpfs" on each of my VM hosts, then issue "crm resource cleanup
> > c-fs-gpfs", and all is well. I can use "crm status" to see something
> > like:
> >
> >
> >
> > Last updated: Tue Oct 22 16:35:43 2013
> > Last change: Tue Oct 22 15:50:52 2013 via crmd on cvmh01
> > Stack: cman
> > Current DC: cvmh04 - partition with quorum
> > Version: 1.1.10-19.el6.ccni-b6d42ed
> > 8 Nodes configured
> > 92 Resources configured
> >
> >
> > Online: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> >
> > fence-cvmh01 (stonith:fence_ipmilan): Started cvmh04
> > fence-cvmh02 (stonith:fence_ipmilan): Started cvmh01
> > fence-cvmh03 (stonith:fence_ipmilan): Started cvmh01
> > fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01
> > Clone Set: c-fs-gpfs [p-fs-gpfs]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> >
> > which is what I would expect (other than I expect pacemaker to have
> > started these for me, like it used to).
> >
> > Now I also have clone resources to NFS-mount another file system, and
> > actually do a bind mount out of the GPFS file system, which behave like
> > the GPFS resource -- they used to just work, now I need to use
> > "crm_resource --force-start" and clean up. That finally lets me start
> > libvirtd, using this configuration:
> >
> >
> >
> >
> > primitive p-libvirtd lsb:libvirtd \
> > op monitor interval="30s"
> > clone c-p-libvirtd p-libvirtd \
> > meta interleave="true"
> > order o-libvirtd-after-storage inf: \
> > ( c-fs-libvirt-VM-xcm c-fs-bind-libvirt-VM-cvmh ) \
> > c-p-libvirtd
> > location loc-libvirtd-on-vmhosts c-p-libvirtd \
> > rule 0: nodetype eq "vmhost"
> >
> > Of course that used to just work, but now, like the other clones, I need
> > to force-start libvirtd on the VM hosts, and clean up. Once I do that,
> > all my VM resources, which are not clones, just start up like they are
> > supposed to! Several of these are configured as remote nodes, and they
> > have services configured to run in them. But now other strange things
> > happen:
> >
> >
> >
> >
> > Last updated: Tue Oct 22 16:46:29 2013
> > Last change: Tue Oct 22 15:50:52 2013 via crmd on cvmh01
> > Stack: cman
> > Current DC: cvmh04 - partition with quorum
> > Version: 1.1.10-19.el6.ccni-b6d42ed
> > 8 Nodes configured
> > 92 Resources configured
> >
> >
> > ContainerNode slurmdb02:vm-slurmdb02: UNCLEAN (offline)
> > Online: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Containers: [ db02:vm-db02 ldap01:vm-ldap01 ldap02:vm-ldap02 ]
> >
> > fence-cvmh01 (stonith:fence_ipmilan): Started cvmh04
> > fence-cvmh02 (stonith:fence_ipmilan): Started cvmh01
> > fence-cvmh03 (stonith:fence_ipmilan): Started cvmh01
> > fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01
> > Clone Set: c-p-libvirtd [p-libvirtd]
> > p-libvirtd (lsb:libvirtd): FAILED slurmdb02
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 ]
> > Clone Set: c-watch-ib0 [p-watch-ib0]
> > p-watch-ib0 (ocf::heartbeat:ethmonitor): FAILED slurmdb02
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 ]
> > Clone Set: c-fs-gpfs [p-fs-gpfs]
> > p-fs-gpfs (ocf::ccni:gpfs): FAILED slurmdb02
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 ]
> > vm-compute-test (ocf::ccni:xcatVirtualDomain): FAILED [ cvmh04 slurmdb02 ]
> > vm-swbuildsl6 (ocf::ccni:xcatVirtualDomain): FAILED slurmdb02
> > vm-db02 (ocf::ccni:xcatVirtualDomain): Started cvmh01
> > vm-ldap01 (ocf::ccni:xcatVirtualDomain): Started cvmh02
> > vm-ldap02 (ocf::ccni:xcatVirtualDomain): Started cvmh03
> > p-postgres (ocf::heartbeat:pgsql): FAILED [ db02 slurmdb02 ]
> > p-mysql (ocf::heartbeat:mysql): FAILED [ db02 slurmdb02 ]
> > Clone Set: c-fs-share-config-data [fs-share-config-data]
> > fs-share-config-data (ocf::heartbeat:Filesystem): FAILED slurmdb02
> > Stopped: [ cvmh01 cvmh02 cvmh03 cvmh04 db02 ldap01 ldap02 ]
> > p-mysql-slurm (ocf::heartbeat:mysql): FAILED slurmdb02
> > p-slurmdbd (ocf::ccni:SlurmDBD): FAILED slurmdb02
> > Clone Set: c-ldapagent [s-ldapagent]
> > s-ldapagent (ocf::ccni:WrapInitScript): FAILED slurmdb02
> > Stopped: [ cvmh01 cvmh02 cvmh03 cvmh04 db02 ldap01 ldap02 ]
> > Clone Set: c-ldap [s-ldap]
> > s-ldap (ocf::ccni:WrapInitScript): FAILED slurmdb02
> > Started: [ ldap01 ldap02 ]
> > Stopped: [ cvmh01 cvmh02 cvmh03 cvmh04 db02 ]
> >
> > Now this is unexpected for a couple of reasons. I do have constraints
> > like:
> >
> >
> >
> >
> > location loc-vm-swbuildsl6 vm-swbuildsl6 \
> > rule $id="loc-vm-swbuildsl6-rule" 0: nodetype eq vmhost
> > order o-vm-swbuildsl6 inf: c-p-libvirtd vm-swbuildsl6
> >
> > And it is not the case that slurmdb02 has the vmhost attribute set; using
> > "crm_mon -o -1 -N -A" we see:
> >
> >
> >
> >
> > Node Attributes:
> > * Node cvmh01:
> > + ethmonitor-ib0 : 1
> > + nodetype : vmhost
> > * Node cvmh02:
> > + ethmonitor-ib0 : 1
> > + nodetype : vmhost
> > * Node cvmh03:
> > + ethmonitor-ib0 : 1
> > + nodetype : vmhost
> > * Node cvmh04:
> > + ethmonitor-ib0 : 1
> > + nodetype : vmhost
> > * Node db02:
> > * Node ldap01:
> > * Node ldap02:
> > * Node slurmdb02:
> >
> > The results are unexpected to me also because I (perhaps naively)
> > wouldn't expect it to show me the new nodes on the "stopped" lines -- I
> > kind of expected a location rule to limit where clones would even be
> > attempted. For example, with the rule limiting c-p-libvirtd to the
> > vmhosts, I don't really expect to be told that the clones are stopped on
> > the remote VM nodes db02, ldap01, and ldap02 (let alone be started on
> > slurmdb02!).
> >
> > Until I wrote this note, even the cloned ldap resource c-ldap needed to
> > be started using force-start. Not sure why this time it started on its
> > own... Perhaps this stack trace in the core dump pacemaker left on one of
> > the VM hosts has a clue?
> >
> >
> >
> >
> >
> > #0 0x00007f121e9ac8e5 in raise () from /lib64/libc.so.6
> > #1 0x00007f121e9ae0c5 in abort () from /lib64/libc.so.6
> > #2 0x00007f121e9ea7f7 in __libc_message () from /lib64/libc.so.6
> > #3 0x00007f121e9f0126 in malloc_printerr () from /lib64/libc.so.6
> > #4 0x00007f121e9f05ad in malloc_consolidate () from /lib64/libc.so.6
> > #5 0x00007f121e9f33c5 in _int_malloc () from /lib64/libc.so.6
> > #6 0x00007f121e9f45e6 in calloc () from /lib64/libc.so.6
> > #7 0x00007f121e9e91ed in open_memstream () from /lib64/libc.so.6
> > #8 0x00007f121ea5ebdb in __vsyslog_chk () from /lib64/libc.so.6
> > #9 0x00007f121ea5f1b3 in __syslog_chk () from /lib64/libc.so.6
> > #10 0x00007f121e72b9fb in ?? () from /usr/lib64/libqb.so.0
> > #11 0x00007f121e72a6a2 in qb_log_real_va_ () from /usr/lib64/libqb.so.0
> > #12 0x00007f121e72a91d in qb_log_real_ () from /usr/lib64/libqb.so.0
> > #13 0x000000000042e994 in te_rsc_command (graph=0x20c7b40,
> > action=0x23b0c90) at te_actions.c:412
>
> This is crashing at a log message.  Apparently we are trying to plug a
> NULL pointer into one of the format string's "%s" entries.  Looking at
> that log message, none of those values should be NULL, so something is
> wrong here.
>
>
> > #14 0x0000003a64404019 in initiate_action (graph=0x20c7b40) at graph.c:172
> > #15 fire_synapse (graph=0x20c7b40) at graph.c:211
> > #16 run_graph (graph=0x20c7b40) at graph.c:366
> > #17 0x000000000042f8cd in te_graph_trigger (user_data=<value optimized out>)
> > at te_utils.c:331
> > #18 0x0000003a6202b283 in crm_trigger_dispatch (source=<value optimized out>,
> > callback=<value optimized out>, userdata=<value optimized out>)
> > at mainloop.c:105
> > #19 0x00000038b3c38f0e in g_main_context_dispatch ()
> > from /lib64/libglib-2.0.so.0
> > #20 0x00000038b3c3c938 in ?? () from /lib64/libglib-2.0.so.0
> > #21 0x00000038b3c3cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
> > #22 0x00000000004058ee in crmd_init () at main.c:154
> > #23 0x0000000000405c2c in main (argc=1, argv=0x7fffdc207528) at main.c:121
> >
> > Not sure how to take this further. It has been difficult to characterize
> > what exactly is or isn't happening, and hopefully I've not left out some
> > critical detail. Thanks.
>
> There is a whole lot going on here, which is making it a bit difficult to
> know where to start.  You are using attributes and rules to enable
> resources.  The attrd has recently been re-written, which could have caused
> some of the problems you are seeing (especially if you ever attempted to
> write an attribute to a remote node using a build from sometime in
> September).
>
> To make this easier to understand, I'd recommend this: get to the point
> where you'd expect a resource to start and it isn't.  Capture the CIB with
> "cibadmin -Q > cibsnapshot.cib".  Pastebin the CIB and tell us which
> resource you'd expect to be starting.  Then we can try to determine
> accurately what is preventing it from starting.  That will at least give us
> something solid to work from.
>
> -- Vossel
>
> > /Lindsay
> >
> >
> >
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >