[Pacemaker] Pacemaker remote nodes, naming, and attributes

Lindsay Todd rltodd.ml1 at gmail.com
Tue Jul 2 18:36:43 EDT 2013


You didn't notice that after setting attributes on "db02", the remote node
"db02" went offline as "unclean", even though vm-db02 was still running?
That strikes me as wrong!  Once it gets into this state, I can order
vm-db02 to stop, but it never will.  Indeed, pacemaker doesn't do much at
this point -- I can put everything into standby mode, and services don't
shut down.  That is why I resorted to the forcible reboot, and also why I
don't know (yet) what would happen to a service on db02 when this happens --
it takes too long to restart the cluster to carry out many tests in one day!
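
To make the "standby everything" attempt concrete, this is roughly what I
mean (a sketch only, using the node and resource names from the status
output quoted below):

# crm resource stop vm-db02      ### never actually completes
# crm node standby cvmh01        ### likewise cvmh02, cvmh03, cvmh04
# crm node online cvmh01

and still nothing shuts down or recovers.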

I'll review asymmetrical clusters -- I think my mistake was assuming that an
infinite-score location constraint putting DummyOnVM on db02 would prevent
it from running anywhere else, but of course if db02 isn't running, my one
rule isn't equivalent to having -inf scores everywhere else.  It is still
odd that shutting down vm-db02 would trigger a migration of an unrelated VM.
(The fact that it would also stop vm-swbuildsl6 is the known problem that
constraints don't work well with migration.)
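
If I really want DummyOnVM pinned to db02 and nowhere else, I presumably
need something closer to the following (an untested sketch, using the same
pcs syntax as in the quoted transcript below):

# pcs constraint location DummyOnVM prefers db02=INFINITY
# pcs constraint location DummyOnVM avoids cvmh01 cvmh02 cvmh03 cvmh04

or else make the whole cluster asymmetric, so that nothing runs anywhere
without an explicit location rule.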




On Tue, Jul 2, 2013 at 6:20 PM, David Vossel <dvossel at redhat.com> wrote:

> ----- Original Message -----
> > From: "Lindsay Todd" <rltodd.ml1 at gmail.com>
> > To: "The Pacemaker cluster resource manager" <
> pacemaker at oss.clusterlabs.org>
> > Sent: Tuesday, July 2, 2013 4:05:22 PM
> > Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes
> >
> > Sorry for the delayed response, but I was out last week. I've applied
> > this patch to 1.1.10-rc5 and have been testing:
> >
>
> Thanks for testing :)
>
> >
> >
> > # crm_attribute --type status --node "db02" --name "service_postgresql"
> > --update "true"
> > # crm_attribute --type status --node "db02" --name "service_postgresql"
> > scope=status name=service_postgresql value=true
> > # crm resource stop vm-db02
> > # crm resource start vm-db02
> > ### Wait a bit
> > # crm_attribute --type status --node "db02" --name "service_postgresql"
> > scope=status name=service_postgresql value=(null)
> > Error performing operation: No such device or address
> > # crm_attribute --type status --node "db02" --name "service_postgresql"
> > --update "true"
> > # crm_attribute --type status --node "db02" --name "service_postgresql"
> > scope=status name=service_postgresql value=true
> >
> > Good so far. But now look at this (every node was clean, and all services
> > were running, before we started):
> >
> >
> >
> > # crm status
> > Last updated: Tue Jul 2 16:15:14 2013
> > Last change: Tue Jul 2 16:15:12 2013 via crmd on cvmh02
> > Stack: cman
> > Current DC: cvmh02 - partition with quorum
> > Version: 1.1.10rc5-1.el6.ccni-2718638
> > 9 Nodes configured, unknown expected votes
> > 59 Resources configured.
> >
> >
> > Node db02: UNCLEAN (offline)
> > Online: [ cvmh01 cvmh02 cvmh03 cvmh04 db02:vm-db02 ldap01:vm-ldap01
> > ldap02:vm-ldap02 ]
> > OFFLINE: [ swbuildsl6:vm-swbuildsl6 ]
> >
> > Full list of resources:
> >
> > fence-cvmh01 (stonith:fence_ipmilan): Started cvmh04
> > fence-cvmh02 (stonith:fence_ipmilan): Started cvmh04
> > fence-cvmh03 (stonith:fence_ipmilan): Started cvmh04
> > fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01
> > Clone Set: c-fs-libvirt-VM-xcm [fs-libvirt-VM-xcm]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > Clone Set: c-p-libvirtd [p-libvirtd]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > Clone Set: c-fs-bind-libvirt-VM-cvmh [fs-bind-libvirt-VM-cvmh]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > Clone Set: c-watch-ib0 [p-watch-ib0]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > Clone Set: c-fs-gpfs [p-fs-gpfs]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > vm-compute-test (ocf::ccni:xcatVirtualDomain): Started cvmh03
> > vm-swbuildsl6 (ocf::ccni:xcatVirtualDomain): Stopped
> > vm-db02 (ocf::ccni:xcatVirtualDomain): Started cvmh02
> > vm-ldap01 (ocf::ccni:xcatVirtualDomain): Started cvmh03
> > vm-ldap02 (ocf::ccni:xcatVirtualDomain): Started cvmh04
> > DummyOnVM (ocf::pacemaker:Dummy): Started cvmh01
> >
> > Not so good, and I'm not sure how to clean this up. I can't seem to stop
>
> clean what up?  I don't understand what I'm expected to notice out of
> place here?!  The remote-node is up, everything looks happy.
>
> > vm-db02 any more, even after I've entered:
> >
> >
> >
> > # crm_node -R db02 --force
>
> That won't stop the remote-node. 'crm resource stop vm-db02' should though.
>
> > # crm resource start vm-db02
>
> Ha, I'm so confused. Why are you trying to start it? I thought you were
> trying to stop the resource?
>
> >
> >
> >
> > ### Wait a bit
> >
> >
> >
> > # crm status
> > Last updated: Tue Jul 2 16:32:38 2013
> > Last change: Tue Jul 2 16:27:28 2013 via cibadmin on cvmh01
> > Stack: cman
> > Current DC: cvmh02 - partition with quorum
> > Version: 1.1.10rc5-1.el6.ccni-2718638
> > 8 Nodes configured, unknown expected votes
> > 54 Resources configured.
> >
> >
> > Online: [ cvmh01 cvmh02 cvmh03 cvmh04 ldap01:vm-ldap01 ldap02:vm-ldap02
> > swbuildsl6:vm-swbuildsl6 ]
> > OFFLINE: [ db02:vm-db02 ]
> >
> > fence-cvmh01 (stonith:fence_ipmilan): Started cvmh03
> > fence-cvmh02 (stonith:fence_ipmilan): Started cvmh03
> > fence-cvmh03 (stonith:fence_ipmilan): Started cvmh04
> > fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01
> > Clone Set: c-fs-libvirt-VM-xcm [fs-libvirt-VM-xcm]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > Clone Set: c-p-libvirtd [p-libvirtd]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > Clone Set: c-fs-bind-libvirt-VM-cvmh [fs-bind-libvirt-VM-cvmh]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > Clone Set: c-watch-ib0 [p-watch-ib0]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > Clone Set: c-fs-gpfs [p-fs-gpfs]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > vm-compute-test (ocf::ccni:xcatVirtualDomain): Started cvmh02
> > vm-swbuildsl6 (ocf::ccni:xcatVirtualDomain): Started cvmh01
> > vm-ldap01 (ocf::ccni:xcatVirtualDomain): Started cvmh03
> > vm-ldap02 (ocf::ccni:xcatVirtualDomain): Started cvmh04
> > DummyOnVM (ocf::pacemaker:Dummy): Started cvmh01
> >
> > My only recourse has been to reboot the cluster.
> >
> > So let's do that and try
> > setting a location constraint on DummyOnVM, to force it on db02...
> >
> >
> >
> >
> >
> >
> >
> > Last updated: Tue Jul 2 16:43:46 2013
> > Last change: Tue Jul 2 16:27:28 2013 via cibadmin on cvmh01
> > Stack: cman
> > Current DC: cvmh02 - partition with quorum
> > Version: 1.1.10rc5-1.el6.ccni-2718638
> > 8 Nodes configured, unknown expected votes
> > 54 Resources configured.
> >
> >
> > Online: [ cvmh01 cvmh02 cvmh03 cvmh04 db02:vm-db02 ldap01:vm-ldap01
> > ldap02:vm-ldap02 swbuildsl6:vm-swbuildsl6 ]
> >
> > fence-cvmh01 (stonith:fence_ipmilan): Started cvmh04
> > fence-cvmh02 (stonith:fence_ipmilan): Started cvmh03
> > fence-cvmh03 (stonith:fence_ipmilan): Started cvmh04
> > fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01
> > Clone Set: c-fs-libvirt-VM-xcm [fs-libvirt-VM-xcm]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > Clone Set: c-p-libvirtd [p-libvirtd]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > Clone Set: c-fs-bind-libvirt-VM-cvmh [fs-bind-libvirt-VM-cvmh]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > Clone Set: c-watch-ib0 [p-watch-ib0]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > Clone Set: c-fs-gpfs [p-fs-gpfs]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > vm-compute-test (ocf::ccni:xcatVirtualDomain): Started cvmh01
> > vm-swbuildsl6 (ocf::ccni:xcatVirtualDomain): Started cvmh01
> > vm-db02 (ocf::ccni:xcatVirtualDomain): Started cvmh02
> > vm-ldap01 (ocf::ccni:xcatVirtualDomain): Started cvmh03
> > vm-ldap02 (ocf::ccni:xcatVirtualDomain): Started cvmh04
> > DummyOnVM (ocf::pacemaker:Dummy): Started cvmh03
> >
> > # pcs constraint location DummyOnVM prefers db02
> > # crm status
> > ...
> > Online: [ cvmh01 cvmh02 cvmh03 cvmh04 db02:vm-db02 ldap01:vm-ldap01
> > ldap02:vm-ldap02 swbuildsl6:vm-swbuildsl6 ]
> > ...
> > DummyOnVM (ocf::pacemaker:Dummy): Started db02
> >
> >
> > That's what we want to see. It would be interesting to stop db02. I
> > expect DummyOnVM to stop.
>
> OH, okay, so you wanted DummyOnVM  to start on db02.
>
> >
> >
> >
> > # crm resource stop vm-db02
> > # crm status
> > ...
> > Online: [ cvmh01 cvmh02 cvmh03 cvmh04 ldap01:vm-ldap01 ldap02:vm-ldap02 ]
> > OFFLINE: [ db02:vm-db02 swbuildsl6:vm-swbuildsl6 ]
> > ...
> > DummyOnVM (ocf::pacemaker:Dummy): Started cvmh02
> >
> > Failed actions:
> > vm-compute-test_migrate_from_0 (node=cvmh02, call=147, rc=1,
> > status=Timed Out, last-rc-change=Tue Jul 2 16:48:17 2013,
> > queued=20003ms, exec=0ms): unknown error
> >
> > Well, that is odd. (It is the case that vm-swbuildsl6 has an order
> > dependency on vm-compute-test, as I was trying to understand how
> > migrations worked with order dependencies (not very well).
>
> I don't think this failure has anything to do with the order dependencies.
> If pacemaker attempted to live migrate the VM and it failed, that's a
> resource problem.  Do you have your virtual machine images on shared
> storage?
>
> > Once vm-compute-test recovers,
> > vm-swbuildsl6 does come back up.) This isn't really very good -- if I am
> > running services in VMs or other containers, I need them to run only in
> > those containers!
>
> Read about the differences between asymmetrical and symmetrical clusters.
> I think that will help this make sense.  By default resources can run
> anywhere; you just gave more weight to db02 for the Dummy resource, meaning
> it prefers that node when it is around.
>
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_deciding_which_nodes_a_resource_can_run_on
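
(Noting for my own reference: if I want strictly opt-in placement, I believe
the relevant cluster-wide property is set with something like

# pcs property set symmetric-cluster=false

after which resources should only run where a location constraint explicitly
allows them. I haven't tried this here yet, so treat it as a sketch.)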
>
>
> >
> > If I start vm-db02 back up, I see that DummyOnVM is stopped and moved to
> > db02.
>
> Yep, this is what I'd expect for a symmetrical cluster.
>
> Thanks again for the feedback, hope the asymmetrical/symmetrical cluster
> stuff helps :)
>
> -- Vossel
>
> >
> >
> > On Thu, Jun 20, 2013 at 4:16 PM, David Vossel <dvossel at redhat.com> wrote:
> >
> >
> >
> > ----- Original Message -----
> > > From: "David Vossel" < dvossel at redhat.com >
> > > To: "The Pacemaker cluster resource manager" <
> > > pacemaker at oss.clusterlabs.org >
> > > Sent: Thursday, June 20, 2013 1:35:44 PM
> > > Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes
> > >
> > > ----- Original Message -----
> > > > From: "David Vossel" < dvossel at redhat.com >
> > > > To: "The Pacemaker cluster resource manager"
> > > > < pacemaker at oss.clusterlabs.org >
> > > > Sent: Wednesday, June 19, 2013 4:47:58 PM
> > > > Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes
> > > >
> > > > ----- Original Message -----
> > > > > From: "Lindsay Todd" < rltodd.ml1 at gmail.com >
> > > > > To: "The Pacemaker cluster resource manager"
> > > > > < Pacemaker at oss.clusterlabs.org >
> > > > > Sent: Wednesday, June 19, 2013 4:11:58 PM
> > > > > Subject: [Pacemaker] Pacemaker remote nodes, naming, and attributes
> > > > >
> > > > > I built a set of rpms for pacemaker 1.1.10-rc4 and updated my test
> > > > > cluster
> > > > > (hopefully won't be a "test" cluster forever), as well as my VMs
> > > > > running
> > > > > pacemaker-remote. The OS everywhere is Scientific Linux 6.4. I want
> > > > > to set some attributes on remote nodes, which I can use to control
> > > > > where services run.
> > > > >
> > > > > The first deviation I note from the documentation is the naming of
> > > > > the remote nodes. I see:
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Last updated: Wed Jun 19 16:50:39 2013
> > > > > Last change: Wed Jun 19 16:19:53 2013 via cibadmin on cvmh04
> > > > > Stack: cman
> > > > > Current DC: cvmh02 - partition with quorum
> > > > > Version: 1.1.10rc4-1.el6.ccni-d19719c
> > > > > 8 Nodes configured, unknown expected votes
> > > > > 49 Resources configured.
> > > > >
> > > > >
> > > > > Online: [ cvmh01 cvmh02 cvmh03 cvmh04 db02:vm-db02 ldap01:vm-ldap01
> > > > > ldap02:vm-ldap02 swbuildsl6:vm-swbuildsl6 ]
> > > > >
> > > > > Full list of resources:
> > > > >
> > > > > and so forth. The "remote-node" names are simply the hostname, so
> > > > > the vm-db02 VirtualDomain resource has a remote-node name of db02.
> > > > > The "Pacemaker Remote" manual suggests this should be displayed as
> > > > > "db02", not "db02:vm-db02", although I can see how the latter format
> > > > > would be useful.
> > > >
> > > > Yep, this got changed since the documentation was published. We
> > > > wanted people to be able to recognize which remote-node went with
> > > > which resource easily.
> > > >
> > > > >
> > > > > So now let's set an attribute on this remote node. What name do I
> > > > > use? How about:
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > # crm_attribute --node "db02:vm-db02" \
> > > > > --name "service_postgresql" \
> > > > > --update "true"
> > > > > Could not map name=db02:vm-db02 to a UUID
> > > > > Please choose from one of the matches above and suppy the 'id' with
> > > > > --attr-id
> > > > >
> > > > > Perhaps not the most informative output, but obviously it fails.
> > > > > Let's try the unqualified name:
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > # crm_attribute --node "db02" \
> > > > > --name "service_postgresql" \
> > > > > --update "true"
> > > > > Remote-nodes do not maintain permanent attributes,
> > > > > 'service_postgresql=true' will be removed after db02 reboots.
> > > > > Error setting service_postgresql=true (section=status, set=status-db02):
> > > > > No such device or address
> > > > > Error performing operation: No such device or address
> > >
> > > I just tested this and ran into the same errors you did. Turns out this
> > > happens when the remote-node's status section is empty. If you start a
> > > resource on the node and then set the attribute it will work...
> > > obviously this is a bug. I'm working on a fix.
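
For my own notes, the workaround would presumably look like this: start
something on the remote node so that its status section exists, then set
the transient attribute (a sketch only, not yet verified on my cluster):

# pcs constraint location DummyOnVM prefers db02
# crm_attribute --type status --node "db02" --name "service_postgresql" --update "true"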
> >
> > This should help with the attributes bit.
> >
> >
> > https://github.com/ClusterLabs/pacemaker/commit/26d34a9171bddae67c56ebd8c2513ea8fa770204
> >
> > -- Vossel
> >
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>