[Pacemaker] Pacemaker remote nodes, naming, and attributes

David Vossel dvossel at redhat.com
Tue Jul 2 18:20:37 EDT 2013


----- Original Message -----
> From: "Lindsay Todd" <rltodd.ml1 at gmail.com>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Tuesday, July 2, 2013 4:05:22 PM
> Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes
> 
> Sorry for the delayed response, but I was out last week. I've applied this
> patch to 1.1.10-rc5 and have been testing:
> 

Thanks for testing :)

> 
> 
> # crm_attribute --type status --node "db02" --name "service_postgresql"
> --update "true"
> # crm_attribute --type status --node "db02" --name "service_postgresql"
> scope=status name=service_postgresql value=true
> # crm resource stop vm-db02
> # crm resource start vm-db02
> ### Wait a bit
> # crm_attribute --type status --node "db02" --name "service_postgresql"
> scope=status name=service_postgresql value=(null)
> Error performing operation: No such device or address
> # crm_attribute --type status --node "db02" --name "service_postgresql"
> --update "true"
> # crm_attribute --type status --node "db02" --name "service_postgresql"
> scope=status name=service_postgresql value=true
> 
> Good so far. But now look at this (every node was clean, and all services
> were running, before we started):
> 
> 
> 
> # crm status
> Last updated: Tue Jul 2 16:15:14 2013
> Last change: Tue Jul 2 16:15:12 2013 via crmd on cvmh02
> Stack: cman
> Current DC: cvmh02 - partition with quorum
> Version: 1.1.10rc5-1.el6.ccni-2718638
> 9 Nodes configured, unknown expected votes
> 59 Resources configured.
> 
> 
> Node db02: UNCLEAN (offline)
> Online: [ cvmh01 cvmh02 cvmh03 cvmh04 db02:vm-db02 ldap01:vm-ldap01
> ldap02:vm-ldap02 ]
> OFFLINE: [ swbuildsl6:vm-swbuildsl6 ]
> 
> Full list of resources:
> 
> fence-cvmh01 (stonith:fence_ipmilan): Started cvmh04
> fence-cvmh02 (stonith:fence_ipmilan): Started cvmh04
> fence-cvmh03 (stonith:fence_ipmilan): Started cvmh04
> fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01
> Clone Set: c-fs-libvirt-VM-xcm [fs-libvirt-VM-xcm]
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> Clone Set: c-p-libvirtd [p-libvirtd]
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> Clone Set: c-fs-bind-libvirt-VM-cvmh [fs-bind-libvirt-VM-cvmh]
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> Clone Set: c-watch-ib0 [p-watch-ib0]
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> Clone Set: c-fs-gpfs [p-fs-gpfs]
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> vm-compute-test (ocf::ccni:xcatVirtualDomain): Started cvmh03
> vm-swbuildsl6 (ocf::ccni:xcatVirtualDomain): Stopped
> vm-db02 (ocf::ccni:xcatVirtualDomain): Started cvmh02
> vm-ldap01 (ocf::ccni:xcatVirtualDomain): Started cvmh03
> vm-ldap02 (ocf::ccni:xcatVirtualDomain): Started cvmh04
> DummyOnVM (ocf::pacemaker:Dummy): Started cvmh01
>
> Not so good, and I'm not sure how to clean this up. I can't seem to stop

Clean what up? I'm not sure what I'm expected to notice out of place here. The remote-node is up, and everything looks happy.

> vm-db02 any more, even after I've entered:
> 
> 
> 
> # crm_node -R db02 --force

That won't stop the remote-node. 'crm resource stop vm-db02' should though.
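
If it helps, a minimal sequence using the resource names from your output
above would be something like this (a sketch):

# crm_node -R db02 --force    # removes the node entry, but stops nothing
# crm resource stop vm-db02   # stops the VM, taking remote-node db02 offline
# crm status                  # db02:vm-db02 should now be listed as OFFLINE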

> # crm resource start vm-db02

Ha, now I'm confused. Why are you trying to start it? I thought you were trying to stop the resource.

> 
> ### Wait a bit
> 
> # crm status
> Last updated: Tue Jul 2 16:32:38 2013
> Last change: Tue Jul 2 16:27:28 2013 via cibadmin on cvmh01
> Stack: cman
> Current DC: cvmh02 - partition with quorum
> Version: 1.1.10rc5-1.el6.ccni-2718638
> 8 Nodes configured, unknown expected votes
> 54 Resources configured.
> 
> 
> Online: [ cvmh01 cvmh02 cvmh03 cvmh04 ldap01:vm-ldap01 ldap02:vm-ldap02
> swbuildsl6:vm-swbuildsl6 ]
> OFFLINE: [ db02:vm-db02 ]
> 
> fence-cvmh01 (stonith:fence_ipmilan): Started cvmh03
> fence-cvmh02 (stonith:fence_ipmilan): Started cvmh03
> fence-cvmh03 (stonith:fence_ipmilan): Started cvmh04
> fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01
> Clone Set: c-fs-libvirt-VM-xcm [fs-libvirt-VM-xcm]
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> Clone Set: c-p-libvirtd [p-libvirtd]
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> Clone Set: c-fs-bind-libvirt-VM-cvmh [fs-bind-libvirt-VM-cvmh]
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> Clone Set: c-watch-ib0 [p-watch-ib0]
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> Clone Set: c-fs-gpfs [p-fs-gpfs]
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> vm-compute-test (ocf::ccni:xcatVirtualDomain): Started cvmh02
> vm-swbuildsl6 (ocf::ccni:xcatVirtualDomain): Started cvmh01
> vm-ldap01 (ocf::ccni:xcatVirtualDomain): Started cvmh03
> vm-ldap02 (ocf::ccni:xcatVirtualDomain): Started cvmh04
> DummyOnVM (ocf::pacemaker:Dummy): Started cvmh01
> 
> My only recourse has been to reboot the cluster.
>
> So let's do that and try
> setting a location constraint on DummyOnVM, to force it on db02...
> 
> Last updated: Tue Jul 2 16:43:46 2013
> Last change: Tue Jul 2 16:27:28 2013 via cibadmin on cvmh01
> Stack: cman
> Current DC: cvmh02 - partition with quorum
> Version: 1.1.10rc5-1.el6.ccni-2718638
> 8 Nodes configured, unknown expected votes
> 54 Resources configured.
> 
> 
> Online: [ cvmh01 cvmh02 cvmh03 cvmh04 db02:vm-db02 ldap01:vm-ldap01
> ldap02:vm-ldap02 swbuildsl6:vm-swbuildsl6 ]
> 
> fence-cvmh01 (stonith:fence_ipmilan): Started cvmh04
> fence-cvmh02 (stonith:fence_ipmilan): Started cvmh03
> fence-cvmh03 (stonith:fence_ipmilan): Started cvmh04
> fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01
> Clone Set: c-fs-libvirt-VM-xcm [fs-libvirt-VM-xcm]
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> Clone Set: c-p-libvirtd [p-libvirtd]
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> Clone Set: c-fs-bind-libvirt-VM-cvmh [fs-bind-libvirt-VM-cvmh]
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> Clone Set: c-watch-ib0 [p-watch-ib0]
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> Clone Set: c-fs-gpfs [p-fs-gpfs]
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> vm-compute-test (ocf::ccni:xcatVirtualDomain): Started cvmh01
> vm-swbuildsl6 (ocf::ccni:xcatVirtualDomain): Started cvmh01
> vm-db02 (ocf::ccni:xcatVirtualDomain): Started cvmh02
> vm-ldap01 (ocf::ccni:xcatVirtualDomain): Started cvmh03
> vm-ldap02 (ocf::ccni:xcatVirtualDomain): Started cvmh04
> DummyOnVM (ocf::pacemaker:Dummy): Started cvmh03
> 
> # pcs constraint location DummyOnVM prefers db02
> # crm status
> ...
> Online: [ cvmh01 cvmh02 cvmh03 cvmh04 db02:vm-db02 ldap01:vm-ldap01
> ldap02:vm-ldap02 swbuildsl6:vm-swbuildsl6 ]
> ...
> DummyOnVM (ocf::pacemaker:Dummy): Started db02
> 
> 
> That's what we want to see. It would be interesting to stop db02. I expect
> DummyOnVM to stop.

Oh, okay, so you wanted DummyOnVM to start on db02.

> 
> 
> 
> # crm resource stop vm-db02
> # crm status
> ...
> Online: [ cvmh01 cvmh02 cvmh03 cvmh04 ldap01:vm-ldap01 ldap02:vm-ldap02 ]
> OFFLINE: [ db02:vm-db02 swbuildsl6:vm-swbuildsl6 ]
> ...
> DummyOnVM (ocf::pacemaker:Dummy): Started cvmh02
> 
> Failed actions:
> vm-compute-test_migrate_from_0 (node=cvmh02, call=147, rc=1, status=Timed
> Out, last-rc-change=Tue Jul 2 16:48:17 2013
> , queued=20003ms, exec=0ms
> ): unknown error
> 
> Well, that is odd. (It is the case that vm-swbuildsl6 has an order dependency
> on vm-compute-test, as I was trying to understand how migrations worked with
> order dependencies (not very well).

I don't think this failure has anything to do with the order dependencies. If Pacemaker attempted to live-migrate the VM and the migration failed, that's a resource problem. Do you have your virtual machine images on shared storage?
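
For comparison, a live-migratable VM using the stock VirtualDomain agent
would be configured roughly like this (a sketch only; the resource name,
config path, and transport are hypothetical, and your
ocf::ccni:xcatVirtualDomain agent may take different parameters):

# crm configure primitive vm-test ocf:heartbeat:VirtualDomain \
    params config=/shared/qemu/vm-test.xml \
           hypervisor="qemu:///system" \
           migration_transport=ssh \
    meta allow-migrate=true

For live migration to work, that config file and the VM's backing disk
images have to be reachable from every node that can host the VM, which
is why shared storage matters here.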

> Once vm-compute-test recovers, vm-swbuildsl6 does come back up.) This
> isn't really very good -- if I am running services in VMs or other
> containers, I need them to run only in that container!

Read about the differences between asymmetrical and symmetrical clusters; I think that will make this behavior make sense. By default resources can run anywhere; you just gave db02 more weight for the Dummy resource, meaning it prefers that node when the node is available.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_deciding_which_nodes_a_resource_can_run_on
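
For example, to switch to an opt-in (asymmetrical) cluster, where nothing
runs anywhere until a location constraint allows it (a sketch; note that
every existing resource would then need its own enabling constraints for
the cvmh* nodes):

# pcs property set symmetric-cluster=false
# pcs constraint location DummyOnVM prefers db02=INFINITY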


> 
> If I start vm-db02 back up, I see that DummyOnVM is stopped and moved to
> db02.

Yep, this is what I'd expect for a symmetrical cluster.
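
If you'd rather keep the cluster symmetrical and just pin this one
resource to its container, so it stops instead of migrating when db02
goes away, a -INFINITY rule should do it. A sketch in crm shell syntax,
with the constraint id chosen arbitrarily:

# crm configure location only-on-db02 DummyOnVM \
    rule -inf: '#uname' ne db02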

Thanks again for the feedback; I hope the asymmetrical/symmetrical cluster pointer helps :)

-- Vossel

> 
> 
> On Thu, Jun 20, 2013 at 4:16 PM, David Vossel < dvossel at redhat.com > wrote:
> 
> 
> 
> ----- Original Message -----
> > From: "David Vossel" < dvossel at redhat.com >
> > To: "The Pacemaker cluster resource manager" <
> > pacemaker at oss.clusterlabs.org >
> > Sent: Thursday, June 20, 2013 1:35:44 PM
> > Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes
> > 
> > ----- Original Message -----
> > > From: "David Vossel" < dvossel at redhat.com >
> > > To: "The Pacemaker cluster resource manager"
> > > < pacemaker at oss.clusterlabs.org >
> > > Sent: Wednesday, June 19, 2013 4:47:58 PM
> > > Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes
> > > 
> > > ----- Original Message -----
> > > > From: "Lindsay Todd" < rltodd.ml1 at gmail.com >
> > > > To: "The Pacemaker cluster resource manager"
> > > > < Pacemaker at oss.clusterlabs.org >
> > > > Sent: Wednesday, June 19, 2013 4:11:58 PM
> > > > Subject: [Pacemaker] Pacemaker remote nodes, naming, and attributes
> > > > 
> > > > I built a set of rpms for pacemaker 1.1.10-rc4 and updated my test
> > > > cluster (hopefully won't be a "test" cluster forever), as well as my
> > > > VMs running pacemaker-remote. The OS everywhere is Scientific Linux
> > > > 6.4. I want to set some attributes on remote nodes, which I can use
> > > > to control where services run.
> > > > 
> > > > The first deviation I note from the documentation is the naming of
> > > > the remote nodes. I see:
> > > > 
> > > > Last updated: Wed Jun 19 16:50:39 2013
> > > > Last change: Wed Jun 19 16:19:53 2013 via cibadmin on cvmh04
> > > > Stack: cman
> > > > Current DC: cvmh02 - partition with quorum
> > > > Version: 1.1.10rc4-1.el6.ccni-d19719c
> > > > 8 Nodes configured, unknown expected votes
> > > > 49 Resources configured.
> > > > 
> > > > 
> > > > Online: [ cvmh01 cvmh02 cvmh03 cvmh04 db02:vm-db02 ldap01:vm-ldap01
> > > > ldap02:vm-ldap02 swbuildsl6:vm-swbuildsl6 ]
> > > > 
> > > > Full list of resources:
> > > > 
> > > > and so forth. The "remote-node" names are simply the hostname, so
> > > > the vm-db02 VirtualDomain resource has a remote-node name of db02.
> > > > The "Pacemaker Remote" manual suggests this should be displayed as
> > > > "db02", not "db02:vm-db02", although I can see how the latter format
> > > > would be useful.
> > > 
> > > Yep, this got changed since the documentation was published. We wanted
> > > people to be able to recognize which remote-node went with which resource
> > > easily.
> > > 
> > > > 
> > > > So now let's set an attribute on this remote node. What name do I
> > > > use? How about:
> > > > 
> > > > # crm_attribute --node "db02:vm-db02" \
> > > > --name "service_postgresql" \
> > > > --update "true"
> > > > Could not map name=db02:vm-db02 to a UUID
> > > > Please choose from one of the matches above and suppy the 'id' with
> > > > --attr-id
> > > > 
> > > > Perhaps not the most informative output, but obviously it fails.
> > > > Let's try the unqualified name:
> > > > 
> > > > # crm_attribute --node "db02" \
> > > > --name "service_postgresql" \
> > > > --update "true"
> > > > Remote-nodes do not maintain permanent attributes,
> > > > 'service_postgresql=true' will be removed after db02 reboots.
> > > > Error setting service_postgresql=true (section=status,
> > > > set=status-db02): No such device or address
> > > > Error performing operation: No such device or address
> > 
> > I just tested this and ran into the same errors you did. Turns out this
> > happens when the remote-node's status section is empty. If you start a
> > resource on the node and then set the attribute it will work... obviously
> > this is a bug. I'm working on a fix.
> 
> This should help with the attributes bit.
> 
> https://github.com/ClusterLabs/pacemaker/commit/26d34a9171bddae67c56ebd8c2513ea8fa770204
> 
> -- Vossel
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 



