[Pacemaker] Pacemaker remote nodes, naming, and attributes

Wed Jul 3 00:40:13 UTC 2013

----- Original Message -----
> From: "Lindsay Todd" <rltodd.ml1 at gmail.com>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Tuesday, July 2, 2013 5:36:43 PM
> Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes
> 
> You didn't notice that after setting attributes on "db02", the remote node
> "db02" went offline as "unclean", even though vm-db02 was still running?

nope... apparently I'm blind :)

> That strikes me as wrong! Once it gets into this state, I can order vm-db02
> to stop, but it never will. Indeed, pacemaker doesn't do much at this point

I'm really confused about how a remote-node could manage to get into an "UNCLEAN" state. Interesting. Can you reproduce it easily? A crm_report attached to a bugs.clusterlabs.org issue would be helpful. If you haven't erased your logs, you could still retrieve everything in the report for the specific time period it occurred in. I definitely need to get that worked out.
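
For the report itself, something along these lines might work as a starting point. It is only a sketch: the time window and destination path are assumptions picked around the 16:15 status output quoted below, and the script just prints the crm_report command so it can be reviewed before running it on a cluster node.

```shell
#!/bin/sh
# Sketch: build a crm_report command for a fixed time window.
# FROM, TO and DEST are assumed values, not taken from an actual report.
FROM="${FROM:-2013-07-02 16:00:00}"
TO="${TO:-2013-07-02 17:00:00}"
DEST="${DEST:-/tmp/db02-unclean}"

report_cmd() {
    # -f (from) and -t (to) bound the period collected into the report;
    # the last argument is the destination name for the report.
    printf 'crm_report -f "%s" -t "%s" %s\n' "$1" "$2" "$3"
}

# Print the command for review; run it by hand on a cluster node.
report_cmd "$FROM" "$TO" "$DEST"
```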

> -- I can put everything into standby mode, and services don't shut down.
> That is why the forcible reboot. Also, why I don't know (yet) what would
> happen to a service on db02 when this happens -- it takes too long to
> restart the cluster to carry out too many tests in one day!
> 
> I'll review asymmetrical clusters -- I think my mistake was thinking an
> infinite score location constraint to put DummyOnVM on db02 would prevent it
> from running anywhere else, but of course if db02 isn't running, my one rule
> isn't equivalent to having -inf scores elsewhere. Still odd that shutting
> down vm-db02 would trigger a migration of an unrelated VM.

Look into resource stickiness. Setting a default resource stickiness should prevent this. It might be that shutting down vm-db02 somehow meant that pacemaker decided to balance out the resources in a way that involved migrating the other vm.
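
As a rough sketch of what that could look like with the crm shell: the value 100 is an arbitrary illustration, not a tuned recommendation, and the script only prints the command rather than applying it.

```shell
#!/bin/sh
# Sketch: set a cluster-wide default resource stickiness.
# STICKINESS=100 is an assumed example value; pick one that fits your scores.
STICKINESS="${STICKINESS:-100}"

stickiness_cmd() {
    # rsc_defaults applies to every resource that does not override it.
    printf 'crm configure rsc_defaults resource-stickiness=%s\n' "$1"
}

# Print the command for review; run it by hand on a cluster node.
stickiness_cmd "$STICKINESS"
```

A positive stickiness makes a running resource prefer to stay where it is, so an unrelated change is less likely to trigger a rebalance.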

-- Vossel

> (The fact that it
> would also stop vm-swbuild is the known problem that constraints don't work
> well with migration.)
> 
> 
> 
> 
> On Tue, Jul 2, 2013 at 6:20 PM, David Vossel < dvossel at redhat.com > wrote:
> 
> 
> 
> ----- Original Message -----
> > From: "Lindsay Todd" < rltodd.ml1 at gmail.com >
> > To: "The Pacemaker cluster resource manager" <
> > pacemaker at oss.clusterlabs.org >
> > Sent: Tuesday, July 2, 2013 4:05:22 PM
> > Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes
> > 
> > Sorry for the delayed response, but I was out last week. I've applied this
> > patch to 1.1.10-rc5 and have been testing:
> > 
> 
> Thanks for testing :)
> 
> > 
> > 
> > # crm_attribute --type status --node "db02" --name "service_postgresql"
> > --update "true"
> > # crm_attribute --type status --node "db02" --name "service_postgresql"
> > scope=status name=service_postgresql value=true
> > # crm resource stop vm-db02
> > # crm resource start vm-db02
> > ### Wait a bit
> > # crm_attribute --type status --node "db02" --name "service_postgresql"
> > scope=status name=service_postgresql value=(null)
> > Error performing operation: No such device or address
> > # crm_attribute --type status --node "db02" --name "service_postgresql"
> > --update "true"
> > # crm_attribute --type status --node "db02" --name "service_postgresql"
> > scope=status name=service_postgresql value=true
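
Since the attribute is gone after the VM restarts, a small helper could re-apply it when it is missing. This is a sketch, assuming that a query for an unset attribute exits non-zero (as the "No such device or address" error above suggests); ensure_status_attr is a hypothetical name, not a Pacemaker tool.

```shell
#!/bin/sh
# Sketch: re-apply a status attribute on a remote node when it is missing,
# for example after the VM resource has been stopped and started again.
# Only checks that the attribute exists, not that its value matches.
ensure_status_attr() {
    node="$1"; name="$2"; value="$3"
    # Assumption: --query fails when the attribute is not set on the node.
    if crm_attribute --type status --node "$node" --name "$name" --query >/dev/null 2>&1; then
        echo "$name already set on $node"
    else
        crm_attribute --type status --node "$node" --name "$name" --update "$value"
    fi
}

# Usage on a cluster node:
#   ensure_status_attr db02 service_postgresql true
```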
> > 
> > Good so far. But now look at this (every node was clean, and all services
> > were running, before we started):
> > 
> > 
> > 
> > # crm status
> > Last updated: Tue Jul 2 16:15:14 2013
> > Last change: Tue Jul 2 16:15:12 2013 via crmd on cvmh02
> > Stack: cman
> > Current DC: cvmh02 - partition with quorum
> > Version: 1.1.10rc5-1.el6.ccni-2718638
> > 9 Nodes configured, unknown expected votes
> > 59 Resources configured.
> > 
> > 
> > Node db02: UNCLEAN (offline)
> > Online: [ cvmh01 cvmh02 cvmh03 cvmh04 db02:vm-db02 ldap01:vm-ldap01
> > ldap02:vm-ldap02 ]
> > OFFLINE: [ swbuildsl6:vm-swbuildsl6 ]
> > 
> > Full list of resources:
> > 
> > fence-cvmh01 (stonith:fence_ipmilan): Started cvmh04
> > fence-cvmh02 (stonith:fence_ipmilan): Started cvmh04
> > fence-cvmh03 (stonith:fence_ipmilan): Started cvmh04
> > fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01
> > Clone Set: c-fs-libvirt-VM-xcm [fs-libvirt-VM-xcm]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > Clone Set: c-p-libvirtd [p-libvirtd]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > Clone Set: c-fs-bind-libvirt-VM-cvmh [fs-bind-libvirt-VM-cvmh]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > Clone Set: c-watch-ib0 [p-watch-ib0]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > Clone Set: c-fs-gpfs [p-fs-gpfs]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > vm-compute-test (ocf::ccni:xcatVirtualDomain): Started cvmh03
> > vm-swbuildsl6 (ocf::ccni:xcatVirtualDomain): Stopped
> > vm-db02 (ocf::ccni:xcatVirtualDomain): Started cvmh02
> > vm-ldap01 (ocf::ccni:xcatVirtualDomain): Started cvmh03
> > vm-ldap02 (ocf::ccni:xcatVirtualDomain): Started cvmh04
> > DummyOnVM (ocf::pacemaker:Dummy): Started cvmh01
> > 
> > Not so good, and I'm not sure how to clean this up. I can't seem to stop
> 
> clean what up? I don't understand what I'm expected to notice out of place
> here?! The remote-node is up, everything looks happy.
> 
> > vm-db02 any more, even after I've entered:
> > 
> > 
> > 
> > # crm_node -R db02 --force
> 
> That won't stop the remote-node. 'crm resource stop vm-db02' should though.
> 
> > # crm resource start vm-db02
> 
> ha, I'm so confused. why are you trying to start it? I thought you were
> trying to stop the resource?
> 
> > 
> > 
> > 
> > ### Wait a bit
> > 
> > 
> > 
> > # crm status
> > Last updated: Tue Jul 2 16:32:38 2013
> > Last change: Tue Jul 2 16:27:28 2013 via cibadmin on cvmh01
> > Stack: cman
> > Current DC: cvmh02 - partition with quorum
> > Version: 1.1.10rc5-1.el6.ccni-2718638
> > 8 Nodes configured, unknown expected votes
> > 54 Resources configured.
> > 
> > 
> > Online: [ cvmh01 cvmh02 cvmh03 cvmh04 ldap01:vm-ldap01 ldap02:vm-ldap02
> > swbuildsl6:vm-swbuildsl6 ]
> > OFFLINE: [ db02:vm-db02 ]
> > 
> > fence-cvmh01 (stonith:fence_ipmilan): Started cvmh03
> > fence-cvmh02 (stonith:fence_ipmilan): Started cvmh03
> > fence-cvmh03 (stonith:fence_ipmilan): Started cvmh04
> > fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01
> > Clone Set: c-fs-libvirt-VM-xcm [fs-libvirt-VM-xcm]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > Clone Set: c-p-libvirtd [p-libvirtd]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > Clone Set: c-fs-bind-libvirt-VM-cvmh [fs-bind-libvirt-VM-cvmh]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > Clone Set: c-watch-ib0 [p-watch-ib0]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > Clone Set: c-fs-gpfs [p-fs-gpfs]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > vm-compute-test (ocf::ccni:xcatVirtualDomain): Started cvmh02
> > vm-swbuildsl6 (ocf::ccni:xcatVirtualDomain): Started cvmh01
> > vm-ldap01 (ocf::ccni:xcatVirtualDomain): Started cvmh03
> > vm-ldap02 (ocf::ccni:xcatVirtualDomain): Started cvmh04
> > DummyOnVM (ocf::pacemaker:Dummy): Started cvmh01
> > 
> > My only recourse has been to reboot the cluster.
> > 
> > So let's do that and try
> > setting a location constraint on DummyOnVM, to force it on db02...
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > Last updated: Tue Jul 2 16:43:46 2013
> > Last change: Tue Jul 2 16:27:28 2013 via cibadmin on cvmh01
> > Stack: cman
> > Current DC: cvmh02 - partition with quorum
> > Version: 1.1.10rc5-1.el6.ccni-2718638
> > 8 Nodes configured, unknown expected votes
> > 54 Resources configured.
> > 
> > 
> > Online: [ cvmh01 cvmh02 cvmh03 cvmh04 db02:vm-db02 ldap01:vm-ldap01
> > ldap02:vm-ldap02 swbuildsl6:vm-swbuildsl6 ]
> > 
> > fence-cvmh01 (stonith:fence_ipmilan): Started cvmh04
> > fence-cvmh02 (stonith:fence_ipmilan): Started cvmh03
> > fence-cvmh03 (stonith:fence_ipmilan): Started cvmh04
> > fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01
> > Clone Set: c-fs-libvirt-VM-xcm [fs-libvirt-VM-xcm]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > Clone Set: c-p-libvirtd [p-libvirtd]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > Clone Set: c-fs-bind-libvirt-VM-cvmh [fs-bind-libvirt-VM-cvmh]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > Clone Set: c-watch-ib0 [p-watch-ib0]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > Clone Set: c-fs-gpfs [p-fs-gpfs]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > vm-compute-test (ocf::ccni:xcatVirtualDomain): Started cvmh01
> > vm-swbuildsl6 (ocf::ccni:xcatVirtualDomain): Started cvmh01
> > vm-db02 (ocf::ccni:xcatVirtualDomain): Started cvmh02
> > vm-ldap01 (ocf::ccni:xcatVirtualDomain): Started cvmh03
> > vm-ldap02 (ocf::ccni:xcatVirtualDomain): Started cvmh04
> > DummyOnVM (ocf::pacemaker:Dummy): Started cvmh03
> > 
> > # pcs constraint location DummyOnVM prefers db02
> > # crm status
> > ...
> > Online: [ cvmh01 cvmh02 cvmh03 cvmh04 db02:vm-db02 ldap01:vm-ldap01
> > ldap02:vm-ldap02 swbuildsl6:vm-swbuildsl6 ]
> > ...
> > DummyOnVM (ocf::pacemaker:Dummy): Started db02
> > 
> > 
> > That's what we want to see. It would be interesting to stop db02. I expect
> > DummyOnVM to stop.
> 
> OH, okay, so you wanted DummyOnVM to start on db02.
> 
> > 
> > 
> > 
> > # crm resource stop vm-db02
> > # crm status
> > ...
> > Online: [ cvmh01 cvmh02 cvmh03 cvmh04 ldap01:vm-ldap01 ldap02:vm-ldap02 ]
> > OFFLINE: [ db02:vm-db02 swbuildsl6:vm-swbuildsl6 ]
> > ...
> > DummyOnVM (ocf::pacemaker:Dummy): Started cvmh02
> > 
> > Failed actions:
> > vm-compute-test_migrate_from_0 (node=cvmh02, call=147, rc=1, status=Timed
> > Out, last-rc-change=Tue Jul 2 16:48:17 2013
> > , queued=20003ms, exec=0ms
> > ): unknown error
> > 
> > Well, that is odd. (It is the case that vm-swbuildsl6 has an order dependency
> > on vm-compute-test, as I was trying to understand how migrations worked with
> > order dependencies (not very well).
> 
> I don't think this failure has anything to do with the order dependencies. If
> pacemaker attempted to live migrate the vm and it fails, that's a resource
> problem. Do you have your virtual machine images on shared storage?
> 
> > Once vm-compute-test recovers,
> > vm-swbuildsl6 does come back up.) This isn't really very good -- if I am
> > running services in VM or other containers, I need them to run only in that
> > container!
> 
> Read about the differences between asymmetrical and symmetrical clusters. I
> think this will help it make sense. By default resources can run anywhere;
> you just gave more weight to db02 for the Dummy resource, meaning it prefers
> that node when it is around.
> 
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_deciding_which_nodes_a_resource_can_run_on
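
One possible way to get the "only on db02" behaviour without switching the whole cluster to opt-in is a -INFINITY rule for every other node. This is a sketch under assumptions: the constraint id is made up, the crm shell rule syntax is assumed to be available, and whether '#uname' matches remote-node names the same way in this release has not been verified. The script only prints the command.

```shell
#!/bin/sh
# Sketch: keep a resource off every node except one in a symmetric cluster.
# Resource and node names come from this thread; the constraint id
# "<resource>-only-<node>" is a hypothetical naming choice.
only_on_cmd() {
    rsc="$1"; node="$2"
    # -inf for every node whose name is not $node; '#uname' is quoted so the
    # shell does not treat it as the start of a comment.
    printf "crm configure location %s-only-%s %s rule -inf: '#uname' ne %s\n" \
        "$rsc" "$node" "$rsc" "$node"
}

# Print the command for review; run it by hand on a cluster node.
only_on_cmd DummyOnVM db02
```

With a rule like this in place, the resource would be expected to stop rather than move when db02 goes away, because no other node has an allowed score.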
> 
> 
> > 
> > If I start vm-db02 back up, I see that DummyOnVM is stopped and moved to
> > db02.
> 
> Yep, this is what I'd expect for a symmetrical cluster.
> 
> Thanks again for the feedback, hope the asymmetrical/symmetrical cluster
> stuff helps :)
> 
> -- Vossel
> 
> > 
> > 
> > On Thu, Jun 20, 2013 at 4:16 PM, David Vossel < dvossel at redhat.com > wrote:
> > 
> > 
> > 
> > ----- Original Message -----
> > > From: "David Vossel" < dvossel at redhat.com >
> > > To: "The Pacemaker cluster resource manager" <
> > > pacemaker at oss.clusterlabs.org >
> > > Sent: Thursday, June 20, 2013 1:35:44 PM
> > > Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes
> > > 
> > > ----- Original Message -----
> > > > From: "David Vossel" < dvossel at redhat.com >
> > > > To: "The Pacemaker cluster resource manager"
> > > > < pacemaker at oss.clusterlabs.org >
> > > > Sent: Wednesday, June 19, 2013 4:47:58 PM
> > > > Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes
> > > > 
> > > > ----- Original Message -----
> > > > > From: "Lindsay Todd" < rltodd.ml1 at gmail.com >
> > > > > To: "The Pacemaker cluster resource manager"
> > > > > < Pacemaker at oss.clusterlabs.org >
> > > > > Sent: Wednesday, June 19, 2013 4:11:58 PM
> > > > > Subject: [Pacemaker] Pacemaker remote nodes, naming, and attributes
> > > > > 
> > > > > I built a set of rpms for pacemaker 1.1.10-rc4 and updated my test cluster
> > > > > (hopefully won't be a "test" cluster forever), as well as my VMs running
> > > > > pacemaker-remote. The OS everywhere is Scientific Linux 6.4. I want to
> > > > > set some attributes on remote nodes, which I can use to control where
> > > > > services run.
> > > > > 
> > > > > The first deviation I note from the documentation is the naming of the
> > > > > remote nodes. I see:
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > Last updated: Wed Jun 19 16:50:39 2013
> > > > > Last change: Wed Jun 19 16:19:53 2013 via cibadmin on cvmh04
> > > > > Stack: cman
> > > > > Current DC: cvmh02 - partition with quorum
> > > > > Version: 1.1.10rc4-1.el6.ccni-d19719c
> > > > > 8 Nodes configured, unknown expected votes
> > > > > 49 Resources configured.
> > > > > 
> > > > > 
> > > > > Online: [ cvmh01 cvmh02 cvmh03 cvmh04 db02:vm-db02 ldap01:vm-ldap01
> > > > > ldap02:vm-ldap02 swbuildsl6:vm-swbuildsl6 ]
> > > > > 
> > > > > Full list of resources:
> > > > > 
> > > > > and so forth. The "remote-node" names are simply the hostname, so the
> > > > > vm-db02 VirtualDomain resource has a remote-node name of db02. The
> > > > > "Pacemaker Remote" manual suggests this should be displayed as "db02", not
> > > > > "db02:vm-db02", although I can see how the latter format would be useful.
> > > > 
> > > > Yep, this got changed since the documentation was published. We wanted
> > > > people to be able to recognize which remote-node went with which resource
> > > > easily.
> > > > 
> > > > > 
> > > > > So now let's set an attribute on this remote node. What name do I use?
> > > > > How about:
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > # crm_attribute --node "db02:vm-db02" \
> > > > > --name "service_postgresql" \
> > > > > --update "true"
> > > > > Could not map name=db02:vm-db02 to a UUID
> > > > > Please choose from one of the matches above and suppy the 'id' with
> > > > > --attr-id
> > > > > 
> > > > > Perhaps not the most informative output, but obviously it fails. Let's
> > > > > try the unqualified name:
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > # crm_attribute --node "db02" \
> > > > > --name "service_postgresql" \
> > > > > --update "true"
> > > > > Remote-nodes do not maintain permanent attributes, 'service_postgresql=true' will be removed after db02 reboots.
> > > > > Error setting service_postgresql=true (section=status, set=status-db02): No such device or address
> > > > > Error performing operation: No such device or address
> > > 
> > > I just tested this and ran into the same errors you did. Turns out this
> > > happens when the remote-node's status section is empty. If you start a
> > > resource on the node and then set the attribute it will work... obviously
> > > this is a bug. I'm working on a fix.
> > 
> > This should help with the attributes bit.
> > 
> > https://github.com/ClusterLabs/pacemaker/commit/26d34a9171bddae67c56ebd8c2513ea8fa770204
> > 
> > -- Vossel
> > 
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > 
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org