[Pacemaker] chicken-egg-problem with libvirtd and a VM within cluster

Tue Oct 16 08:56:46 UTC 2012

On Fri, Oct 12, 2012 at 6:22 PM, Florian Haas <florian at hastexo.com> wrote:
> On Fri, Oct 12, 2012 at 3:18 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
>> This has been a topic that has popped up occasionally over the years.
>> Unfortunately we still don't have a good answer for you.
>>
>> The "least worst" practice has been to have the RA return OCF_STOPPED
>> for non-recurring monitor operations (aka. startup probes) IFF its
>> prerequisites (ie. binaries, or things that might be on a cluster
>> file system) are not available.
>>
>> Possibly we need to begin using the ordering constraints (normally
>> used for ordering start operations) for the startup probes too.
>> Ie. order(A, B) ==> A.start before B.(monitor_0, start)
>>
>> I had been resisting that move, but perhaps it's time.
>>
>> (It would also help avoid slamming the cluster with a bazillion
>> operations in parallel when several nodes start up together)
>>
>> Lars? Florian? Comments?
>
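
A minimal sketch of the "least worst" practice described above, written in Python purely for illustration (real resource agents are usually shell scripts). The return code the message calls OCF_STOPPED corresponds, as far as I know, to OCF_NOT_RUNNING (7) in the OCF return codes. The probe test mirrors the usual "monitor with interval 0" check; the function names and the `check_running` callback are hypothetical.

```python
import shutil

# OCF return codes used in this sketch.
OCF_SUCCESS = 0
OCF_ERR_INSTALLED = 5
OCF_NOT_RUNNING = 7


def is_probe(action, env):
    """A probe is a non-recurring monitor, i.e. a monitor with interval 0."""
    return action == "monitor" and env.get("OCF_RESKEY_CRM_meta_interval") == "0"


def monitor(action, env, required_binaries, check_running, which=shutil.which):
    """Hypothetical monitor entry point for a resource agent.

    If a prerequisite binary is missing, a probe reports "not running"
    instead of an error; a recurring monitor still reports the error.
    """
    missing = [name for name in required_binaries if which(name) is None]
    if missing:
        if is_probe(action, env):
            return OCF_NOT_RUNNING
        return OCF_ERR_INSTALLED
    # Prerequisites are present: delegate to the real status check.
    return OCF_SUCCESS if check_running() else OCF_NOT_RUNNING
```

Note that this encodes exactly the assumption questioned further down the thread: "prerequisite missing" is taken to mean "resource not running", which does not always hold.
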
> Sure. As Tom correctly observes, the problem (as I know it) occurs
> when manually stopping Pacemaker services and then restarting them.

Not really. It mostly occurs at cluster startup.
The most common scenario is when the probe of resource A depends on
data/tools made available by resource B (ie. shared storage).

> As
> it shuts down, Pacemaker kills libvirtd (after migrating off or
> stopping all VMs), and then as you bring it back up, the probe runs
> into an error. The same, btw, applies if you only send the node into
> standby mode.
>
> For manual intervention, the workaround is simply this:
>
> - Stop Pacemaker services, or put node in standby (libvirtd stops in
> the process as the local clone instance shuts down).
> - Do whatever you need to do on that box.
> - Start libvirtd.
> - Start Pacemaker services, or take node online.
>
> For most people, this issue doesn't occur on system boot, as libvirtd
> would normally start before corosync, or corosync/pacemaker isn't part
> of the system bootup sequence at all (the latter is preferred for
> two-node clusters to prevent fencing shootouts in case of cluster
> split brain).
>
> On that ha-kvm.pdf guide, I will add that I'm guessing this is not the
> only piece of information missing or outdated in it. However, I have
> no rights to that document other than to be named as an original
> author and to use it under CC-NC-ND terms like anyone else, and I have
> no access to the sources anymore, so there's no way for me to update
> it. Maybe the Linbit folks are willing/able to do that.
>
> Back on the probe issue, we're in a bit of a catch-22 as libvirtd can
> be freely restarted and stopped while leaving domains (VMs) running.
> So the assumption "if libvirtd doesn't run, then the domain can't be
> running" simply doesn't hold up. In fact, it's outright dangerous, as
> a domain may well run _and have read/write access to shared resources_
> while libvirt isn't running. So doing the naive thing and bailing out of
> monitor if we can't detect a libvirtd pid -- that doesn't fly.

Which sort of emphasises my point that we don't have a good (ie. one
that generically applies to all situations) answer yet.
Which is why I'm thinking about ordered probes.

What about checking for the VM process in this specific scenario though?
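
Roughly what I mean, as a sketch in Python rather than as a patch to any RA. It assumes a KVM/QEMU guest whose emulator binary has "qemu" in its name and that libvirt passes the domain name via "-name", either as "-name NAME" or as "-name guest=NAME,..."; both are assumptions about the command line, and this would not find Xen domains. The function names are made up for the example.

```python
import os


def domain_from_cmdline(argv):
    """Return the guest name from a QEMU argv, or None if it is not QEMU."""
    if not argv or "qemu" not in os.path.basename(argv[0]):
        return None
    for i, arg in enumerate(argv[:-1]):
        if arg == "-name":
            value = argv[i + 1]
            # Assumed formats: "NAME" or "guest=NAME,other=options".
            for part in value.split(","):
                if part.startswith("guest="):
                    return part[len("guest="):]
            return value.split(",")[0]
    return None


def read_cmdlines(proc_root="/proc"):
    """Yield the argv of every process visible under proc_root (Linux only)."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                raw = fh.read()
        except OSError:
            continue  # process exited or is not readable
        if raw:
            parts = raw.rstrip(b"\0").split(b"\0")
            yield [p.decode(errors="replace") for p in parts]


def domain_process_running(domain, cmdlines=None):
    """True if some QEMU process appears to be running the given domain."""
    if cmdlines is None:
        cmdlines = read_cmdlines()
    return any(domain_from_cmdline(argv) == domain for argv in cmdlines)
```

A probe could fall back to something like this only when libvirtd is unreachable, so that it stays read-only and never touches the daemon.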

>
> What would fly is to check for libvirtd on _every_ invocation of the
> RA (well, maybe all except validate and usage), and to restart it on
> the sole condition that we can't detect its pid. That, however, breaks
> the contract that a probe should be non-invasive and really shouldn't
> be touching any system services. Also, a running libvirtd is not
> needed, to the best of my knowledge, when the hypervisor in use is Xen
> rather than KVM. We could mitigate that by making it configurable, but
> the only sane default would be to have this enabled, which again
> breaks said contract.
>
> When virsh is invoked with a qemu:///session URI it will actually
> start up a user-specific libvirtd by itself, but as far as I know
> there is no way to do that for qemu:///system which most people will
> be using.
>
> Andrew, your suggestion would fix that issue, but it would obviously
> make the config more convoluted. In effect, we'd need one order and
> one colo constraint more than we already do.

Not in the shared FS case.  We're reusing existing "B before A" constraints.
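
To illustrate (a toy model in Python, not how the policy engine is implemented): take the existing order constraints and let them apply to the probe of the dependent resource as well as to its start. The resource names come from the example constraint quoted below; `expand_order` and `schedule` are hypothetical helpers, and the model simply assumes each resource is probed before it is started.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def expand_order(constraints, probe_ordered=True):
    """Turn order(first, then) pairs into a map of operation -> prerequisites.

    With probe_ordered=True, order(A, B) means
    A.start before B.(monitor_0, start); otherwise only B.start waits.
    """
    deps = {}
    for first, then in constraints:
        deps.setdefault(f"{first}.monitor_0", set())
        deps.setdefault(f"{first}.start", set()).add(f"{first}.monitor_0")
        deps.setdefault(f"{then}.monitor_0", set())
        deps.setdefault(f"{then}.start", set()).update(
            {f"{then}.monitor_0", f"{first}.start"}
        )
        if probe_ordered:
            deps[f"{then}.monitor_0"].add(f"{first}.start")
    return deps


def schedule(deps):
    """Return one valid execution order for the operations."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for op in schedule(expand_order([("libvirtd", "virtdom_foo")])):
        print(op)
```

No new constraint is needed in the configuration; the same order(A, B) pair just produces one extra dependency edge.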

However in this case, if you care about whether daemon X is running -
and you want pacemaker to enforce it, then you of course need to
define it as a resource and specify constraints.
We're not a crystal ball.

Of course if you want to handle it in the RA, that's fine by me :-)

> For a silly idea, how
> about thinking about being able to define a list of op types in a
> constraint, rather than a single op? As in:
>
> order libvirtd_before_virtdom inf: libvirtd:start virtdom_foo:monitor,start
> colocation virtdom_on_libvirtd inf: virtdom_foo:Started,Probed libvirtd:Started

Uh, no thanks :)

> (Of course no such thing as a "Probed" role currently exists, so here
> we go down the rabbit hole...)
>
> I hope this is useful. Thoughts are much appreciated.
>
> Cheers,
> Florian