[ClusterLabs] Doing reload right

Andrew Beekhof abeekhof at redhat.com
Wed Jul 20 20:32:25 EDT 2016


On Thu, Jul 21, 2016 at 2:47 AM, Adam Spiers <aspiers at suse.com> wrote:
> Ken Gaillot <kgaillot at redhat.com> wrote:
>> Hello all,
>>
>> I've been meaning to address the implementation of "reload" in Pacemaker
>> for a while now, and I think the next release will be a good time, as it
>> seems to be coming up more frequently.
>
> [snipped]
>
> I don't want to comment directly on any of the excellent points which
> have been raised in this thread, but it seems like a good time to make
> a plea for easier reload / restart of individual instances of cloned
> services, one node at a time.  Currently, if nodes are all managed by
> a configuration management system (such as Chef in our case),

Puppet creates the same kinds of issues.
Both seem designed for a magical world full of unrelated servers that
require no co-ordination to update, which falls apart when the timing
of an update to some central store (CIB, database, whatever) needs to
be carefully ordered.

When you say "restart" though, is that a traditional stop/start cycle
in Pacemaker, which also results in all the dependencies being
stopped?
I'm guessing you really want the "atomic reload" kind, where nothing
else is affected, since we already have the other style covered by
crm_resource --restart.
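
For reference, that existing style looks something like this
("big-clone" and "node1" are made-up names, and whether --restart
honours --node may depend on your Pacemaker version):

    # Stop the resource and everything that depends on it, then
    # start it all again.  For a clone, --node limits the restart
    # to the instance on that one node.
    crm_resource --restart --resource big-clone --node node1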

I propose that we introduce a --force-restart option for crm_resource which:

1. disables any recurring monitor operations
2. calls a native restart action directly on the resource if it
exists, otherwise calls the native stop+start actions
3. re-enables the recurring monitor operations regardless of whether
the restart succeeds, fails, or times out

No maintenance mode required, and whatever state the resource ends up
in is re-detected by the cluster in step 3.
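
Until then, something close can be scripted with existing
crm_resource options.  Untested sketch; "my-rsc" is a made-up name,
and step 1 leans on the per-resource "maintenance" meta-attribute
(not node-level maintenance) to silence the monitors:

    rsc=my-rsc

    # 1. disable recurring monitors for just this resource
    crm_resource --resource "$rsc" --meta \
        --set-parameter maintenance --parameter-value true

    # 2. no native restart action is exposed yet, so fall back to
    #    running the agent's stop+start directly on this node
    crm_resource --resource "$rsc" --force-stop
    crm_resource --resource "$rsc" --force-start

    # 3. re-enable monitoring no matter what happened above, and
    #    make the cluster re-detect the resource's current state
    crm_resource --resource "$rsc" --meta --delete-parameter maintenance
    crm_resource --resource "$rsc" --cleanup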

> when the
> system wants to perform a configuration run on that node (e.g. when
> updating a service's configuration file from a template), it is
> necessary to place the entire node in maintenance mode before
> reloading or restarting that service on that node.  It works OK, but
> can result in ugly effects such as the node getting stuck in
> maintenance mode if the chef-client run failed, without any easy way
> to track down the original cause.
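
A shell trap would at least keep the node from being stuck in
maintenance if the run dies half-way.  A sketch only, not what
crowbar-ha actually does:

    node=$(uname -n)

    # enter node maintenance via the "maintenance" node attribute
    crm_attribute --node "$node" --name maintenance --update true

    # leave maintenance however the run ends, even on failure
    trap 'crm_attribute --node "$node" --name maintenance --delete' EXIT

    chef-client
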
>
> I went through several design iterations before settling on this
> approach, and they are detailed in a lengthy comment here, which may
> help you better understand the challenges we encountered:
>
>   https://github.com/crowbar/crowbar-ha/blob/master/chef/cookbooks/crowbar-pacemaker/providers/service.rb#L61
>
> Similar challenges are posed during upgrade of Pacemaker-managed
> OpenStack infrastructure.
>
> Cheers,
> Adam
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
