[Pacemaker] Multi-site support in pacemaker (tokens, deadman, CTR)

Tue Apr 26 15:34:16 UTC 2011

Hi,

On 01/13/11 17:14, Lars Marowsky-Bree wrote:
> Hi all,
>
> sorry for the delay in posting this.
And sorry for the delay in replying to this :-) I have some questions about 
this below.

>
> Introduction: At LPC 2010, we discussed (once more) that a key feature
> for pacemaker in 2011 would be improved support for multi-site clusters;
> by multi-site, we mean two (or more) sites with a local cluster each,
Would the topology of such a multi-site deployment be indicated in the CIB 
configuration? Or is it just something corosync would need to care about?

And would the CIBs of the different sites still be synchronized? In 
other words, normally there would be only one DC among the sites, right?

> and some higher level entity coordinating fail-over across these (as
> opposed to "stretched" clusters, where a single cluster might span the
> whole campus in the city).
>
> Typically, such multi-site environments are also too far apart to
> support synchronous communication/replication.
>
> There are several aspects to this that we discussed; Andrew and I first
> described and wrote this out a few years ago, so I hope he can remember
> the rest ;-)
>
> "Tokens" are, essentially, cluster-wide attributes (similar to node
> attributes, just for the whole partition).
Specifically, a "<tokens>" section with an attribute set ("<token_set>" 
or something) under "/cib/configuration"?

Should an admin grant a token to the cluster initially? Or grant it to 
several nodes which are supposed to be from the same site? Or grant it to 
a partition after a split-brain happens? A split-brain can happen 
between the sites or inside a site. How could the two cases be 
distinguished, and which policies should handle each scenario? What if a 
partition splits further?

Additionally, when a split-brain happens, what about the existing stonith 
mechanism? Should the partition without quorum be stonithed? If it 
shouldn't be, or if it couldn't be, should the partition elect a DC? What 
about the no-quorum-policy?

> Via dependencies (similar to
> rsc_location), one can specify that certain resources require a specific
> token to be set before being started
Which way do you prefer? I found that you discussed this in another thread 
last year. The choices mentioned there were:
- A "<rsc_order>" with "Deadman" order-type specified:
   <rsc_order id="order-tokenA-rscX" first-token="tokenA" then="rscX"
              kind="Deadman"/>

- A "<rsc_colocation>":
   <rsc_colocation id="rscX-with-tokenA" rsc="rscX" with-token="tokenA"
                   kind="Deadman"/>

Other choices I can imagine:

- There could be a "requires" field in an "op", which could be set to 
"quorum" or "fencing". Similarly, we could also introduce a 
"requires-token" field:

<op id="rscX-start" name="start" interval="0" requires-token="tokenA"/>

The shortcoming is that a resource cannot depend on multiple tokens.

- A "<rsc_location>" with expressions:

   <rsc_location id="loc-rscX" rsc="rscX" kind="Deadman">
     <rule id="loc-rscX-rule-0">
       <expression id="expr-0" attribute="#tokenA" operation="eq"
                   value="true"/>
     </rule>
   </rsc_location>

Via boolean-op, a resource can depend on multiple tokens, or any one of 
the specified multiple tokens.

- A completely new type of constraint:
   <rsc_token id="rscX-with-tokenA" rsc="rscX" token="tokenA"
              kind="Deadman"/>
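
To make the "<rsc_location>" variant with boolean-op a bit more concrete, 
here is a rough Python sketch of how a rule over token attributes could be 
evaluated. It is only a model of the idea, not pacemaker code: the function 
name, the token dictionary and the "true" value convention are my 
assumptions.

```python
from typing import Dict, List


def rule_matches(tokens: Dict[str, str], required: List[str],
                 boolean_op: str = "and") -> bool:
    """Evaluate a hypothetical rule made of `#<token> eq true` expressions.

    tokens     -- current token values of this partition (assumed layout)
    required   -- names of the tokens referenced by the rule's expressions
    boolean_op -- "and": all tokens needed, "or": any one token suffices
    """
    # A token that is missing from the dictionary counts as "not true".
    results = [tokens.get(name) == "true" for name in required]
    if boolean_op == "and":
        return all(results)
    if boolean_op == "or":
        return any(results)
    raise ValueError(f"unknown boolean-op: {boolean_op!r}")


site = {"tokenA": "true", "tokenB": "false"}
print(rule_matches(site, ["tokenA", "tokenB"], "and"))  # False
print(rule_matches(site, ["tokenA", "tokenB"], "or"))   # True
```

With "and" the resource may only run while the site holds every listed 
token; with "or" holding any one of them is enough.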

> (and, vice versa, need to be
> stopped if the token is cleared). You could also think of our current
> "quorum" as a special, cluster-wide token that is granted in case of
> node majority.
>
> The token thus would be similar to a "site quorum"; i.e., the permission
> to manage/own resources associated with that site, which would be
> recorded in a rsc dependency. (It'd probably make a lot of sense if this
> would support resource sets,
If so, the "op" and the current "rsc_location" approaches are not 
preferred, since neither of them supports resource sets.

> so one can easily list all the resources;
> also, some resources like m/s may tie their role to token ownership.)
>
> These tokens can be granted/revoked either manually (which I actually
> expect will be the default for the classic enterprise clusters), or via
> an automated mechanism described further below.
>
>
> Another aspect to site fail-over is recovery speed. A site can only
> activate the resources safely if it can be sure that the other site has
> deactivated them. Waiting for them to shutdown "cleanly" could incur
> very high latency (think "cascaded stop delays"). So, it would be
> desirable if this could be short-circuited. The idea between Andrew and
> myself was to introduce the concept of a "dead man" dependency; if the
> origin goes away, nodes which host dependent resources are fenced,
> immensely speeding up recovery.
Does the "origin" mean the token? If so, isn't it supposed to be revoked 
manually by default? So would the short-circuited fail-over need an admin 
to participate?

BTW, Xinwei once suggested treating "the token is not set" and "the 
token is set to no" differently. In the former case, the behavior would be 
as if the token dependencies didn't exist. If the token is explicitly set, 
the appropriate policies would be invoked. Does that help to distinguish 
the scenarios?
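
A rough Python sketch of that three-state idea, just to illustrate what I 
mean. The names, the "yes"/"no" values and the per-dependency "deadman" 
flag are assumptions of mine, not an existing pacemaker interface:

```python
from typing import Optional


def token_action(value: Optional[str], deadman: bool = False) -> str:
    """Decide how to treat a resource that depends on a single token.

    value   -- None if the token was never set, otherwise "yes" or "no"
    deadman -- hypothetical flag on the dependency: fence instead of stop
    """
    if value is None:
        # Token not set at all: behave as if the dependency did not exist.
        return "ignore-dependency"
    if value == "yes":
        # Token granted: the resource is allowed to run here.
        return "allow"
    if value == "no":
        # Token explicitly revoked: apply the configured policy.
        return "fence" if deadman else "stop"
    raise ValueError(f"unexpected token value: {value!r}")


print(token_action(None))                # ignore-dependency
print(token_action("no"))                # stop
print(token_action("no", deadman=True))  # fence
```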

>
> It seems to make most sense to make this an attribute of some sort for
> the various dependencies that we already have, possibly, to make this
> generally available. (It may also be something admins want to
> temporarily disable - i.e., for a graceful switch-over, they may not
> want to trigger the dead man process always.)
Does it mean an option for users to choose whether they want immediate 
fencing or a normal stop of the resources? Is it global, or specific to 
a particular token, or even just to a specific dependency?

>
>
> The next bit is what we called the "Cluster Token Registry"; for those
> scenarios where the site switch is supposed to be automatic (instead of
> the admin revoking the token somewhere, waiting for everything to stop,
> and then granting it on the desired site). The participating clusters
> would run a daemon/service that would connect to each other, exchange
> information on their connectivity details (though conceivably, not mere
> majority is relevant, but also current ownership, admin weights, time
> of day, capacity ...), and vote on which site gets which token(s); a
> token would only be granted to a site once they can be sure that it has
> been relinquished by the previous owner, which would need to be
> implemented via a timer in most scenarios (see the dead man flag).
>
> Further, sites which lose the vote (either explicitly or implicitly by
> being disconnected from the voting body) would obviously need to perform
> said release after a sane time-out (to protect against brief connection
> issues).
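If I read the timer part correctly, the hand-over rule could be modelled 
roughly as below. This is only a sketch under my own assumptions: the 
class and function names, the renewal-based lease and the extra safety 
margin are hypothetical, and the sites' clocks are assumed to be 
comparable, which a real implementation could not take for granted.

```python
from dataclasses import dataclass


@dataclass
class TokenLease:
    owner: str           # site currently holding the token
    last_renewal: float  # time (seconds) of the last acknowledged renewal
    timeout: float       # seconds without renewal before owner must release


def owner_must_release(lease: TokenLease, now: float) -> bool:
    """Owner side: give the token up once the lease could not be renewed
    for `timeout` seconds (vote lost or disconnected from the others)."""
    return now - lease.last_renewal >= lease.timeout


def may_grant(lease: TokenLease, candidate: str, now: float,
              margin: float = 10.0) -> bool:
    """Registry side: grant to another site only after the old owner's
    timeout plus a safety margin has passed, so it must have released."""
    if candidate == lease.owner:
        return True
    return now - lease.last_renewal >= lease.timeout + margin


lease = TokenLease(owner="siteA", last_renewal=100.0, timeout=60.0)
print(owner_must_release(lease, now=150.0))  # False
print(owner_must_release(lease, now=160.0))  # True
print(may_grant(lease, "siteB", now=165.0))  # False
print(may_grant(lease, "siteB", now=170.0))  # True
```

The margin is there so that the new site never starts resources before 
the old owner's own timer has fired.
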
>
>
> A final component is an idea to ease administration and management of
> such environments. The dependencies allow an automated tool to identify
> which resources are affected by a given token, and this could be
> automatically replicated (and possibly transformed) between sites, to
> ensure that all sites have an up-to-date configuration of relevant
> resources. This would be handled by yet another extension, a CIB
> replicator service (that would either run permanently or explicitly when
> the admin calls it).
>
> Conceivably, the "inactive" resources may not even be present in the
> active CIB of sites which don't own the token (and be inserted once
> token ownership is established). This may be an (optional) interesting
> feature to keep CIB sizes under control.
>
>
> Andrew, is that about what we discussed? Any comments from anyone else?
> Did I capture what we spoke about at LPC?
>
>
> Regards,
>      Lars
>

Regards,
   Yan
-- 
Yan Gao <ygao at novell.com>
Software Engineer
China Server Team, OPS Engineering, Novell, Inc.