[Pacemaker] [RFC] Automatic nodelist synchronization between corosync and pacemaker

Vladislav Bogdanov bubble at hoster-ok.com
Thu Feb 28 06:13:47 UTC 2013


28.02.2013 07:21, Andrew Beekhof wrote:
> On Tue, Feb 26, 2013 at 7:36 PM, Vladislav Bogdanov
> <bubble at hoster-ok.com> wrote:
>> 26.02.2013 11:10, Andrew Beekhof wrote:
>>> On Mon, Feb 18, 2013 at 6:18 PM, Vladislav Bogdanov
>>> <bubble at hoster-ok.com> wrote:
>>>> Hi Andrew, all,
>>>>
>>>> I had an idea last night, that it may be worth implementing
>>>> fully-dynamic cluster resize support in pacemaker,
>>>
>>> We already support nodes being added on the fly.  As soon as they show
>>> up in the membership we add them to the cib.
>>
>> Membership (runtime.totem.pg.mrp.srp.members) or nodelist (nodelist.node)?
> 
> To my knowledge, only one (first) gets updated at runtime.
> Even if nodelist.node could be updated dynamically, we'd have to poll
> or be prompted to find out.

It can, please see the end of cmap_keys(8).
Please also see cmap_track_add(3) and its CMAP_TRACK_PREFIX flag (and my
original message ;) ).
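
Something like this is what I have in mind - a minimal sketch against
the corosync 2.x libcmap API; error handling and the actual CIB updates
are left out, and handle_nodelist_change() is just a name I made up for
illustration:

#include <stdio.h>
#include <corosync/cmap.h>

/* Placeholder: react to nodelist.node.* keys being added/changed/removed. */
static void handle_nodelist_change(cmap_handle_t cmap_handle,
                                   cmap_track_handle_t track_handle,
                                   int32_t event,
                                   const char *key_name,
                                   struct cmap_notify_value new_val,
                                   struct cmap_notify_value old_val,
                                   void *user_data)
{
        switch (event) {
        case CMAP_TRACK_ADD:
                /* node defined in the nodelist -> check its join count,
                 * add it to the CIB only once that becomes non-zero */
                printf("key added: %s\n", key_name);
                break;
        case CMAP_TRACK_DELETE:
                /* node removed from the nodelist -> remove it from the
                 * CIB (and purge the caches, see below) */
                printf("key removed: %s\n", key_name);
                break;
        case CMAP_TRACK_MODIFY:
                printf("key changed: %s\n", key_name);
                break;
        }
}

int main(void)
{
        cmap_handle_t handle;
        cmap_track_handle_t track;

        if (cmap_initialize(&handle) != CS_OK)
                return 1;

        /* CMAP_TRACK_PREFIX: get notified about every key under
         * nodelist.node. - the same could be done for the
         * runtime.totem.pg.mrp.srp.members subtree. */
        cmap_track_add(handle, "nodelist.node.",
                       CMAP_TRACK_ADD | CMAP_TRACK_DELETE |
                       CMAP_TRACK_MODIFY | CMAP_TRACK_PREFIX,
                       handle_nodelist_change, NULL, &track);

        /* A real daemon would feed the fd from cmap_fd_get() into its
         * mainloop; here we just block and dispatch. */
        cmap_dispatch(handle, CS_DISPATCH_BLOCKING);

        cmap_track_delete(handle, track);
        cmap_finalize(handle);
        return 0;
}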

>>
>> I recall that when I migrated from corosync 1.4 to 2.0 (somewhere near
>> pacemaker 1.1.8 release time) and replaced old-style UDPU member list
>> with nodelist.node, I saw all nodes configured in that nodelist appeared
>> in a CIB. For me that was a regression, because with old-style config
>> (and corosync 1.4) CIB contained only nodes seen online (4 of 16).
> 
> That was a loophole that only worked when the entire cluster had been
> down and the <nodes> section was empty.

Aha, that is what I've been hit by.

> People filed bugs explicitly asking for that loophole to be closed
> because it was inconsistent with what the cluster did on every
> subsequent startup.

That is what I'm interested in too. And what I propose should fix that as well.

> 
>> That
>> would be OK if the number of clone instances did not rise with that...
> 
> Why?  If clone-node-max=1, then you'll never have more than the number
> of active nodes - even if clone-max is greater.

Active (online) or known (existing in the <nodes> section)?
I've seen that as soon as a node appears in <nodes>, even in the offline
state, a new clone instance is allocated.

Also, on one post-1.1.7 cluster with the openais plugin I have 16 nodes
configured in totem.interface.members, but only three nodes in the
<nodes> CIB section, and I'm able to allocate at least 8-9 clone
instances with clone-max. I believe pacemaker does not query
totem.interface.members directly with the openais plugin, and
runtime.totem.pg.mrp.srp.members has only three nodes.
Did that behavior change recently?

> 
>>
>>
>>> For node removal we do require crm_node --remove.
>>>
>>> Is this not sufficient?
>>
>> I think it would be more straightforward if there were only one origin
>> of membership information for the entire cluster stack, so the proposal
>> is to automatically remove a node from the CIB when it disappears from
>> the corosync nodelist (due to removal by an admin). That nodelist is not
>> dynamic (it is read from the config and may then be altered with cmapctl).
> 
> Ok, but there still needs to be a trigger.
> Otherwise we waste cycles continuously polling corosync for something
> that is probably never going to happen.

Please see above (cmap_track_add).

> 
> Btw. crm_node doesn't just remove the node from the cib, its existence
> is preserved in a number of caches which need to be purged.

That could be done in cmap_track_add's callback function too, I think.
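
Roughly like this (a fragment that would sit next to the callback
sketch above; purge_node_everywhere() is purely hypothetical - a
stand-in for whatever "crm_node --remove" does internally, i.e.
removing the node from the CIB and from the various peer caches):

#include <string.h>
#include <stdint.h>
#include <corosync/cmap.h>

/* Hypothetical helper: what "crm_node --remove <id>" does today. */
extern void purge_node_everywhere(uint32_t nodeid);

/* Called from the tracking callback for nodelist.node. keys. */
static void maybe_purge_removed_node(int32_t event, const char *key_name,
                                     struct cmap_notify_value old_val)
{
        /* nodelist keys look like "nodelist.node.<idx>.nodeid"; a DELETE
         * of that key means the node definition is gone from corosync. */
        if (event != CMAP_TRACK_DELETE)
                return;
        if (strncmp(key_name, "nodelist.node.", 14) != 0 ||
            strstr(key_name, ".nodeid") == NULL)
                return;

        /* Assuming the deleted value is still delivered in old_val,
         * as cmap_track_add(3) describes. */
        if (old_val.type == CMAP_VALUETYPE_UINT32 &&
            old_val.len == sizeof(uint32_t)) {
                uint32_t nodeid;

                memcpy(&nodeid, old_val.data, sizeof(nodeid));
                purge_node_everywhere(nodeid);
        }
}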

> It could be possible to have crm_node also use the CMAP API to remove
> it from the running corosync, but something would still need to edit
> corosync.conf

Yes, that is up to the admin.
Btw, I'm thinking more about the scenario Fabio explains in the
'allow_downscale' section of votequorum(8) - that is the one I'm
interested in.
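
For reference, this is the kind of corosync.conf setup I have in mind
(allow_downscale itself is described in votequorum(8)):

quorum {
    provider: corosync_votequorum
    allow_downscale: 1
}

plus the usual nodelist { node { ... } } entries that the tracking
sketched above would watch.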

> 
> IIRC, pcs handles all three components (corosync.conf, CMAP, crm_node)
> as well as the "add" case.

Good to know. But I'm not ready to switch to it yet.

> 
>> Of course, it is possible to use crm_node to remove a node from the CIB
>> after it has disappeared from corosync, but that is not as elegant as an
>> automatic removal IMHO. And it should not be very difficult to implement.
>>
>>>
>>>> utilizing
>>>> possibilities CMAP and votequorum provide.
>>>>
>>>> Idea is to:
>>>> * Do not add nodes from nodelist to CIB if their join-count in cmap is
>>>> zero (but do not touch CIB nodes which exist in a nodelist and have zero
>>>> join-count in cmap).
>>>> * Install watches on a cmap nodelist.node and
>>>> runtime.totem.pg.mrp.srp.members subtrees (cmap_track_add).
>>>> * Add missing nodes to the CIB as soon as both:
>>>> ** they are defined in the nodelist, and
>>>> ** their join count becomes non-zero.
>>>> * Remove nodes from CIB when they are removed from a nodelist.
>>>
>>> From _a_ nodelist or _the_ (optional) corosync nodelist?
>>
>> From the nodelist.node subtree of CMAP tree.
>>
>>>
>>> Because removing a node from the cluster because it shut down is... an
>>> interesting idea.
>>
>> BTW even that could be possible if quorum.allow_downscale is enabled,
>> but requires much more thinking and probably more functionality from
>> corosync. I'm not ready to comment on that yet though.
> 
> "A node left but I still have quorum" is very different to "a node
> left... what node?".
> Also, what happens after you fence a node... do we forget about it too?

quorum.allow_downscale mandates that it is active only if a node leaves
the cluster in a clean state.

But, from what I know, corosync does not remove a node from
nodelist.node either by itself or on request from votequorum; that is
why I talk about "more functionality from corosync".
If votequorum could distinguish a "static" node (listed in the config)
from a "dynamic" node (added on-the-fly), and manage the list of
"dynamic" ones when allow_downscale is enabled, that would do the trick.

> 
>>
>> I was talking about node removal from CMAP's nodelist with the
>> corosync_cmapctl command. Of course, the absence of the (optional)
>> nodelist in CMAP would result in a no-op, because there would be no
>> removal event on the nodelist.node tree from cmap.
>>
>>>
>>>> Certainly, this requires some CMAP values (especially votequorum ones
>>>> and maybe the totem mode) to have some 'well-known' values, e.g. only
>>>> UDPU mode and quorum.allow_downscale=1; these are yet to be defined.
>>>>
>>>> Maybe it also makes sense to make this depend on some new CMAP
>>>> variable, e.g. nodelist.dynamic=1.
>>>>
>>>> I would even try to implement this if general agreement is reached and
>>>> nobody else wants to implement it.
>>>>
>>>> Can you please comment on this?
>>>>
>>>> Vladislav




