[Pacemaker] [RFC] Automatic nodelist synchronization between corosync and pacemaker

Wed Mar 6 00:35:00 EST 2013

On Thu, Feb 28, 2013 at 5:13 PM, Vladislav Bogdanov
<bubble at hoster-ok.com> wrote:
> 28.02.2013 07:21, Andrew Beekhof wrote:
>> On Tue, Feb 26, 2013 at 7:36 PM, Vladislav Bogdanov
>> <bubble at hoster-ok.com> wrote:
>>> 26.02.2013 11:10, Andrew Beekhof wrote:
>>>> On Mon, Feb 18, 2013 at 6:18 PM, Vladislav Bogdanov
>>>> <bubble at hoster-ok.com> wrote:
>>>>> Hi Andrew, all,
>>>>>
>>>>> I had an idea last night, that it may be worth implementing
>>>>> fully-dynamic cluster resize support in pacemaker,
>>>>
>>>> We already support nodes being added on the fly.  As soon as they show
>>>> up in the membership we add them to the cib.
>>>
>>> Membership (runtime.totem.pg.mrp.srp.members) or nodelist (nodelist.node)?
>>
>> To my knowledge, only the first one gets updated at runtime.
>> Even if nodelist.node could be updated dynamically, we'd have to poll
>> or be prompted to find out.
>
> It can; please see the end of cmap_keys(8).
> Please also see cmap_track_add(3) for CMAP_TRACK_PREFIX flag (and my
> original message ;) ).

ACK :)
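
For readers following along, here is a rough sketch of why prefix
tracking removes the need to poll. It is a toy in-memory model in Python,
not the corosync C API: TrackedMap, track_prefix and the event strings
are hypothetical names, loosely mirroring what cmap_track_add(3) with
CMAP_TRACK_PREFIX is described as doing.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Toy model only: TrackedMap and its methods are hypothetical and are NOT
# part of corosync. They imitate prefix-based change notification.
Callback = Callable[[str, str, Optional[str], Optional[str]], None]


class TrackedMap:
    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._trackers: List[Tuple[str, Callback]] = []

    def track_prefix(self, prefix: str, callback: Callback) -> None:
        """Register a callback for every key that starts with prefix."""
        self._trackers.append((prefix, callback))

    def _notify(self, event: str, key: str,
                old: Optional[str], new: Optional[str]) -> None:
        for prefix, callback in self._trackers:
            if key.startswith(prefix):
                callback(event, key, old, new)

    def set(self, key: str, value: str) -> None:
        old = self._data.get(key)
        self._data[key] = value
        self._notify("add" if old is None else "modify", key, old, value)

    def delete(self, key: str) -> None:
        old = self._data.pop(key, None)
        if old is not None:
            self._notify("delete", key, old, None)


events: List[Tuple[str, str]] = []
cmap = TrackedMap()
# Watch the whole nodelist.node subtree; nothing is ever polled.
cmap.track_prefix("nodelist.node.",
                  lambda ev, key, old, new: events.append((ev, key)))

cmap.set("nodelist.node.0.ring0_addr", "10.0.0.1")  # matches the prefix
cmap.set("totem.token", "3000")                     # ignored by the tracker
cmap.delete("nodelist.node.0.ring0_addr")           # matches the prefix
```

The callback only fires for keys under the watched prefix, so a consumer
can stay idle until the nodelist actually changes.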

>
>>>
>>> I recall that when I migrated from corosync 1.4 to 2.0 (somewhere near
>>> pacemaker 1.1.8 release time) and replaced the old-style UDPU member
>>> list with nodelist.node, all nodes configured in that nodelist appeared
>>> in the CIB. For me that was a regression, because with the old-style
>>> config (and corosync 1.4) the CIB contained only nodes seen online
>>> (4 of 16).
>>
>> That was a loophole that only worked when the entire cluster had been
>> down and the <nodes> section was empty.
>
> Aha, that is what I've been hit by.
>
>> People filed bugs explicitly asking for that loophole to be closed
>> because it was inconsistent with what the cluster did on every
>> subsequent startup.
>
> That is what I'm interested in too, and what I propose should fix it
> as well.

Ah, I must have misparsed, I thought you were looking for the opposite
behaviour.

So basically, you want to be able to add/remove nodes from nodelist.*
in corosync.conf and have pacemaker automatically add/remove them from
itself?

If corosync.conf gets out of sync (admin error or maybe a node was
down when you updated last) they might well get added back - I assume
you're ok with that?
Because there's no real way to know the difference between "added
back" and "not removed from last time".

Or are you planning to never update the on-disk corosync.conf and only
modify the in-memory nodelist?
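
To make the question concrete, here is a minimal Python sketch of the
add/remove rules as proposed at the start of this thread. The function
name reconcile and the node names are made up for illustration, and the
inputs are plain sets and dicts rather than anything read from cmap or
the CIB.

```python
from typing import Dict, Set, Tuple


def reconcile(cib_nodes: Set[str], nodelist: Set[str],
              join_count: Dict[str, int]) -> Tuple[Set[str], Set[str]]:
    """Return (nodes to add to the CIB, nodes to remove from the CIB).

    Hypothetical helper implementing the proposed rules:
    - add a node only if it is in the nodelist AND has joined at least once;
    - remove a node only if it is no longer in the nodelist;
    - leave alone CIB nodes that are in the nodelist with a zero join count.
    """
    to_add = {n for n in nodelist
              if n not in cib_nodes and join_count.get(n, 0) > 0}
    to_remove = {n for n in cib_nodes if n not in nodelist}
    return to_add, to_remove


cib = {"n1", "n2", "n3"}
nodelist = {"n1", "n2", "n4", "n5"}          # admin removed n3, added n4/n5
joins = {"n1": 1, "n2": 0, "n4": 1}          # n5 has never joined

to_add, to_remove = reconcile(cib, nodelist, joins)

# A node that was down during the update still lists n3 in its nodelist,
# so from its point of view there is nothing to remove.
stale_add, stale_remove = reconcile(cib, nodelist | {"n3"}, joins)
```

With the up-to-date nodelist, n4 is added and n3 is removed, while n5
waits until it joins. With the stale nodelist n3 is never removed, which
is the "not removed from last time" case above.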

>
>>
>>> That would be OK if the number of clone instances did not rise
>>> with that...
>>
>> Why?  If clone-node-max=1, then you'll never have more than the number
>> of active nodes - even if clone-max is greater.
>
> Active (online) or known (existing in the <nodes> section)?
> I've seen that as soon as a node appears in <nodes>, even in offline
> state, a new clone instance is allocated.

$num_known instances will "exist", but only $num_active will be running.

>
> Also, on one post-1.1.7 cluster with the openais plugin I have 16 nodes
> configured in totem.interface.members, but only three nodes in the
> <nodes> CIB section, and I'm able to allocate at least 8-9 instances of
> clones with clone-max.

Yes, but did you set clone-node-max?  One is the global maximum, the
other is the per-node maximum.
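
As a back-of-the-envelope illustration of how the two limits interact,
here is a deliberately simplified Python model. The function name is made
up, and it ignores constraints, failures and everything else the policy
engine takes into account; it only captures "global cap versus per-node
cap".

```python
def max_running_instances(clone_max: int, clone_node_max: int,
                          active_nodes: int) -> int:
    """Upper bound on running clone instances in this simplified model.

    clone_max is the cluster-wide cap, clone_node_max the per-node cap.
    Hypothetical helper, not pacemaker code.
    """
    return min(clone_max, clone_node_max * active_nodes)


# Three active nodes, clone-max=9:
one_per_node = max_running_instances(9, 1, 3)    # per-node cap dominates
three_per_node = max_running_instances(9, 3, 3)  # global cap is reached
```

So with clone-node-max=1 the number of active nodes is the limit, and
raising clone-max alone does not start more instances.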

> I believe that pacemaker does not query
> totem.interface.members directly with openais plugin,

Correct.

> and
> runtime.totem.pg.mrp.srp.members has only three nodes.
> Did that behavior change recently?

No.

>
>>
>>>
>>>
>>>> For node removal we do require crm_node --remove.
>>>>
>>>> Is this not sufficient?
>>>
>>> I think it would be more straightforward if there were only one source
>>> of membership information for the entire cluster stack, so the proposal
>>> is to automatically remove a node from the CIB when it disappears from
>>> the corosync nodelist (due to removal by the admin). That nodelist is
>>> not dynamic (it is read from the config and may then be altered with
>>> corosync-cmapctl).
>>
>> Ok, but there still needs to be a trigger.
>> Otherwise we waste cycles continuously polling corosync for something
>> that is probably never going to happen.
>
> Please see above (cmap_track_add).
>
>>
>> Btw. crm_node doesn't just remove the node from the cib: the node's
>> existence is also preserved in a number of caches which need to be
>> purged.
>
> That could be done in a cmap_track_add callback function too, I think.
>
>> It could be possible to have crm_node also use the CMAP API to remove
>> it from the running corosync, but something would still need to edit
>> corosync.conf
>
> Yes, that is up to the admin.
> Btw, I am thinking more about the scenario Fabio describes in the
> 'allow_downscale' section of votequorum(8); that is the one I'm
> interested in.
>
>>
>> IIRC, pcs handles all three components (corosync.conf, CMAP, crm_node)
>> as well as the "add" case.
>
> Good to know. But, I'm not ready yet to switch to it.
>
>>
>>> Of course, it is possible to use crm_node to remove a node from the CIB
>>> after it has disappeared from corosync, but that is not as elegant as
>>> automatic removal IMHO. And that should not be very difficult to
>>> implement.
>>>
>>>>
>>>>> utilizing
>>>>> the possibilities CMAP and votequorum provide.
>>>>>
>>>>> Idea is to:
>>>>> * Do not add nodes from nodelist to CIB if their join-count in cmap is
>>>>> zero (but do not touch CIB nodes which exist in a nodelist and have zero
>>>>> join-count in cmap).
>>>>> * Install watches on a cmap nodelist.node and
>>>>> runtime.totem.pg.mrp.srp.members subtrees (cmap_track_add).
>>>>> * Add missing nodes to CIB as soon as both conditions hold:
>>>>> ** they are defined in a nodelist
>>>>> ** their join count becomes non-zero.
>>>>> * Remove nodes from CIB when they are removed from a nodelist.
>>>>
>>>> From _a_ nodelist or _the_ (optional) corosync nodelist?
>>>
>>> From the nodelist.node subtree of CMAP tree.
>>>
>>>>
>>>> Because removing a node from the cluster because it shut down is... an
>>>> interesting idea.
>>>
>>> BTW even that could be possible if quorum.allow_downscale is enabled,
>>> but requires much more thinking and probably more functionality from
>>> corosync. I'm not ready to comment on that yet though.
>>
>> "A node left but I still have quorum" is very different to "a node
>> left... what node?".
>> Also, what happens after you fence a node... do we forget about it too?
>
> quorum.allow_downscale mandates that it is active only if a node leaves
> the cluster in a clean state.
>
> But, from what I know, corosync does not remove a node from
> nodelist.node, either by itself or on request from votequorum; that's
> why I talk about "more functionality from corosync".
> If votequorum could distinguish a "static" node (listed in the config)
> from a "dynamic" node (added on-the-fly), and manage the list of
> "dynamic" ones if allow_downscale is enabled, that would do the trick.

I doubt that request would get very far, but I could be wrong.

>
>>
>>>
>>> I was talking about node removal from the CMAP nodelist with the
>>> corosync-cmapctl command. Of course, absence of the (optional) nodelist
>>> in CMAP would result in a NOOP, because there is no removal event on
>>> the nodelist.node tree from cmap.
>>>
>>>>
>>>>> Certainly, this requires some CMAP values (especially votequorum ones
>>>>> and maybe the totem mode) to have some 'well-known' values, e.g. only
>>>>> UDPU mode and quorum.allow_downscale=1; these are yet to be defined.
>>>>>
>>>>> Maybe it also makes sense to make this depend on some new CMAP
>>>>> variable, e.g. nodelist.dynamic=1.
>>>>>
>>>>> I would even try to implement this if general agreement is gained and
>>>>> nobody else wants to implement this.
>>>>>
>>>>> Can you please comment on this?
>>>>>
>>>>> Vladislav