[Pacemaker] [RFC] Automatic nodelist synchronization between corosync and pacemaker

Vladislav Bogdanov bubble at hoster-ok.com
Mon Mar 11 23:51:09 EDT 2013


12.03.2013 04:44, Andrew Beekhof wrote:
> On Thu, Mar 7, 2013 at 5:30 PM, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
>> 07.03.2013 03:37, Andrew Beekhof wrote:
>>> On Thu, Mar 7, 2013 at 2:41 AM, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
>>>> 06.03.2013 08:35, Andrew Beekhof wrote:
>>>
>>>>> So basically, you want to be able to add/remove nodes from nodelist.*
>>>>> in corosync.conf and have pacemaker automatically add/remove them from
>>>>> itself?
>>>>
>>>> Not corosync.conf, but cmap which is initially (partially) filled with
>>>> values from corosync.conf.
>>>>
>>>>>
>>>>> If corosync.conf gets out of sync (admin error or maybe a node was
>>>>> down when you updated last) they might well get added back - I assume
>>>>> you're ok with that?
>>>>> Because there's no real way to know the difference between "added
>>>>> back" and "not removed from last time".
>>>>
>>>> Sorry, can you please reword?
>>>
>>> When node-A comes up with "node-X" that no-one else has, the cluster
>>> has no way to know if node-X was just added, or if the admin forgot to
>>> remove it on node-A.
>>
>> Exactly - that is not a problem if a node does not appear in the CIB
>> until it is seen online.
> 
> But that is at odds with the "read the corosync nodelist" part.

But not with:
* Do not add nodes from the nodelist to the CIB if their join-count in
cmap is zero (but do not touch CIB nodes which exist in the nodelist and
have a zero join-count in cmap).
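
Roughly, that check could look like the following - a minimal C sketch
against the corosync 2.x cmap API. The key names are the ones
corosync-cmapctl shows; the function itself and its policy are
illustrative only, not existing pacemaker code:

#include <stdio.h>
#include <stdint.h>
#include <corosync/corotypes.h>
#include <corosync/cmap.h>

/* Sketch: should the node at nodelist index 'idx' enter the CIB?
 * Returns 1 if its join-count is non-zero, 0 if it has never joined,
 * -1 if there is no such nodelist entry. */
static int node_should_enter_cib(cmap_handle_t cmap, uint32_t idx)
{
    char key[CMAP_KEYNAME_MAXLEN];
    uint32_t nodeid = 0, join_count = 0;

    snprintf(key, sizeof(key), "nodelist.node.%u.nodeid", idx);
    if (cmap_get_uint32(cmap, key, &nodeid) != CS_OK) {
        return -1;              /* not defined in the nodelist */
    }

    snprintf(key, sizeof(key),
             "runtime.totem.pg.mrp.srp.members.%u.join_count", nodeid);
    if (cmap_get_uint32(cmap, key, &join_count) != CS_OK) {
        join_count = 0;         /* totem has never seen this node */
    }

    return (join_count > 0) ? 1 : 0;
}

A result of 0 means "skip, but do not delete": the node stays out of
the CIB until its join-count becomes non-zero.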


For the rest - I got your position on node deletion. It is not very
common to see so many words from you, so thank you for the interesting
discussion ;)

> 
>> If node-A comes up, then it has just booted, which means that it has
>> not seen node-X online yet (if node-X is not actually online, of
>> course). And so node-X is not added to the CIB.
>>
>>>
>>>>> Or are you planning to never update the on-disk corosync.conf and only
>>>>> modify the in-memory nodelist?
>>>>
>>>> That depends on the actual use case, I think.
>>>>
>>>> Hm. Interesting - how does corosync behave when new dynamic nodes are
>>>> added to the cluster? I mean the following: we have a static
>>>> corosync.conf with a nodelist containing e.g. 3 entries, then we add
>>>> a fourth entry via cmap and boot the fourth node. What should be in
>>>> the corosync.conf of that node?
>>>
>>> I don't know actually.  Try it and see if it works without the local
>>> node being defined?
>>>
>>>> I believe it won't work without _its_ (the fourth) entry being
>>>> defined. Ugh. If so, then the fully dynamic "elastic" cluster I was
>>>> dreaming of is still not possible out-of-the-box when using a dynamic
>>>> nodelist.
>>>>
>>>> The only way I see to have this is a static nodelist in corosync.conf
>>>> with all possible nodes predefined, never edited in cmap. So my
>>>> original point
>>>> * Remove nodes from CIB when they are removed from a nodelist.
>>>> does not fit.
>>>>
>>>> By elastic I mean what was discussed on the corosync list when Fabio
>>>> started with the votequorum design and what then appeared in the
>>>> votequorum manpage:
>>>> ===========
>>>> allow_downscale: 1
>>>>
>>>> Enables allow downscale (AD) feature (default: 0).
>>>>
>>>> The general behaviour of votequorum is to never decrease expected votes
>>>> or quorum.
>>>>
>>>> When AD is enabled, both expected votes and quorum are recalculated
>>>> when a node leaves the cluster in a clean state (normal corosync
>>>> shutdown process), down to the configured expected_votes.
>>>
>>> But that's very different to removing the node completely.
>>> You still want to know it's in a sane state.
>>
>> Isn't it enough to trust corosync here?
> 
> Absolutely not.
> "Clean" to corosync means "did corosync on that node send me a message
> to say that they planned to exit".
> That implies nothing at all about the state of pacemaker or, more
> importantly, the cluster resources on that machine.
> 
> In addition, "clean" is an internal corosync concept that is not
> reported to clients like pacemaker.
> 
>> Of course - if it supplies some event like "node X left the cluster in
>> a clean state and we lowered expected_votes and quorum".
>>
>> A clean corosync shutdown means that either 'no corosync clients
>> remained and it was safe to shut down' or 'corosync has a bug'.
>> Pacemaker is a corosync client, and corosync should not stop in a
>> clean state if pacemaker is still running there.
>>
>> And 'pacemaker is not running on node X' means that the pacemaker
>> instances on the other nodes accepted that. Otherwise the node is
>> scheduled for stonith and there is no 'clean' shutdown.
>>
>> Am I correct here?
> 
> Not really, no.
> 
>>>
>>>> Example use case:
>>>>
>>>> 1) N node cluster (where N is any value higher than 3)
>>>> 2) expected_votes set to 3 in corosync.conf
>>>> 3) only 3 nodes are running
>>>> 4) admin requires to increase processing power and adds 10 nodes
>>>> 5) internal expected_votes is automatically set to 13
>>>> 6) minimum expected_votes is 3 (from configuration)
>>>> - up to this point this is standard votequorum behavior -
>>>> 7) once the work is done, admin wants to remove nodes from the cluster
>>>> 8) using an ordered shutdown the admin can reduce the cluster size
>>>>    automatically back to 3, but not below 3, where normal quorum
>>>>    operation will work as usual.
>>>> =============
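
(For reference: the corosync.conf quorum stanza this scenario assumes
would look roughly like the sketch below; the key names are per
votequorum(8), the values are illustrative.)

quorum {
    provider: corosync_votequorum
    expected_votes: 3
    allow_downscale: 1
}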
>>>>
>>>> What I would expect from pacemaker is to automatically remove nodes
>>>> down to 3 at step 8 (just following quorum) if AD is enabled AND
>>>> pacemaker is instructed to follow it (with some other cmap switch),
>>>> and also to reduce the number of allocated clone instances. Sure, all
>>>> nodes must have an equal number of votes (1).
>>>>
>>>> Is that OK with you?
>>>
>>> Not really.
>>> We simply don't have enough information to do the removal.
>>> All we get is "node gone", we have to do a fair bit of work to
>>> calculate if it was clean at the time or not (and clean to corosync
>>> doesn't always imply clean to pacemaker).
>>
>> Please see above.
>> There is always (at least with the MCP model) some time frame between
>> the pacemaker-stop and corosync-stop events.
> 
> Relying on this would be a world of hurt.
> I have enough trouble dealing with timing issues to ever consider
> relying on one.
> 
> Nodes routinely reappear in virtualised clusters before stonith
> reports they have been fenced.
> Or peers that show up in the CPG membership 5 minutes before appearing
> in the cman/quorum membership.
> 
>> And pacemaker should accept a "node leave" after the first one
>> (doesn't it mark the node as 'pending' in that state?). And the second
>> event (corosync stop) is enough to remove that 'pending' node, I
>> think.
>>
>> And I think there should be some special message from corosync (or,
>> better, from votequorum)
> 
> Please. Stop. votequorum excluding a node's vote is completely
> different to removing all trace of the node from the cluster.
> One is not a suitable trigger for the other.
> 
>> that says 'I lowered quorum and expected_votes, accepting the clean
>> leave of node X'. Otherwise we'd have problems when two nodes leave at
>> the same time but one is clean while the other is not. I hope the guys
>> who maintain corosync will accept a patch for that if it is not
>> possible to get a similar message with the current implementation.
>>
>>>
>>> So back to the start, why do you need pacemaker to forget about the
>>> other 10 nodes?
>>
>> Mostly aesthetic reasons.
> 
> This is a massive amount of pain and complexity to create for the
> "aesthetic reasons" of (AFAICS) a very uncommon use-case.
> Why not simply set value-- for clone-max as part of the shutdown
> procedure for your extra nodes?
> 
> (The cib actually understands value="value--")
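
For example, roughly the following as part of the extra nodes'
shutdown procedure (reusing the cl-vlan200-if clone from the output
further below; flags per crm_resource(8), and it is an assumption here
that the "value--" shorthand passes through crm_resource to the CIB
unmodified):

crm_resource --resource cl-vlan200-if --meta \
    --set-parameter clone-max --parameter-value value--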
> 
>> I am thinking about using pacemaker as a backend for private (or
>> semi-private) virtualization clouds with varying load. E.g. 10 nodes,
>> 3 of them always running, the remaining 7 only powered on when peak
>> load requires it (some heavy calculations that should be done as fast
>> as possible). Power is not cheap these days, so I'd prefer to have
>> those nodes powered off when they are not needed. That peak load is
>> required only 3 days per month, so I want the cluster to "breathe"
>> along with user needs.
>>
>> I agree that nothing prevents this from working right now, but e.g.
>> crm_mon output will show all of the allocated-but-stopped clone
>> instances, and it would be much harder for an admin to quickly spot
>> real problems.
>>
>> Please look:
>> * No extra nodes in CIB, no extra clone instances
>> ** Everything is good:
>>
>>  Clone Set: cl-vlan200-if [vlan200-if]
>>      Started: [ v03-b v03-a v03-c ]
>>
>> ** Real problem, cl-vlan200-if is not running on v03-c
>>
>>  Clone Set: cl-vlan200-if [vlan200-if]
>>      Started: [ v03-b v03-a ]
>>      Stopped: [ vlan200-if:2 ]
>>
>> * Extra nodes in CIB, extra clone instances
>> ** Everything is good:
>>
>>  Clone Set: cl-vlan200-if [vlan200-if]
>>      Started: [ v03-b v03-a v03-c ]
>>      Stopped: [ vlan200-if:3 vlan200-if:4 vlan200-if:5 vlan200-if:6
>> vlan200-if:7 ]
>>
>> ** Real problem, cl-vlan200-if is not running on v03-c
>>
>>  Clone Set: cl-vlan200-if [vlan200-if]
>>      Started: [ v03-b v03-a ]
>>      Stopped: [ vlan200-if:2 vlan200-if:3 vlan200-if:4 vlan200-if:5
>> vlan200-if:6 vlan200-if:7 ]
>>
>> Do the last two outputs differ much, especially when there are dozens
>> of such lines? I think not.
>>
>> But the admin quickly sees the problem in the first case. And that
>> probably applies not only to crm_mon but to other management tools as
>> well, because there is mostly no generic way to distinguish a
>> 'correctly not-running clone instance' from an 'accidentally
>> not-running instance' without extra logic, which complicates
>> management tools a lot.
>>
>> Does this make it clearer why I want all of that to be implemented?
>>
>>> (because everything apart from that should already work).
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> That
>>>>>>>> would be OK if the number of clone instances does not grow with that...
>>>>>>>
>>>>>>> Why?  If clone-node-max=1, then you'll never have more than the number
>>>>>>> of active nodes - even if clone-max is greater.
>>>>>>
>>>>>> Active (online) or known (existing in the <nodes> section)?
>>>>>> I've seen that as soon as a node appears in <nodes>, even in an
>>>>>> offline state, a new clone instance is allocated.
>>>>>
>>>>> $num_known instances will "exist", but only $num_active will be running.
>>>>
>>>> Yep, that's what I'm saying. I see them in crm_mon or 'crm status'
>>>> and they make my life harder ;)
>>>> Those remaining instances are "allocated" but not running.
>>>>
>>>> I can agree that this issue is a very "cosmetic" one, but its
>>>> existence conflicts with my perfectionism, so I'd like to resolve it ;)
>>>>
>>>>>
>>>>>>
>>>>>> Also, on one cluster running post-1.1.7 with the openais plugin I
>>>>>> have 16 nodes configured in totem.interface.members but only three
>>>>>> nodes in the <nodes> CIB section, and I'm able to allocate at least
>>>>>> 8-9 clone instances with clone-max.
>>>>>
>>>>> Yes, but did you set clone-node-max?  One is the global maximum, the
>>>>> other is the per-node maximum.
>>>>>
>>>>>> I believe that pacemaker does not query
>>>>>> totem.interface.members directly with the openais plugin,
>>>>>
>>>>> Correct.
>>>>>
>>>>>> and
>>>>>> runtime.totem.pg.mrp.srp.members has only three nodes.
>>>>>> Did that behavior change recently?
>>>>>
>>>>> No.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> For node removal we do require crm_node --remove.
>>>>>>>>>
>>>>>>>>> Is this not sufficient?
>>>>>>>>
>>>>>>>> I think it would be more straightforward if there were only one
>>>>>>>> origin of membership information for the entire cluster stack, so
>>>>>>>> the proposal is to automatically remove a node from the CIB when it
>>>>>>>> disappears from the corosync nodelist (due to removal by the admin).
>>>>>>>> That nodelist is not dynamic (it is read from a config and may then
>>>>>>>> be altered with cmapctl).
>>>>>>>
>>>>>>> Ok, but there still needs to be a trigger.
>>>>>>> Otherwise we waste cycles continuously polling corosync for something
>>>>>>> that is probably never going to happen.
>>>>>>
>>>>>> Please see above (cmap_track_add).
>>>>>>
>>>>>>>
>>>>>>> Btw. crm_node doesn't just remove the node from the cib, its existence
>>>>>>> is preserved in a number of caches which need to be purged.
>>>>>>
>>>>>> That could be done in a cmap_track_add callback function too, I think.
>>>>>>
>>>>>>> It could be possible to have crm_node also use the CMAP API to remove
>>>>>>> it from the running corosync, but something would still need to edit
>>>>>>> corosync.conf
>>>>>>
>>>>>> Yes, that is up to the admin.
>>>>>> Btw, I am thinking more of the scenario Fabio explains in the
>>>>>> 'allow_downscale' section of votequorum(8) - that is the one I'm
>>>>>> interested in.
>>>>>>
>>>>>>>
>>>>>>> IIRC, pcs handles all three components (corosync.conf, CMAP, crm_node)
>>>>>>> as well as the "add" case.
>>>>>>
>>>>>> Good to know. But I'm not ready to switch to it yet.
>>>>>>
>>>>>>>
>>>>>>>> Of course, it is possible to use crm_node to remove the node from
>>>>>>>> the CIB after it has disappeared from corosync, but that is not as
>>>>>>>> elegant as an automatic mechanism IMHO. And it should not be very
>>>>>>>> difficult to implement.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> utilizing the
>>>>>>>>>> possibilities CMAP and votequorum provide.
>>>>>>>>>>
>>>>>>>>>> The idea is to:
>>>>>>>>>> * Do not add nodes from the nodelist to the CIB if their
>>>>>>>>>> join-count in cmap is zero (but do not touch CIB nodes which
>>>>>>>>>> exist in the nodelist and have a zero join-count in cmap).
>>>>>>>>>> * Install watches on the cmap nodelist.node and
>>>>>>>>>> runtime.totem.pg.mrp.srp.members subtrees (cmap_track_add; see
>>>>>>>>>> the sketch after this list).
>>>>>>>>>> * Add missing nodes to the CIB as soon as they are both
>>>>>>>>>> ** defined in the nodelist, and
>>>>>>>>>> ** their join-count becomes non-zero.
>>>>>>>>>> * Remove nodes from CIB when they are removed from a nodelist.
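
A minimal sketch of the watch from the second bullet above, against
the corosync 2.x cmap API (cmap_track_add with CMAP_TRACK_PREFIX);
purge_node_from_cib() is a hypothetical helper standing in for what
"crm_node --remove" does internally:

#include <string.h>
#include <corosync/corotypes.h>
#include <corosync/cmap.h>

/* Hypothetical helper: drop the node from the CIB and purge the
 * membership caches, as "crm_node --remove" does. */
extern void purge_node_from_cib(const char *key_name);

static void nodelist_notify_fn(cmap_handle_t handle,
                               cmap_track_handle_t track,
                               int32_t event,
                               const char *key_name,
                               struct cmap_notify_value new_val,
                               struct cmap_notify_value old_val,
                               void *user_data)
{
    /* React only to nodelist entries being deleted, e.g. by
     * "corosync-cmapctl -D nodelist.node.3." */
    if (event == CMAP_TRACK_DELETE &&
        strstr(key_name, ".ring0_addr") != NULL) {
        purge_node_from_cib(key_name);
    }
}

static cs_error_t install_nodelist_watch(cmap_handle_t handle,
                                         cmap_track_handle_t *track)
{
    /* Watch the whole nodelist.node. subtree for key additions,
     * modifications and deletions. */
    return cmap_track_add(handle, "nodelist.node.",
                          CMAP_TRACK_ADD | CMAP_TRACK_DELETE |
                          CMAP_TRACK_MODIFY | CMAP_TRACK_PREFIX,
                          nodelist_notify_fn, NULL, track);
}

Track events are delivered via cmap_dispatch() on the connection's
file descriptor (cmap_fd_get()), so this hooks into an existing
mainloop without the continuous polling mentioned further down.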
>>>>>>>>>
>>>>>>>>> From _a_ nodelist or _the_ (optional) corosync nodelist?
>>>>>>>>
>>>>>>>> From the nodelist.node subtree of the CMAP tree.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Because removing a node from the cluster because it shut down is... an
>>>>>>>>> interesting idea.
>>>>>>>>
>>>>>>>> BTW, even that could be possible if quorum.allow_downscale is
>>>>>>>> enabled, but it requires much more thought and probably more
>>>>>>>> functionality from corosync. I'm not ready to comment on that yet,
>>>>>>>> though.
>>>>>>>
>>>>>>> "A node left but I still have quorum" is very different to "a node
>>>>>>> left... what node?".
>>>>>>> Also, what happens after you fence a node... do we forget about it too?
>>>>>>
>>>>>> quorum.allow_downscale mandates that it is active only if a node
>>>>>> leaves the cluster in a clean state.
>>>>>>
>>>>>> But, from what I know, corosync does not remove a node from
>>>>>> nodelist.node, neither by itself nor on request from votequorum;
>>>>>> that's why I speak of "more functionality from corosync".
>>>>>> If votequorum could distinguish "static" nodes (listed in the
>>>>>> config) from "dynamic" nodes (added on-the-fly), and manage the
>>>>>> list of "dynamic" ones when allow_downscale is enabled, that would
>>>>>> do the trick.
>>>>>
>>>>> I doubt that request would get very far, but I could be wrong.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> I was talking about node removal from CMAP's nodelist with the
>>>>>>>> corosync_cmapctl command. Of course, the absence of the (optional)
>>>>>>>> nodelist in CMAP would result in a NOOP, because then there is no
>>>>>>>> removal event on the nodelist.node tree from cmap.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Certainly, this requires some CMAP values (especially the
>>>>>>>>>> votequorum ones, and maybe the totem mode) to have 'well-known'
>>>>>>>>>> values, e.g. UDPU mode only and quorum.allow_downscale=1; these
>>>>>>>>>> are yet to be defined.
>>>>>>>>>>
>>>>>>>>>> Maybe it also makes sense to make this depend on some new CMAP
>>>>>>>>>> variable, e.g. nodelist.dynamic=1.
>>>>>>>>>>
>>>>>>>>>> I would even try to implement this if general agreement is
>>>>>>>>>> reached and nobody else wants to do it.
>>>>>>>>>>
>>>>>>>>>> Can you please comment on this?
>>>>>>>>>>
>>>>>>>>>> Vladislav