[Pacemaker] [RFC] Automatic nodelist synchronization between corosync and pacemaker

Mon Mar 11 21:44:22 EDT 2013

On Thu, Mar 7, 2013 at 5:30 PM, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
> 07.03.2013 03:37, Andrew Beekhof wrote:
>> On Thu, Mar 7, 2013 at 2:41 AM, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
>>> 06.03.2013 08:35, Andrew Beekhof wrote:
>>
>>>> So basically, you want to be able to add/remove nodes from nodelist.*
>>>> in corosync.conf and have pacemaker automatically add/remove them from
>>>> itself?
>>>
>>> Not corosync.conf, but cmap which is initially (partially) filled with
>>> values from corosync.conf.
>>>
>>>>
>>>> If corosync.conf gets out of sync (admin error or maybe a node was
>>>> down when you updated last) they might well get added back - I assume
>>>> you're ok with that?
>>>> Because there's no real way to know the difference between "added
>>>> back" and "not removed from last time".
>>>
>>> Sorry, can you please reword?
>>
>> When node-A comes up with "node-X" that no-one else has, the cluster
>> has no way to know if node-X was just added, or if the admin forgot to
>> remove it on node-A.
>
> Exactly. That is not a problem if a node does not appear in the CIB until
> it is seen online.

But that is at odds with the "read the corosync nodelist" part.

> If node-A comes up, then it is just booted, and that means
> that it didn't see node-X online yet (if it is not actually online of
> course). And then node-X is not added to CIB.
>
>>
>>>> Or are you planning to never update the on-disk corosync.conf and only
>>>> modify the in-memory nodelist?
>>>
>>> That depends on the actual use case I think.
>>>
>>> Hm. Interesting: how does corosync behave when new dynamic nodes are added
>>> to the cluster? I mean the following: we have a static corosync.conf with a
>>> nodelist containing e.g. 3 entries, then we add a fourth entry via cmap and
>>> boot the fourth node. What should be in corosync.conf of that node?
>>
>> I don't know actually.  Try it and see if it works without the local
>> node being defined?
>>
>>> I believe it
>>> won't work without _its_ fourth entry. Ugh. If so, then the fully
>>> dynamic "elastic" cluster which I was dreaming of is still not possible
>>> out-of-the-box when using a dynamic nodelist.
>>>
>>> The only way to have this I see is to have static nodelist in
>>> corosync.conf with all possible nodes predefined. And never edit it in
>>> cmap. So, my original point
>>> * Remove nodes from CIB when they are removed from a nodelist.
>>> does not fit.
>>>
>>> By elastic I mean what was discussed on corosync list when Fabio started
>>> with votequorum design and what then appeared in votequorum manpage:
>>> ===========
>>> allow_downscale: 1
>>>
>>> Enables allow downscale (AD) feature (default: 0).
>>>
>>> The general behaviour of votequorum is to never decrease expected votes
>>> or quorum.
>>>
>>> When  AD  is  enabled,  both expected votes and quorum are recalculated
>>> when a node leaves the cluster in a clean state (normal corosync  shut-
>>> down process) down to configured expected_votes.
>>
>> But that's very different to removing the node completely.
>> You still want to know it's in a sane state.
>
> Isn't it enough to trust corosync here?

Absolutely not.
"Clean" to corosync means "did corosync on that node send me a message
to say that they planned to exit".
That implies nothing at all about the state of pacemaker or, more
importantly, the cluster resources on that machine.

In addition, "clean" is an internal corosync concept that is not
reported to clients like pacemaker.

> Of course, if it supplies some event like "node X left the cluster in a
> clean state and we lowered expected_votes and quorum".
>
> Clean corosync shutdown means that either 'no more corosync clients
> remain and it was safe to shutdown' or 'corosync has a bug'.
> Pacemaker is corosync client, and corosync should not stop in a clean
> state if pacemaker is still running there.
>
> And 'pacemaker is not running on node X' means that pacemaker instances
> on other nodes accepted that. Otherwise node is scheduled to stonith and
> there is no 'clean' shutdown.
>
> Am I correct here?

Not really, no.

>>
>>> Example use case:
>>>
>>> 1) N node cluster (where N is any value higher than 3)
>>> 2) expected_votes set to 3 in corosync.conf
>>> 3) only 3 nodes are running
>>> 4) admin requires to increase processing power and adds 10 nodes
>>> 5) internal expected_votes is automatically set to 13
>>> 6) minimum expected_votes is 3 (from configuration)
>>> - up to this point this is standard votequorum behavior -
>>> 7) once the work is done, admin wants to remove nodes from the cluster
>>> 8) using an ordered shutdown the admin can reduce the cluster size
>>>    automatically back to 3, but not below 3, where normal quorum
>>>    operation will work as usual.
>>> =============
>>>
>>> What I would expect from pacemaker, is to automatically remove nodes
>>> down to 3 at step 8 (just follow quorum) if AD is enabled AND pacemaker
>>> is instructed to follow that (with some other cmap switch). And also to
>>> reduce number of allocated clone instances. Sure, all nodes must have
>>> equal number of votes (1).
>>>
>>> Is it ok for you?
>>
>> Not really.
>> We simply don't have enough information to do the removal.
>> All we get is "node gone", we have to do a fair bit of work to
>> calculate if it was clean at the time or not (and clean to corosync
>> doesn't always imply clean to pacemaker).
>
> Please see above.
> There is always (at least with mcp model) some time frame between
> pacemaker stop and corosync stop events.

Relying on this would be a world of hurt.
I have too much trouble dealing with timing issues to ever consider
relying on one.

Nodes routinely reappear in virtualised clusters before stonith
reports they have been fenced.
Or peers show up in the CPG membership 5 minutes before appearing
in the cman/quorum membership.

> And pacemaker should accept
> "node leave" after first one (doesn't it mark node as 'pending' in that
> state?). And second event (corosync stop) is enough to remove that
> 'pending' node I think.
>
> And, I think there should be some special message from corosync (or
> better from votequorum)

Please. Stop. votequorum excluding a node's vote is completely
different to removing all trace of the node from the cluster.
One is not a suitable trigger for the other.
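
To make the distinction concrete, here is a minimal sketch (a toy Python
model, not corosync or pacemaker code) of the allow_downscale behaviour
described in the votequorum excerpt quoted above. The class name, its
fields, and the "one vote per node" and "quorum = expected_votes / 2 + 1"
rules are assumptions made for illustration. The point is that a clean
leave lowers expected_votes, but the set of known nodes is never touched.

```python
from dataclasses import dataclass, field


@dataclass
class QuorumModel:
    """Toy model of votequorum with allow_downscale (hypothetical names)."""
    configured_expected_votes: int              # expected_votes in corosync.conf
    allow_downscale: bool = True
    expected_votes: int = field(init=False, default=0)    # runtime value
    known_nodes: set = field(default_factory=set)         # what the cluster remembers
    online_nodes: set = field(default_factory=set)

    def __post_init__(self):
        self.expected_votes = self.configured_expected_votes

    def join(self, node):
        self.known_nodes.add(node)
        self.online_nodes.add(node)
        # Assumption: one vote per node; expected votes grow with membership.
        self.expected_votes = max(self.expected_votes, len(self.online_nodes))

    def leave(self, node, clean):
        self.online_nodes.discard(node)
        if self.allow_downscale and clean:
            # Recalculate, but never go below the configured value.
            self.expected_votes = max(self.configured_expected_votes,
                                      len(self.online_nodes))
        # known_nodes is deliberately left alone: the vote is excluded,
        # the node is not forgotten.

    @property
    def quorum(self):
        # Assumption: simple majority of expected votes.
        return self.expected_votes // 2 + 1


q = QuorumModel(configured_expected_votes=3)
for i in range(13):                      # 3 base nodes plus 10 extra
    q.join(f"node{i}")
print(q.expected_votes, q.quorum)        # 13 7
for i in range(3, 13):                   # ordered shutdown of the extra 10
    q.leave(f"node{i}", clean=True)
print(q.expected_votes, q.quorum)        # 3 2
print(len(q.known_nodes))                # 13
```

In this model the quorum arithmetic shrinks back to 3 while all 13 nodes
remain known, which is why one event cannot stand in for the other.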

> that 'I lowered quorum and expected_votes
> accepting clean leave of node X'. Otherwise we'd have problems when two
> nodes leave at the same time, but one is clean while the other is not. I
> hope the guys who maintain corosync will accept a patch for that if it is
> not possible to get a similar message with the current implementation.
>
>>
>> So back to the start, why do you need pacemaker to forget about the
>> other 10 nodes?
>
> Mostly aesthetic reasons.

This is a massive amount of pain and complexity to create for the
"aesthetic reasons" of (AFAICS) a very uncommon use-case.
Why not simply set value-- for clone-max as part of the shutdown
procedure for your extra nodes?

(The cib actually understands value="value--")
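
As a rough illustration of why lowering clone-max hides the extra
"Stopped" entries, here is a simplified Python model of how many clone
instances exist versus run. The function name and the counting rule are
assumptions distilled from this thread (instances exist for every known
node unless clone-max says otherwise, and at most clone-node-max run per
active node); it is not the policy engine's actual algorithm.

```python
def clone_instances(known_nodes, active_nodes, clone_max=None, clone_node_max=1):
    """Return (running, stopped) instance counts under a simplified model."""
    # Assumption: clone-max defaults to the number of nodes known to the cluster.
    allocated = known_nodes if clone_max is None else clone_max
    # At most clone-node-max instances can run on each active node.
    running = min(allocated, active_nodes * clone_node_max)
    return running, allocated - running


# 8 nodes known, 3 online, clone-max left at its default:
print(clone_instances(known_nodes=8, active_nodes=3))               # (3, 5)
# Same cluster with clone-max lowered to 3 during the shutdown procedure:
print(clone_instances(known_nodes=8, active_nodes=3, clone_max=3))  # (3, 0)
# With clone-max=3, a real failure on one node stands out again:
print(clone_instances(known_nodes=8, active_nodes=2, clone_max=3))  # (2, 1)
```

The three results correspond to the crm_mon listings quoted below: five
idle instances in the first case, none in the second, and exactly one
stopped instance when something is actually wrong.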

> I think about using pacemaker as a backend for private (or semi-private)
> virtualization clouds with varying load. E.g. 10 nodes, 3 of them should
> always run, the remaining 7 are only powered on when peak load is requested
> (some heavy calculations should be done as fast as possible). Power is
> not cheap today, so I'd prefer to have those nodes powered off when they
> are not needed. That peak load is required only 3 days per month.
> So I want the cluster to "breathe" according to user needs.
>
> I agree that nothing prevents that from working right now, and it will
> work, but e.g. crm_mon output will show all allocated but stopped clone
> instances, and it would be much harder for the admin to quickly find real
> problems.
>
> Please look:
> * No extra nodes in CIB, no extra clone instances
> ** Everything is good:
>
>  Clone Set: cl-vlan200-if [vlan200-if]
>      Started: [ v03-b v03-a v03-c ]
>
> ** Real problem, cl-vlan200-if is not running on v03-c
>
>  Clone Set: cl-vlan200-if [vlan200-if]
>      Started: [ v03-b v03-a ]
>      Stopped: [ vlan200-if:2 ]
>
> * Extra nodes in CIB, extra clone instances
> ** Everything is good:
>
>  Clone Set: cl-vlan200-if [vlan200-if]
>      Started: [ v03-b v03-a v03-c ]
>      Stopped: [ vlan200-if:3 vlan200-if:4 vlan200-if:5 vlan200-if:6 vlan200-if:7 ]
>
> ** Real problem, cl-vlan200-if is not running on v03-c
>
>  Clone Set: cl-vlan200-if [vlan200-if]
>      Started: [ v03-b v03-a ]
>      Stopped: [ vlan200-if:2 vlan200-if:3 vlan200-if:4 vlan200-if:5 vlan200-if:6 vlan200-if:7 ]
>
> Do the last two differ much, especially if there are dozens of such lines?
> I think not.
>
> But admin quickly sees problem in the first case. And, that is probably
> related not only to crm_mon, but other management tools as well, because
> there is mostly no generic way to distinguish between 'correctly
> not-running clone instance' and 'accidentally not running instance'
> without involving extra logic which complicates management tools a lot.
>
> Does this make it clearer why I want all that to be implemented?
>
>> (because everything apart from that should already work).
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>> That
>>>>>>> would be OK if number of clone instances does not raise with that...
>>>>>>
>>>>>> Why?  If clone-node-max=1, then you'll never have more than the number
>>>>>> of active nodes - even if clone-max is greater.
>>>>>
>>>>> Active (online) or known (existing in a <nodes> section)?
>>>>> I've seen that as soon as node appears in <nodes> even in offline state,
>>>>> new clone instance is allocated.
>>>>
>>>> $num_known instances will "exist", but only $num_active will be running.
>>>
>>> Yep, that's what I say. I see them in crm_mon or 'crm status' and they
>>> make my life harder ;)
>>> Those remaining instances are "allocated" but not running.
>>>
>>> I can agree that this issue is very "cosmetic" one, but its existence
>>> conflicts with my perfectionism so I'd like to resolve it ;)
>>>
>>>>
>>>>>
>>>>> Also, on one post-1.1.7 cluster with the openais plugin I have 16 nodes
>>>>> configured in totem.interface.members, but only three nodes in the <nodes>
>>>>> CIB section, and I'm able to allocate at least 8-9 instances of clones
>>>>> with clone-max.
>>>>
>>>> Yes, but did you set clone-node-max?  One is the global maximum, the
>>>> other is the per-node maximum.
>>>>
>>>>> I believe that pacemaker does not query
>>>>> totem.interface.members directly with openais plugin,
>>>>
>>>> Correct.
>>>>
>>>>> and
>>>>> runtime.totem.pg.mrp.srp.members has only three nodes.
>>>>> Did that behavior change recently?
>>>>
>>>> No.
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> For node removal we do require crm_node --remove.
>>>>>>>>
>>>>>>>> Is this not sufficient?
>>>>>>>
>>>>>>> I think it would be more straight-forward if there is only one origin of
>>>>>>> membership information for entire cluster stack, so proposal is to
>>>>>>> automatically remove node from CIB when it disappears from corosync
>>>>>>> nodelist (due to removal by admin). That nodelist is not dynamic (read
>>>>>>> from a config and then may be altered with cmapctl).
>>>>>>
>>>>>> Ok, but there still needs to be a trigger.
>>>>>> Otherwise we waste cycles continuously polling corosync for something
>>>>>> that is probably never going to happen.
>>>>>
>>>>> Please see above (cmap_track_add).
>>>>>
>>>>>>
>>>>>> Btw. crm_node doesn't just remove the node from the cib, its existence
>>>>>> is preserved in a number of caches which need to be purged.
>>>>>
>>>>> That could be done in a cmap_track_add's callback function too I think.
>>>>>
>>>>>> It could be possible to have crm_node also use the CMAP API to remove
>>>>>> it from the running corosync, but something would still need to edit
>>>>>> corosync.conf
>>>>>
>>>>> Yes, that is up to the admin.
>>>>> Btw I think more about scenario Fabio explains in votequorum(8) in
>>>>> 'allow_downscale' section - that is the one I'm interested in.
>>>>>
>>>>>>
>>>>>> IIRC, pcs handles all three components (corosync.conf, CMAP, crm_node)
>>>>>> as well as the "add" case.
>>>>>
>>>>> Good to know. But, I'm not ready yet to switch to it.
>>>>>
>>>>>>
>>>>>>> Of course, it is possible to use crm_node to remove node from CIB too
>>>>>>> after it disappeared from corosync, but that is not as elegant as
>>>>>>> automatic one IMHO. And, that should not be very difficult to implement.
>>>>>>>
>>>>>>>>
>>>>>>>>> utilizing
>>>>>>>>> possibilities CMAP and votequorum provide.
>>>>>>>>>
>>>>>>>>> Idea is to:
>>>>>>>>> * Do not add nodes from nodelist to CIB if their join-count in cmap is
>>>>>>>>> zero (but do not touch CIB nodes which exist in a nodelist and have zero
>>>>>>>>> join-count in cmap).
>>>>>>>>> * Install watches on a cmap nodelist.node and
>>>>>>>>> runtime.totem.pg.mrp.srp.members subtrees (cmap_track_add).
>>>>>>>>> * Add missing nodes to CIB as soon as they are both
>>>>>>>>> ** defined in a nodelist
>>>>>>>>> ** their join count becomes non-zero.
>>>>>>>>> * Remove nodes from CIB when they are removed from a nodelist.
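
A minimal Python sketch of the rules listed above, for illustration only.
The function name and its inputs are hypothetical: plain sets and dicts
stand in for the CIB <nodes> section, the cmap nodelist.node subtree, and
the per-node join count under runtime.totem.pg.mrp.srp.members. A real
implementation would react to cmap_track_add callbacks rather than being
called by hand.

```python
def sync_cib_nodes(cib_nodes, nodelist, join_count):
    """Apply the proposed rules once and return the new set of CIB nodes."""
    allowed = set(nodelist)
    # Rule: remove nodes from the CIB when they are removed from the nodelist.
    # Nodes that are still listed stay, even if their join count is zero.
    result = set(cib_nodes) & allowed
    # Rule: add a missing node only once it is both defined in the nodelist
    # and has a non-zero join count (it has actually been seen online).
    for node in allowed:
        if join_count.get(node, 0) > 0:
            result.add(node)
    return result


cib = {"node-a", "node-b"}
nodelist = ["node-a", "node-b", "node-c", "node-x"]
joins = {"node-a": 1, "node-b": 0, "node-c": 2}   # node-x was never seen

cib = sync_cib_nodes(cib, nodelist, joins)
print(sorted(cib))    # ['node-a', 'node-b', 'node-c']

nodelist.remove("node-b")                          # admin removes node-b
cib = sync_cib_nodes(cib, nodelist, joins)
print(sorted(cib))    # ['node-a', 'node-c']
```

Under these rules a stale node-x entry in one node's nodelist never
reaches the CIB unless node-x is actually seen joining.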
>>>>>>>>
>>>>>>>> From _a_ nodelist or _the_ (optional) corosync nodelist?
>>>>>>>
>>>>>>> From the nodelist.node subtree of CMAP tree.
>>>>>>>
>>>>>>>>
>>>>>>>> Because removing a node from the cluster because it shut down is... an
>>>>>>>> interesting idea.
>>>>>>>
>>>>>>> BTW even that could be possible if quorum.allow_downscale is enabled,
>>>>>>> but requires much more thinking and probably more functionality from
>>>>>>> corosync. I'm not ready to comment on that yet though.
>>>>>>
>>>>>> "A node left but I still have quorum" is very different to "a node
>>>>>> left... what node?".
>>>>>> Also, what happens after you fence a node... do we forget about it too?
>>>>>
>>>>> quorum.allow_downscale mandates that it is active only if node leaves
>>>>> the cluster in a clean state.
>>>>>
>>>>> But, from what I know, corosync does not remove a node from
>>>>> nodelist.node either by itself or on request from votequorum, that's why
>>>>> I talk about "more functionality from corosync".
>>>>> If votequorum could distinguish "static" node (listed in config) from
>>>>> "dynamic" node (added on-the-fly), and manage list of "dynamic" ones if
>>>>> allow_downscale is enabled, that would do the trick.
>>>>
>>>> I doubt that request would get very far, but I could be wrong.
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> I was talking about node removal from CMAP's nodelist with the
>>>>>>> corosync_cmapctl command. Of course, absence of the (optional) nodelist
>>>>>>> in CMAP would result in a NOOP because there is no removal event on
>>>>>>> the nodelist.node tree from cmap.
>>>>>>>
>>>>>>>>
>>>>>>>>> Certainly, this requires some CMAP values (especially votequorum ones
>>>>>>>>> and maybe totem mode) to have some 'well-known' values, e.g. only UDPU
>>>>>>>>> mode and quorum.allow_downscale=1, which are yet to be defined.
>>>>>>>>>
>>>>>>>>> Maybe it also makes sense to make this depend on some new CMAP
>>>>>>>>> variable, e.g. nodelist.dynamic=1.
>>>>>>>>>
>>>>>>>>> I would even try to implement this if general agreement is gained and
>>>>>>>>> nobody else wants to implement this.
>>>>>>>>>
>>>>>>>>> Can you please comment on this?
>>>>>>>>>
>>>>>>>>> Vladislav
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>
>>>>> Project Home: http://www.clusterlabs.org
>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>> Bugs: http://bugs.clusterlabs.org