[Pacemaker] large cluster design questions
Dan Frincu
df.cluster at gmail.com
Fri Jan 6 13:22:33 CET 2012
Hi,
On Fri, Jan 6, 2012 at 1:10 PM, Christian Parpart <trapni at gentoo.org> wrote:
> Hey all,
>
> I am also about to evaluate whether or not Pacemaker+Corosync is the
> way to go for our
> infrastructure.
>
> We currently have about 45 physical nodes (plus about 60 more
> virtual containers)
> with a static, historically grown setup of services.
>
> I am now tasked with restructuring this historically grown system
> into something clean and well
> maintainable, with HA and scalability in mind (there is no hurry, we
> have some time to design it).
>
> So here is what we mainly have or will have:
>
> -> HAproxy (tcp/80, tcp/443, master + (hot) failover)
> -> http frontend server(s) (doing SSL and static files, in case of
> performance issues -> clone resource).
> -> Varnish (backend accelerator)
> -> HAproxy (load-balancing backend app)
> -> Rails (app nodes, clones)
> ----------------------------------------------------------------
> - sharded memcache cluster (5 nodes), no failover currently (memcache
> cannot replicate :( )
> - redis nodes
> - mysql (3 nodes: active master, master, slave)
> - Solr (1 master, 2 slaves)
> - resque (many nodes)
> - NFS file storage pool (master/slave DRBD + ext3 fs currently, want
> to use GFS2/OCFS2 however)
>
> Now, I have read a lot about people saying a Pacemaker cluster should not
> exceed 16 nodes, and many
> others saying this statement is bullsh**. While I now lean more toward
> the latter, I still want to know:
>
> is it still wise to build a single Pacemaker/Corosync-driven
> cluster out of all the services above?
There was a question related to large cluster performance which might
be worth reading [1].
As far as Pacemaker is concerned, it has no (theoretical) upper limit
on the number of resources it can handle [1]; however, the lower part of
the stack (messaging and membership) does have a limit. With Corosync,
IIRC, it was ~32 nodes (the maximum number, used for testing scenarios),
and I haven't seen anyone come forward and say they've achieved more
than 32 nodes in a single cluster using Corosync.
Now comes the question: do you really need to have all of the nodes in
the same cluster? Because if the answer is yes, you need to consider
the following:
- any kind of resource failure is managed by relaying the information
to the DC (Designated Controller), which then makes the decisions and
sends out the actions to be taken.
-- roughly translated, this means that for any kind of failure, the DC
will be more heavily loaded than any other node while performing the
required actions. Consider that this node (in your scenario) will also
hold resources, which may be affected by the load. So you may want to
consider not allowing resources to run on the DC (see the sketch after
this list).
- working with resources (listing, modifying, etc.) from nodes that
are not the DC creates additional overhead (not to mention that you
will need to use something other than the crm shell, e.g. the
lower-level crm_resource or cibadmin tools, because on such a large
cluster even just listing resources becomes a performance hit)
- the network layer will have overhead just for CIB synchronization;
even though the process uses diffs, it is still a lot of traffic, which
in turn affects the Corosync timeouts, which you can tune, but then you
also need to take into account increased timeouts on individual
resources, so [more nodes] => [increased traffic] => [tune network
timeouts] => [tune individual timeouts per resource] => [are you
sure?] => [y/n]
- now add STONITH to the above (as you should) and then consider the
administrative overhead of the entire solution
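To make a couple of the points above more concrete, here is a rough crm
shell sketch; the node name, resource names and IPMI details are made up,
so treat it as an illustration rather than a recipe. Also note that the
DC is elected, so as far as I know you cannot pin it to a node directly;
what you can do is keep one node free of regular resources so that,
whenever it is the DC, it has spare capacity:

  # keep a hypothetical resource off the node you want to leave lightly
  # loaded (-inf means "never run here"); repeat for the other resources
  crm configure location l_web_off_mgmt p_web -inf: node-mgmt

  # per-resource operation timeouts may need to grow with cluster size
  crm configure primitive p_mysql ocf:heartbeat:mysql \
      op monitor interval=30s timeout=120s

  # a basic STONITH device (external/ipmi plugin), plus enabling fencing
  # cluster-wide; adjust to whatever fencing hardware you actually have
  crm configure primitive st_node01 stonith:external/ipmi \
      params hostname=node01 ipaddr=192.168.1.101 userid=admin \
      passwd=secret interface=lan \
      op monitor interval=60s
  crm configure property stonith-enabled=true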
The point is that it's not feasible to put everything in one
Pacemaker+Corosync cluster, from many perspectives, even if it could
technically be done.
Of course, this is just my point of view on the matter, but I would
recommend splitting the full 45 nodes into smaller clusters based on
purpose. The way services talk to one another matters mostly at the
network layer: knowing that you can contact the MySQL servers on an IP
address, or a list of IP addresses, is the same whether or not MySQL is
in a cluster. It's still about A contacting B, more or less; I'm trying
to simplify the view of the matter.
>
> One question I also have: when Pacemaker is managing your
> resources and migrates
> one resource from one host (because that one went down) to another,
> then this service should
> actually be able to access all of its data on that node, too.
> Which leads to the assumption that you have to install *everything*
> on every node, to actually be able
> to start anything anywhere (depending on where Pacemaker is about to
> put it and the scores the admin
> has defined).
Yes, if you have 10 services available in a cluster, and you allow all
services to be started on any node, then all nodes must have all 10
services installed, configured and functional.
That is why I suggested splitting the cluster by purpose: this way, on
the MySQL nodes you install and configure whatever MySQL needs, but you
don't have to do the same on the rest of the nodes.
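If you did end up keeping everything in one cluster, one way to avoid
installing every service everywhere is an opt-in (asymmetric) cluster,
where a resource only runs on nodes you explicitly allow. A minimal crm
shell sketch, again with made-up resource and node names:

  # opt-in cluster: resources run nowhere unless a location rule allows it
  crm configure property symmetric-cluster=false

  # allow the hypothetical MySQL resource only on the dedicated DB nodes
  crm configure location l_mysql_on_db01 p_mysql 100: db01
  crm configure location l_mysql_on_db02 p_mysql 100: db02

Even then, every node that can host a service still needs that service's
software installed, which is exactly the argument for separate,
purpose-built clusters.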
One other thing: as I see it, you want an N-to-N cluster, with any one
service being able to run on any node and to fail over to any node.
Consider all of the services that need coordinated access to data, and
now consider that any node in the cluster could possibly run such a
service. That in turn means all the nodes need access to the same
shared data, so you're talking about a GFS2/OCFS2 cluster spanning 45
nodes. I know I have a knack for stating the obvious, but people most
of the time say one thing and think another, so sometimes having it
repeated back by someone other than yourself sheds a different light on
the matter.
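Just to give an idea of what that implies, a cluster filesystem in
Pacemaker is typically built from cloned resources that must run on
every node that mounts it. A very rough crm shell sketch (the device
path and mount point are made up, and the exact set of helper daemons
varies by distribution and GFS2/OCFS2 version):

  # distributed lock manager, cloned across all nodes that mount the fs
  crm configure primitive p_dlm ocf:pacemaker:controld \
      op monitor interval=60s
  crm configure clone cl_dlm p_dlm meta interleave=true

  # the GFS2 filesystem itself, also cloned, started after the DLM
  crm configure primitive p_gfs2 ocf:heartbeat:Filesystem \
      params device=/dev/vg_shared/lv_data directory=/srv/data \
      fstype=gfs2 op monitor interval=20s
  crm configure clone cl_gfs2 p_gfs2 meta interleave=true
  crm configure colocation col_gfs2_dlm inf: cl_gfs2 cl_dlm
  crm configure order ord_dlm_gfs2 inf: cl_dlm cl_gfs2

Multiply that, plus fencing, by 45 nodes and you get an idea of the
membership and recovery traffic involved.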
Bottom line, split the nodes into clusters that match a common purpose.
There's bound to be more input on the matter; this is just my opinion.
HTH,
Dan
[1] http://oss.clusterlabs.org/pipermail/pacemaker/2012-January/012639.html
>
> Many thanks for your thoughts on this,
> Christian.
>
--
Dan Frincu
CCNA, RHCE