[Pacemaker] Configuration recommendations for (very?) large cluster

David Vossel dvossel at redhat.com
Tue Aug 12 23:59:05 CEST 2014



----- Original Message -----
> On 12/08/14 07:52, Andrew Beekhof wrote:
> > On 11 Aug 2014, at 10:10 pm, Cédric Dufour - Idiap Research Institute
> > <cedric.dufour at idiap.ch> wrote:
> >
> >> Hello,
> >>
> >> Thanks to Pacemaker 1.1.12, I have been able to setup a (very?) large
> >> cluster:
> > That's certainly up there as one of the biggest :)
> 
> Well, actually, I sized it down from 444 to 277 resources by merging the
> 'VirtualDomain' and 'MailTo' RAs/primitives into a single custom
> 'LibvirtQemu' one.
> The CIB is now ~3MiB uncompressed / ~100kiB compressed. (This also keeps
> the informational-only 'MailTo' RA from burdening the cluster.)
> 'PCMK_ipc_buffer' at 2MiB might be overkill now... but I'd rather stay on
> the safe side.
> 
> Q: Are there adverse effects in keeping 'PCMK_ipc_buffer' high?

More system memory will be required for IPC connections. Unless you're running
low on RAM, you should be fine with the buffer size you set.
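
For reference, a minimal sketch of where that buffer is typically set (on
Debian the file is /etc/default/pacemaker, on RHEL /etc/sysconfig/pacemaker;
it has to be set on every node before Pacemaker starts):

    # /etc/default/pacemaker
    # PCMK_ipc_buffer is given in bytes; 2MiB = 2097152
    PCMK_ipc_buffer=2097152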

> 277 resources are:
>  - 22 (cloned) network-health (ping) resources
>  - 88 (cloned) stonith resources (I have 4 stonith devices)
>  - 167 LibvirtQemu resources (83 "general-purpose" servers and 84 SGE-driven
>  computation nodes)
> (and more LibvirtQemu resources are expected to come)
> 
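As an aside, for anyone reading along: a cloned network-health resource of
that kind is typically defined along these lines in the crm shell (a sketch
only; the resource names and the host_list value are hypothetical):

    primitive p-ping ocf:pacemaker:ping \
        params host_list="192.168.1.1" multiplier="1000" dampen="5s" \
        op monitor interval="30s" timeout="60s"
    clone cl-ping p-ping meta interleave="true"
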
> > Have you checked pacemaker's CPU usage during startup/failover?  I'd be
> > interested in your results.
> 
> I finally set 'batch-limit' to 22 - the number of nodes - as this makes
> sense when enabling a new primitive: all monitor operations get
> dispatched immediately to all nodes at once.
> 
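For reference, that property can be set cluster-wide with crm_attribute,
e.g.:

    # update the batch-limit cluster option in the crm_config section
    crm_attribute --type crm_config --name batch-limit --update 22
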
> When bringing a standby node to life:
> 
>  - On the "waking" node (E5-2690v2): 167+5 resource monitoring operations
>  get dispatched; the CPU load of the 'cib' process remains below 100% as
>  the operations are executed, batched by 22 (though the batching is not
>  visible, as the monitoring operations succeed very quickly), and they
>  complete in ~2 seconds. With Pacemaker 1.1.7, the 'cib' load would have
>  peaked at 100% even before the first monitoring operation started
>  (because of the CIB refresh, I guess) and would have remained so for
>  several tens of seconds (often resulting in timeouts and monitoring
>  operation failures).
> 
>  - On the DC node (E5-2690v2): the CPU also remains below 100%,
>  alternating between the 'cib', 'pengine' and 'crmd' processes. The DC is
>  back to idle within ~4 seconds.
> 
> I tried raising the 'batch-limit' to 50 and witnessed CPU load peaking at
> 100% while carrying out the same procedure, but all went well nonetheless.
> 
> While I still had the ~450 resources, I also "accidentally" brought all 22
> nodes back to life together (well, actually I started the DC alone and then
> started the remaining 21 nodes together). As could be expected, the DC got
> quite busy, dispatching/executing the ~450*22 (roughly 9,900) monitoring
> operations across all nodes. It took 40 minutes for the cluster to
> stabilize. But it did stabilize, with no timeouts and no monitor operation
> failures! A few "high CIB load detected / throttle down mode" messages
> popped up, but all went well.
> 
> Q: Is there a way to favor more powerful nodes for the DC (i.e. to push
> the DC "election" process in a preferred direction)?
> 
> >
> >> Last updated: Mon Aug 11 13:40:14 2014
> >> Last change: Mon Aug 11 13:37:55 2014
> >> Stack: classic openais (with plugin)
> > I would at least try running it with corosync 2.x (no plugin)
> > That will use CPG for messaging which should perform even better.
> 
> I'm running into a deadline now and will have to stick to corosync 1.4.x
> for the moment. But as soon as I can free up an old Intel modular test
> chassis I have around, I'll try backporting corosync 2.x from
> Debian/Experimental to Debian/Wheezy and see how it goes.
> 
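In case it helps when you get there: with corosync 2.x the pacemaker
'service' (plugin) stanza goes away entirely and quorum is handled by
votequorum. A minimal sketch of the corresponding corosync.conf (the
cluster_name below is a placeholder):

    totem {
        version: 2
        cluster_name: beat-cluster
        token: 3000
        transport: udp
    }

    quorum {
        provider: corosync_votequorum
        expected_votes: 22
    }
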
> >
> >> Current DC: bc1hx5a05 - partition with quorum
> >> Version: 1.1.12-561c4cf
> >> 22 Nodes configured, 22 expected votes
> >> 444 Resources configured
> >>
> >> PS: 'corosync' (1.4.7) traffic goes through a 10GbE network, with strict
> >> QoS priority over all other traffic.
> >>
> >> Are there recommended configuration tweaks I should not miss in such
> >> situation?
> >>
> >> So far, I have:
> >> - Raised the 'PCMK_ipc_buffer' size to 2MiB
> >> - Lowered the 'batch-limit' to 10 (though I believe my setup could sustain
> >> the default 30)
> > Yep, definitely worth trying the higher value.
> > We _should_ automatically start throttling ourselves if things get too
> > intense.
> 
> Yep. As mentioned above, I did see "high CIB load detected / throttle down
> mode" messages pop up. Is this what you were thinking of?
> 
> >
> > Other than that, I would be making sure all the corosync.conf timeouts and
> > other settings are appropriate.
> 
> I never paid much attention to it before. But it seems to me the Debian
> defaults are quite conservative, all the more so given my 10GbE (~0.2ms
> latency) interconnect and the care I took in prioritizing corosync traffic
> (thanks to the switches' QoS/GMB and Linux 'tc'):
> 
>     token: 3000
>     token_retransmits_before_loss_const: 10
>     join: 60
>     consensus: 3600
>     vsftype: none
>     max_messages: 20
>     secauth: off
>     amf: disabled
> 
> Am I right?
> 
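For reference, the 'tc' prioritization mentioned above usually looks
something like this (a sketch only; the interface eth0 and corosync's
default mcastport 5405 are assumptions):

    # three-band priority qdisc; band 0 (flowid 1:1) is dequeued first
    tc qdisc add dev eth0 root handle 1: prio
    # classify corosync totem traffic (UDP, dport 5405) into the top band
    tc filter add dev eth0 parent 1: protocol ip u32 \
        match ip protocol 17 0xff \
        match ip dport 5405 0xffff \
        flowid 1:1
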
> PS: this work is being done within the context of the BEAT European
> research project - https://www.beat-eu.org/ - which aims, among other
> things, to "develop an online and open platform to transparently and
> independently evaluate biometric systems against validated benchmarks".
> There will be a publication about the infrastructure setup; if you are
> interested, I can keep you posted.
> 
> Best,
> 
> Cédric
> 
> >
> >> Thank you in advance for your response.
> >>
> >> Best,
> >>
> >> Cédric
> >>
> >> --
> >>
> >> Cédric Dufour @ Idiap Research Institute
> >>
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 


