[Pacemaker] Announce: Making Resource Utilization Dynamic

Lars Marowsky-Bree lmb at suse.com
Wed Jun 12 05:01:18 EDT 2013


On 2013-06-05T20:44:56, Michael Schwartzkopff <misch at clusterbau.com> wrote:

Hi Michael,

yes, the idea to make utilization more dynamic was something Andrew and
I looked into ages ago.

In particular, there's still the open issue that it somewhat sucks that one
has to configure them at all. It'd be nice if monitor_0 would "discover"
the memory/CPU values from the VM (for example), populate the CIB
accordingly, and keep those values in sync.
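
Purely as an illustration of what "populate the CIB" could amount to: such
a probe might simply write the detected values into the resource's
utilization attributes, roughly along these lines (assuming crm_resource's
--utilization switch is available in your build; the numbers are made up):

    # Hypothetical: push values detected during the probe into the CIB.
    crm_resource --resource vm1 --utilization \
        --set-parameter cpu --parameter-value 2
    crm_resource --resource vm1 --utilization \
        --set-parameter memory --parameter-value 4096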

Pacemaker is not necessarily the best tool to implement a quick reaction
to changing load, though. The utilization feature is concerned with
*correctness* first - namely, don't overcommit resources severely: in the
Xen/VM case, for example, don't overcommit physical memory (which could
even prevent resources from starting at all), or make sure there's at
least 0.5 CPU cores available per VM, etc.

All without the admin having to figure out the node scores manually.
Ease of configuration and all that.
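
For reference, this is roughly what the admin has to set up statically
today, in crm shell syntax (the attribute names are arbitrary, the numbers
are just examples):

    node node1 \
        utilization cpu="8" memory="32768"
    primitive vm1 ocf:heartbeat:VirtualDomain \
        params config="/etc/libvirt/qemu/vm1.xml" \
        utilization cpu="2" memory="4096"
    property placement-strategy="utilization"

The placement-strategy property is what makes the PE actually take these
numbers into account when placing resources.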

Some constructive feedback:

The dampening in your approach isn't sufficient. This could potentially
cause a reshuffling of resources with every update; even taking into
account that this is possible using live migration, it's going to have a
major performance impact.

I think what you want instead are thresholds; only update the CIB if the
resource utilization stays above XXX for YYY consecutive measurements, so
that the service can be moved to a more powerful server. Lower the
requirements again if the system stays below a fall-back threshold for a
given period. You want to minimize movement. And add a scale factor so you
can allow for some overcommit if desired. [*]
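
To make that more concrete, here's a rough sketch of the hysteresis logic
as it could sit in a monitor action - the thresholds, the state file, the
get_vm_cpu_load helper and the final numbers are all invented for
illustration, and a real version would also need the fall-back direction
and the scale factor:

    # Hypothetical hysteresis sketch, not an existing agent.
    HIGH=80        # load (%) above which we consider raising the CIB value
    HITS_NEEDED=5  # ... but only after this many consecutive monitors
    STATE="/var/run/${OCF_RESOURCE_INSTANCE}-load.hits"

    load=$(get_vm_cpu_load)                    # invented helper
    hits=$(cat "$STATE" 2>/dev/null || echo 0)

    if [ "$load" -gt "$HIGH" ]; then
        hits=$((hits + 1))
    else
        hits=0                                 # reset once we drop below
    fi
    echo "$hits" > "$STATE"

    if [ "$hits" -ge "$HITS_NEEDED" ]; then
        # Only now touch the CIB (and thereby trigger a PE run).
        crm_resource --resource "$OCF_RESOURCE_INSTANCE" --utilization \
            --set-parameter cpu --parameter-value 2
        echo 0 > "$STATE"
    fi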

You also want this to avoid needless PE runs. In your current example,
you're going to cause a PE run for *every* *single* monitor operation on
*any* VM.

And, of course, this should be optional and protected via a
configuration parameter.

But here's the real issue: rising CPU utilization is only a problem if the
service performance suffers in turn. Basically, you don't want to move
resources because their CPU utilization rises, but when the performance
of the services hosted on a node degrades.

Hence, I'd agree that the dynamic load adjustment should best live
outside Pacemaker. At the very least, you'd want to synchronize updating
the load factors of all the VMs at once, so that the PE can shuffle them
once, not repeatedly.

While the data gathering (* as outlined above) could happen in the RA, I
think you need to involve at least something like attrd in dampening the
updates. You don't want each RA to implement its threshold/stepping logic
independently.
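
attrd already has that dampening: a hypothetical agent could hand its raw
measurement to attrd_updater and let the delay there batch the writes,
e.g.

    # Hypothetical: let attrd hold the value for 60s before writing it out,
    # so several VMs reporting in the same window cause only one CIB update.
    attrd_updater -n vm1-cpu-load -U "$load" -d 60

(with the caveat that attrd_updater sets a node attribute, so something
would still have to translate that into the resource's utilization
values).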

Note that all our normal probes - including the nagios ones - are only
concerned with a "healthy"/"failed" dichotomy too. They don't really offer
SLA/response time data, short of 'well duh I timed out'. This could be
something worth adding to a consolidated framework ("yellow" - move me
somewhere else, I'm out of resources here). I have the impression you'd
quickly end up implementing something close to heat/openstack then. Not
that I'm opposed to that ;-)


Regards,
    Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde




