[ClusterLabs] Antw: Antw: notice: throttle_handle_load: High CPU load detected

Tue Mar 29 15:22:00 CEST 2016

Ken, thank you for the answer.

Every node in my cluster has a "load average" of about 420 under normal
conditions, mainly because of the high disk I/O on the system.
My system is designed to use almost 100% of its hardware (CPU/RAM/disks),
so it is normal for it to consume almost all HW resources.
I would like to get rid of the "High CPU load detected" messages in the
log, because they flood corosync.log as well as the system journal.

Could you advise what the best way to do that would be?

So far I have come up with the idea of setting "load-threshold" to 1000%,
because:
    420 (load average) / 24 (cores) = 17.5 (adjusted_load);
    2 (THROTTLE_FACTOR_HIGH) * 10 (throttle_load_target) = 20
Since 17.5 is below 20, this check would no longer trigger:

    if(adjusted_load > THROTTLE_FACTOR_HIGH * throttle_load_target) {
        crm_notice("High %s detected: %f", desc, load);
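
To make the arithmetic above concrete, here is a minimal sketch in C. It is
not Pacemaker code: the helper names are made up for illustration, and the
factor of 2 is simply the value used in my calculation above.

```c
#include <assert.h>
#include <stdbool.h>

/* Multiplier from the calculation above (assumed value, not re-checked
 * against the Pacemaker source). */
#define THROTTLE_FACTOR_HIGH 2.0

/* Hypothetical helper: load average divided by the number of cores. */
static double adjusted_load(double load_average, unsigned int cores)
{
    assert(cores > 0);
    return load_average / cores;
}

/* Hypothetical helper: true if the "High ... detected" branch would be
 * taken. load_target is load-threshold as a fraction (1000% -> 10.0). */
static bool is_high_load(double load_average, unsigned int cores,
                         double load_target)
{
    return adjusted_load(load_average, cores)
           > THROTTLE_FACTOR_HIGH * load_target;
}
```

With a load average of 420 on 24 cores, is_high_load() is true for the
default target of 0.8 (80%) and false for a target of 10.0 (1000%).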

In this case, do I need to set "node-action-limit" to something less than
"2 x cores" (which is the default)?
I ask because the logic is (crmd/throttle.c):

    switch(r->mode) {
        case throttle_extreme:
        case throttle_high:
            jobs = 1; /* At least one job must always be allowed */
            break;
        case throttle_med:
            jobs = QB_MAX(1, r->max / 4);
            break;
        case throttle_low:
            jobs = QB_MAX(1, r->max / 2);
            break;
        case throttle_none:
            jobs = QB_MAX(1, r->max);
            break;
        default:
            crm_err("Unknown throttle mode %.4x on %s", r->mode, node);
            break;
    }
    return jobs;
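
To show what this means in numbers, here is a small standalone sketch of the
same mapping. The enum and function names are hypothetical stand-ins, not the
Pacemaker definitions, and the fallback return of 1 for an unknown mode is my
own assumption.

```c
#include <assert.h>

/* Hypothetical stand-in for the throttle modes in the excerpt above. */
enum throttle_mode_sketch {
    MODE_NONE, MODE_LOW, MODE_MED, MODE_HIGH, MODE_EXTREME
};

static int max_int(int a, int b)
{
    return a > b ? a : b;
}

/* Returns the number of jobs allowed for a node in the given mode,
 * where max_jobs plays the role of r->max in the excerpt. */
static int job_limit_sketch(enum throttle_mode_sketch mode, int max_jobs)
{
    assert(max_jobs >= 1);
    switch (mode) {
        case MODE_EXTREME:
        case MODE_HIGH:
            return 1;                        /* at least one job is always allowed */
        case MODE_MED:
            return max_int(1, max_jobs / 4);
        case MODE_LOW:
            return max_int(1, max_jobs / 2);
        case MODE_NONE:
            return max_int(1, max_jobs);
    }
    return 1; /* assumption: treat an unknown mode conservatively */
}
```

For 24 cores and the default limit of "2 x cores" (48), this gives 1 job in
high/extreme mode, 12 in medium, 24 in low and 48 with no throttling.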

The thing is, I know that there is "High CPU load" and that this is the
normal state, but I want Pacemaker to stop reporting it and to handle this
state as well as it can.

Thank you,
Kostia

On Mon, Mar 14, 2016 at 7:18 PM, Ken Gaillot <kgaillot at redhat.com> wrote:

> On 02/29/2016 07:00 AM, Kostiantyn Ponomarenko wrote:
> > I am back to this question =)
> >
> > I am still trying to understand the impact of "High CPU load detected"
> > messages in the log.
> > Looking at the code, I figured out that setting the "load-threshold"
> > parameter to something higher than 100% solves the problem.
> > And actually, for 8 cores (12 with Hyper-Threading), load-threshold=400%
> > kind of works.
> >
> > Also, I noticed that this parameter may have an impact on "the maximum
> > number of jobs that can be scheduled per node", as there is a formula
> > that limits F_CRM_THROTTLE_MAX based on F_CRM_THROTTLE_MODE.
> >
> > Is my understanding correct that setting "load-threshold" high enough
> > (so that there are no noisy messages) affects only "throttle_job_max"
> > and nothing more?
> > Also, if I got it right, then "throttle_job_max" is the number of
> > parallel actions allowed per node in lrmd,
> > and a child of the lrmd is actually an RA process running some action
> > (monitor, start, etc.).
> >
> > So there is no impact on how many RAs (resources) can run on a node, only
> > on how Pacemaker operates on them in parallel (I am not sure I understand
> > this part correctly).
>
> I believe that is an accurate description. I think the job limit applies
> to fence actions as well as lrmd actions.
>
> Note that if /proc/cpuinfo exists, pacemaker will figure out the number
> of cores from there, and divide the actual reported load by that number
> before comparing against load-threshold.
>
> > Thank you,
> > Kostia
> >
> > On Wed, Jun 3, 2015 at 12:17 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
> >
> >>
> >>> On 27 May 2015, at 10:09 pm, Kostiantyn Ponomarenko <konstantin.ponomarenko at gmail.com> wrote:
> >>>
> >>> I think I wasn't precise in my questions,
> >>> so I will try to ask more precise ones.
> >>> 1. Why is the default value for "load-threshold" 80%?
> >>
> >> Experimentation showed it better to begin throttling before the node
> >> became saturated.
> >>
> >>> 2. What would be the impact on the cluster of "load-threshold=100%"?
> >>
> >> Your nodes will be busier.  Will they be able to handle your load, or will
> >> it result in additional recovery actions (creating more load and more
> >> failures)?  Only you will know when you try.
> >>
> >>>
> >>> Thank you,
> >>> Kostya
> >>>
> >>> On Mon, May 25, 2015 at 4:11 PM, Kostiantyn Ponomarenko <konstantin.ponomarenko at gmail.com> wrote:
> >>> Guys, please, if anyone can help me to understand this parameter better,
> >>> I would appreciate it.
> >>>
> >>>
> >>> Thank you,
> >>> Kostya
> >>>
> >>> On Fri, May 22, 2015 at 4:15 PM, Kostiantyn Ponomarenko <konstantin.ponomarenko at gmail.com> wrote:
> >>> Another question: is it specific to crmd to measure CPU usage by "I/O wait"?
> >>> And if I need to get the most performance out of the resources running in
> >>> the cluster, should I set "load-threshold=95%" (or even 100%)?
> >>> Will it impact the cluster behavior in any way?
> >>> The man page for crmd says: "The cluster will slow down its recovery
> >>> process when the amount of system resources used (currently CPU)
> >>> approaches this limit".
> >>> Does it mean there will be delays in moving resources in case a node
> >>> goes down, or something else?
> >>> I just want to understand it better.
> >>>
> >>> Thank you in advance for the help =)
> >>>
> >>> P.S.: The main resource does a lot of disk I/Os.
> >>>
> >>>
> >>> Thank you,
> >>> Kostya
> >>>
> >>> On Fri, May 22, 2015 at 3:30 PM, Kostiantyn Ponomarenko <konstantin.ponomarenko at gmail.com> wrote:
> >>> I didn't know that.
> >>> You mentioned "as opposed to other Linuxes", but I am using Debian Linux.
> >>> Does it also measure CPU usage by I/O waits?
> >>> You are right about "I/O waits" (a screenshot of "top" is attached).
> >>> But why does it show 50% CPU usage for a single process (the main one)
> >>> while "I/O waits" shows a bigger number?
> >>>
> >>>
> >>> Thank you,
> >>> Kostya
> >>>
> >>> On Fri, May 22, 2015 at 9:40 AM, Ulrich Windl <
> >> Ulrich.Windl at rz.uni-regensburg.de> wrote:
> >>>>>> "Ulrich Windl" <Ulrich.Windl at rz.uni-regensburg.de> schrieb am
> >> 22.05.2015 um
> >>> 08:36 in Nachricht <555EEA72020000A10001A71D at gwsmtp1.uni-regensburg.de
> >:
> >>>> Hi!
> >>>>
> >>>> I Linux I/O waits are considered for load (as opposed to other
> >> Linuxes) Thus
> >>> ^^ "In"
> >>                             s/Linux/UNIX/
> >>>
> >>> (I should have my coffee now to awake ;-) Sorry.
>