[Pacemaker] RFC: What part of the XML configuration do you hate the most?

Tue Sep 9 09:59:28 UTC 2008

On 2008-09-09T18:37:31, Satomi Taniguchi <taniguchis at intellilink.co.jp> wrote:

> Hi lists,
>
> I'm posting two patches that implement the function we have discussed.
> One is for Pacemaker-dev (aba67759589),
> and the other is for Heartbeat-dev (fc047640072c).
>
> The specifications are as follows.
>  (1) Add the following 4 settings.
>       "period-length" - Period in seconds over which a monitor op's failures
>       are counted.
>       "max-failures-per-period" - Maximum number of times a monitor may fail
>       per period.
>       "default-period-length" - Default value of period-length for the cluster.
>       "default-max-failures-per-period" - Default value of
>       max-failures-per-period for the cluster.
>
>  (2) lrmd counts each resource's monitor op failures per period-length.
>      It ignores the resource's failures until their number within the period
>      exceeds the threshold (max-failures-per-period).

This means that this policy is enforced by the LRM; I'm not sure that's
perfect. Should this not be handled by the PE?
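
For reference, the counting described in (2) might look roughly like the
sketch below. Only the field names t_failed and failcnt_per_period come from
the patch; the struct, the function name and its parameters are made up for
illustration, and the sketch takes no position on whether the LRM or the PE
should own this logic.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for the lrmd resource struct; only the two
 * field names are taken from the quoted patch. */
struct rsc_sketch {
    time_t t_failed;            /* start of the current period, 0 = none yet */
    int    failcnt_per_period;  /* failures counted in the current period */
};

/* Record one monitor failure at time `now`.
 * Returns true once the failures in the current period exceed
 * max_failures, i.e. when the failure should no longer be ignored. */
static bool record_monitor_failure(struct rsc_sketch *rsc, time_t now,
                                   int period_length, int max_failures)
{
    /* Open a new period if none is running or the old one has expired. */
    if (rsc->t_failed == 0 || now - rsc->t_failed >= period_length) {
        rsc->t_failed = now;
        rsc->failcnt_per_period = 0;
    }

    rsc->failcnt_per_period++;
    return rsc->failcnt_per_period > max_failures;
}
```

With a period of 60 seconds and a threshold of 2, the first two failures
inside one period are swallowed and only the third is reported.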

>  (3) If the value of period-length is 0, lrmd calculates a suitable length of
>      the period for the resource's operation.
>
>      NOTE:
>      "suitable" means "safe enough".
>      In this patch, the expression used to calculate the "suitable" value is
>      (monitor's interval + timeout) * max-failures-per-period.
>      If period-length is too short and the number of monitor operations that
>      finish within the period is less than the threshold,
>      lrmd will never notify its client that the resource has failed.
>      To avoid this, period-length must be larger than
>      (monitor's interval + timeout) * (max-failures-per-period - 1), at least.
>      Allowing for lrmd's internal processing time, the margin of
>      error of the OS's timer, and so on, I consider the first expression
>      suitable.
>
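
To make the arithmetic in the note concrete, here is a small sketch of the two
expressions. The function names and the use of whole seconds are assumptions
for illustration only; they are not taken from the patch.

```c
/* Hypothetical helpers; all values are in seconds. */

/* The "suitable" default from the note:
 * (interval + timeout) * max-failures-per-period. */
static int suggested_period_length(int interval, int timeout, int max_failures)
{
    return (interval + timeout) * max_failures;
}

/* The bound the period must stay above:
 * (interval + timeout) * (max-failures-per-period - 1). */
static int min_period_length(int interval, int timeout, int max_failures)
{
    return (interval + timeout) * (max_failures - 1);
}
```

For example, with a 10 second interval, a 20 second timeout and
max-failures-per-period set to 3, the suggested period is 90 seconds, which
leaves one full interval-plus-timeout of margin above the 60 second bound.
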
> In addition, I added a function to lrmadmin that shows the following information.
>   i) the time at which the current period started for the specified resource.
>  ii) the value of the failure counter of the specified resource.
> This is the third patch.

This means that the full cluster state is no longer reflected in the
CIB. I don't really like that at all.

> +	op_type = ha_msg_value(op->msg, F_LRM_OP);
> +
> +	if (STRNCMP_CONST(op_type, "start") == 0) {
> +		/* initialize the counter of failures. */
> +		rsc->t_failed = 0;
> +		rsc->failcnt_per_period = 0;
> +	}

What about a resource being promoted to master state, or demoted again?
Should the counter not be reset then too?

(The functions are also getting verrry long; maybe factor some code out
into smaller functions?)
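
Something along these lines is what I have in mind, as a rough sketch only.
The helper names and the struct are hypothetical, plain strcmp() stands in for
the STRNCMP_CONST macro used in the patch, and the operation names "promote"
and "demote" are assumed to arrive in the same field as "start".

```c
#include <stdbool.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for the lrmd resource struct; only the two
 * field names are taken from the quoted patch. */
struct rsc_sketch {
    time_t t_failed;
    int    failcnt_per_period;
};

/* True for operations that change the resource's role and should
 * therefore start a fresh failure count. */
static bool op_resets_failcount(const char *op_type)
{
    return op_type != NULL &&
           (strcmp(op_type, "start") == 0 ||
            strcmp(op_type, "promote") == 0 ||
            strcmp(op_type, "demote") == 0);
}

/* Small helper that can be called from the op handling code instead of
 * growing that function further. */
static void maybe_reset_failcount(struct rsc_sketch *rsc, const char *op_type)
{
    if (op_resets_failcount(op_type)) {
        rsc->t_failed = 0;
        rsc->failcnt_per_period = 0;
    }
}
```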

Regards,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde