[Pacemaker] RFC: What part of the XML configuration do you hate the most?
Lars Marowsky-Bree
lmb at suse.de
Tue Sep 9 09:59:28 UTC 2008
On 2008-09-09T18:37:31, Satomi Taniguchi <taniguchis at intellilink.co.jp> wrote:
> Hi lists,
>
> I'm posting two patches to implement the feature we discussed.
> One is for Pacemaker-dev (aba67759589),
> and the other is for Heartbeat-dev (fc047640072c).
>
> The specifications are the following.
> (1) add the following 4 settings.
> "period-length" - Period in seconds to count monitor op's failures.
> "max-failures-per-period" - Maximum times per period a monitor may fail.
> "default-period-length" - default value of period-length for the cluster.
> "default-max-failures-per-period" - default value of
> max-failures-per-period for the cluster.
>
> (2) lrmd counts each resource's monitor op failures per period-length,
> and ignores the resource's failures until their number
> exceeds the threshold (max-failures-per-period).
This means that this policy is enforced by the LRM; I'm not sure that's
perfect. Should this not be handled by the PE?
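To make the proposed policy concrete, here is a minimal sketch of the counting rule in (2) as I read it. The struct and function names are hypothetical; only the field names t_failed and failcnt_per_period come from the patch, and treating t_failed == 0 as "no period running" is my assumption. It is not the patch's actual code.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-resource state; the field names follow the patch. */
struct rsc_failwin {
    time_t t_failed;            /* start of the current period, 0 = none (assumption) */
    int    failcnt_per_period;  /* failures counted in the current period */
};

/*
 * Record one monitor failure observed at time `now`.
 * Returns true when the failure should be reported to the client,
 * i.e. when the count within the period exceeds max_failures.
 */
bool failwin_record(struct rsc_failwin *w, time_t now,
                    time_t period_length, int max_failures)
{
    if (w->t_failed == 0 || now - w->t_failed >= period_length) {
        /* No period running, or the old one expired: start a new one. */
        w->t_failed = now;
        w->failcnt_per_period = 0;
    }
    w->failcnt_per_period++;
    return w->failcnt_per_period > max_failures;
}
```

Written like this, the rule is a small function of the failure timestamps and two settings, so in principle it could be evaluated wherever those timestamps are available and need not live in lrmd.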
> (3) If the value of period-length is 0, lrmd calculates a suitable length of
> the period for the resource's operation.
>
> NOTE:
> "suitable" means "safe enough".
> In this patch, the expression used to calculate the "suitable" value is
> (monitor's interval + timeout) * max-failures-per-period.
> If period-length is too short, so that fewer monitor operations can
> finish within one period than the threshold requires,
> lrmd will never notify its client that the resource has failed.
> To avoid this, period-length must be larger than
> (monitor's interval + timeout) * (max-failures-per-period - 1), at least.
> Allowing for lrmd's internal processing time, the margin of
> error of the OS timer, and so on, I consider the first expression
> suitable.
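The two expressions in the NOTE can be written out as follows. This is only an illustration of the arithmetic: the function names are hypothetical, and I am assuming all values are in seconds, as period-length is.

```c
/* Default chosen by the patch when period-length is 0 (all values in seconds):
 * (interval + timeout) * max_failures. */
long suitable_period_length(long interval, long timeout, long max_failures)
{
    return (interval + timeout) * max_failures;
}

/* Lower bound from the NOTE: period-length must be strictly larger than this. */
long min_period_length(long interval, long timeout, long max_failures)
{
    return (interval + timeout) * (max_failures - 1);
}
```

For example, with a 10 s interval, a 20 s timeout and max-failures-per-period = 3, the default period is 90 s and the lower bound is 60 s, which leaves one full interval-plus-timeout cycle as margin.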
>
> In addition, I added a function to lrmadmin that shows the following information:
> i) the time at which the specified resource's current period started.
> ii) the value of the specified resource's failure counter.
> This is the third patch.
This means that the full cluster state is no longer reflected in the
CIB. I don't really like that at all.
> + op_type = ha_msg_value(op->msg, F_LRM_OP);
> +
> + if (STRNCMP_CONST(op_type, "start") == 0) {
> +     /* initialize the counter of failures. */
> +     rsc->t_failed = 0;
> +     rsc->failcnt_per_period = 0;
> + }
What about a resource being promoted to master state, or demoted again?
Should the counter not be reset then too?
(The functions are also getting verrry long; maybe factor some code out
into smaller functions?)
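Something along these lines is what I have in mind for both points. It is a rough sketch, not a tested patch: the helper name is hypothetical, it uses plain strcmp() instead of STRNCMP_CONST to stay self-contained, and whether "promote" and "demote" really should reset the counter is exactly the open question above.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: does this operation type reset the failure counter?
 * "start" comes from the patch; "promote" and "demote" are an assumption.
 */
bool op_resets_failcount(const char *op_type)
{
    static const char *const reset_ops[] = { "start", "promote", "demote" };
    size_t i;

    if (op_type == NULL) {
        return false;
    }
    for (i = 0; i < sizeof(reset_ops) / sizeof(reset_ops[0]); i++) {
        if (strcmp(op_type, reset_ops[i]) == 0) {
            return true;
        }
    }
    return false;
}
```

The caller would then shrink to a single `if (op_resets_failcount(op_type))` around the two assignments, and adding or removing an operation type becomes a one-line change to the table.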
Regards,
Lars
--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde