[Pacemaker] RFC: What part of the XML configuration do you hate the most?
Andrew Beekhof
beekhof at gmail.com
Thu Sep 11 07:24:33 UTC 2008
Personally, I think that this is simply the wrong approach.
If a resource doesn't want the cluster to react to a failure, then the
RA just shouldn't report one. Problem solved.
On Sep 11, 2008, at 9:06 AM, Satomi Taniguchi wrote:
> Hi Lars,
>
> Thank you for your reply.
>
>
> Lars Marowsky-Bree wrote:
>> On 2008-09-09T18:37:31, Satomi Taniguchi <taniguchis at intellilink.co.jp> wrote:
> [...snip...]
>>> (2) lrmd counts each resource's monitor op failures per period-length,
>>> and ignores the resource's failures until their number exceeds the
>>> threshold (max-failures-per-period).
>> This means that this policy is enforced by the LRM; I'm not sure that's
>> perfect. Should this not be handled by the PE?
>
> At first, I also tried to implement this function in PE.
> But there were some problems.
> (1) PE has no way to clear fail-count.
>     By the time PE learns of a resource's failure, the rsc's fail-count
>     has already increased. So if this new function were implemented in
>     PE, it would be natural to use fail-count as the failure counter.
>     But then, for example, fail-count would need to be cleared when the
>     period is over.
>     At present, PE has no way to request anything from the cib.
>     As far as I understand, PE's role is to create a graph based on the
>     current CIB, not to change it.
>     And users may be confused if fail-count is suddenly cleared.
> (2) After a resource has failed once, lrmd doesn't notify crmd even if
>     it fails again.
>     With the new function, PE has to know about a resource's failures
>     even when they occur consecutively. But normally, the result of a
>     monitor operation is reported only when it changes.
>     In addition, even if lrmd always notified crmd of the resource's
>     failure, the rsc's fail-count wouldn't increase, because the
>     magic-number doesn't change.
>     That is to say, PE can't detect consecutive failures.
>     I tried cancelling the monitor operation of the failed resource and
>     setting the same op again.
>     But in this way, the new monitor operation runs immediately, so the
>     interval between monitor operations is no longer constant.
>
> So, I concluded that it is more appropriate to implement the new
> function in lrmd.
>
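For readers following the thread, the per-period counting described in item (2) of the quoted proposal can be sketched roughly as below. This is only an illustration, not the code from the patch: the field names `t_failed` and `failcnt_per_period` are taken from the quoted patch fragment, but reading `t_failed` as the start time of the current period is an assumption, and the struct and function names (`rsc_failstate`, `failstate_reset`, `failstate_record`) are hypothetical. The special case of a period-length of 0 is not covered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical per-resource state. Field names follow the quoted patch;
 * treating t_failed as "start of the current period" is an assumption. */
struct rsc_failstate {
    time_t t_failed;         /* start of the current period, 0 = none open */
    int failcnt_per_period;  /* failures counted in the current period */
};

/* Clear the counter, as the quoted patch does when a "start" op is seen. */
static void failstate_reset(struct rsc_failstate *s)
{
    assert(s != NULL);
    s->t_failed = 0;
    s->failcnt_per_period = 0;
}

/* Record one monitor failure observed at time `now`.
 * Returns true only once the number of failures within the current
 * period exceeds max_failures_per_period; earlier failures are ignored. */
static bool failstate_record(struct rsc_failstate *s, time_t now,
                             time_t period_length,
                             int max_failures_per_period)
{
    assert(s != NULL);
    /* Open a new period on the first failure or after the old one expires. */
    if (s->t_failed == 0 || now - s->t_failed >= period_length) {
        s->t_failed = now;
        s->failcnt_per_period = 0;
    }
    s->failcnt_per_period++;
    return s->failcnt_per_period > max_failures_per_period;
}
```

With a 60-second period and a threshold of 2, the first two failures inside the period would be swallowed and the third reported. Note that this state lives entirely outside the CIB, which is the concern raised further down in the thread.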
>>> (3) If the value of period-length is 0, lrmd calculates the
>>> suitable length of
> [...snip...]
>>> In addition, I added a function to lrmadmin to show the following
>>> information:
>>> i) the time when the specified resource's period-length started.
>>> ii) the value of the specified resource's failure counter.
>>> This is the third patch.
>> This means that the full cluster state is no longer reflected in the
>> CIB. I don't really like that at all.
>
> I see what you mean.
> If possible, I want to keep all of the cluster's state in the CIB, too.
> That is why I first tried to implement this function in PE.
> But it seems _not_ to be possible for the above reasons...
>
>>> +    op_type = ha_msg_value(op->msg, F_LRM_OP);
>>> +
>>> +    if (STRNCMP_CONST(op_type, "start") == 0) {
>>> +        /* initialize the counter of failures. */
>>> +        rsc->t_failed = 0;
>>> +        rsc->failcnt_per_period = 0;
>>> +    }
>> What about a resource being promoted to master state, or demoted again?
>> Should the counter not be reset then too?
>
> Exactly.
> Thank you for pointing that out.
>
>> (The functions are also getting verrry long; maybe factor some code out
>> into smaller functions?)
>
> All right.
> I will do so.
>
>> Regards,
>> Lars
>
> Best Regards,
> Satomi TANIGUCHI
>
> _______________________________________________
> Pacemaker mailing list
> Pacemaker at clusterlabs.org
> http://list.clusterlabs.org/mailman/listinfo/pacemaker