[Pacemaker] RFC: What part of the XML configuration do you hate the most?

Thu Sep 11 07:24:33 UTC 2008

Personally, I think that this is simply the wrong approach.

If a resource doesn't want the cluster to react to a failure, then the  
RA just shouldn't report one.  Problem solved.

On Sep 11, 2008, at 9:06 AM, Satomi Taniguchi wrote:

> Hi Lars,
>
> Thank you for your reply.
>
>
> Lars Marowsky-Bree wrote:
>> On 2008-09-09T18:37:31, Satomi Taniguchi <taniguchis at intellilink.co.jp> wrote:
> [...snip...]
>>> (2) lrmd counts each resource's monitor op failures per period-length.
>>>     It ignores the resource's failures until their number exceeds the
>>>     threshold (max-failures-per-period).
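
For readers following along, the counting rule in (2) can be sketched roughly as below. This is only an illustration and not the code from the patch: the struct, the function name should_report_failure() and its parameters are hypothetical. Only the field names t_failed and failcnt_per_period echo the quoted patch fragment. It assumes period_length is positive and does not cover the period-length 0 case from (3).

```c
#include <time.h>

/* Hypothetical per-resource state; field names echo the quoted patch. */
struct rsc_fail_state {
    time_t t_failed;            /* start of the current period, 0 if none */
    int    failcnt_per_period;  /* monitor failures seen in this period */
};

/*
 * Record one monitor failure observed at time `now`.
 * Returns 1 if the failure should be reported, 0 if it should be ignored.
 * Assumes period_length > 0 and now > 0.
 */
int should_report_failure(struct rsc_fail_state *st, time_t now,
                          time_t period_length, int max_failures_per_period)
{
    /* Open a new period on the first failure or once the old one expired. */
    if (st->t_failed == 0 || now - st->t_failed >= period_length) {
        st->t_failed = now;
        st->failcnt_per_period = 0;
    }

    st->failcnt_per_period++;

    /* Failures are ignored until the count exceeds the threshold. */
    return st->failcnt_per_period > max_failures_per_period;
}
```

With a threshold of 2, the first two failures inside one period are ignored and the third is reported. A failure that arrives after the period has expired starts a fresh count.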
>> This means that this policy is enforced by the LRM; I'm not sure that's
>> perfect. Should this not be handled by the PE?
>
> At first, I also tried to implement this function in the PE.
> But there were some problems.
> (1) The PE has no way to clear fail-count.
>    By the time the PE learns of a resource's failure, the rsc's fail-count
>    has already increased. So, if this function were implemented in the PE,
>    it would be proper to use fail-count as its failure counter.
>    But then, for example, when the period is over, the fail-count needs
>    to be cleared.
>    At present, the PE has no way to request anything from the cib.
>    The PE's role is to create a graph based on the current CIB, not to
>    change it, as far as I understand.
>    And users may be confused if fail-count is cleared suddenly.
> (2) After a resource has failed once, even if it fails again, lrmd doesn't
>    notify crmd of the failure.
>    With the new function, the PE has to know about a resource's failures
>    even if they occur consecutively. But normally, the result of a monitor
>    operation is notified only when it changes.
>    In addition, even if lrmd always notified crmd of the resource's
>    failure, the rsc's fail-count wouldn't increase because the
>    magic-number doesn't change.
>    That is to say, the PE can't detect consecutive failures.
>    I tried cancelling the monitor operation of the failed resource
>    and setting the same op again.
>    But this way, the new monitor operation runs immediately,
>    so the interval between monitor operations is no longer constant.
>
> So, I concluded that it is more appropriate to implement the new function
> in lrmd.
>
>>> (3) If the value of period-length is 0, lrmd calculates the suitable length of
> [...snip...]
>>> In addition, I add a function to lrmadmin to show the following
>>> information:
>>>  i) the time when the period-length of the specified resource started.
>>> ii) the value of the failure counter of the specified resource.
>>> This is the third patch.
>> This means that the full cluster state is no longer reflected in the
>> CIB. I don't really like that at all.
>
> I see what you mean.
> If possible, I want to keep all of the cluster's state in the CIB, too.
> For that purpose, I tried to implement this function in the PE at first.
> But it seems _not_ to be possible for the above reasons...
>
>>> +	op_type = ha_msg_value(op->msg, F_LRM_OP);
>>> +
>>> +	if (STRNCMP_CONST(op_type, "start") == 0) {
>>> +		/* initialize the counter of failures. */
>>> +		rsc->t_failed = 0;
>>> +		rsc->failcnt_per_period = 0;
>>> +	}
>> What about a resource being promoted to master state, or demoted again?
>> Should the counter not be reset then too?
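
One way the reset could cover promote and demote as well is sketched below. This is an assumption-laden illustration, not the revised patch: the struct and both helper names are hypothetical, and plain strcmp() stands in for the STRNCMP_CONST macro used in the quoted fragment so that the example compiles on its own. Pulling the check into a small helper also goes in the direction of the refactoring suggested further down.

```c
#include <string.h>
#include <time.h>

/* Hypothetical per-resource state; field names echo the quoted patch. */
struct rsc_fail_state {
    time_t t_failed;
    int    failcnt_per_period;
};

/* Returns 1 if an operation of this type should reset the failure counter. */
int op_resets_fail_window(const char *op_type)
{
    /* Assumed list: "start" is from the patch, the other two from the review. */
    static const char *const resetting_ops[] = { "start", "promote", "demote" };
    size_t i;

    if (op_type == NULL) {
        return 0;
    }
    for (i = 0; i < sizeof(resetting_ops) / sizeof(resetting_ops[0]); i++) {
        if (strcmp(op_type, resetting_ops[i]) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Initialize the counter of failures when the operation type calls for it. */
void reset_fail_window_if_needed(struct rsc_fail_state *st, const char *op_type)
{
    if (op_resets_fail_window(op_type)) {
        st->t_failed = 0;
        st->failcnt_per_period = 0;
    }
}
```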
>
> Exactly.
> Thank you for pointing that out.
>
>> (The functions are also getting verrry long; maybe factor some code out
>> into smaller functions?)
>
> All right.
> I will do so.
>
>> Regards,
>>    Lars
>
> Best Regards,
> Satomi TANIGUCHI
>