[Pacemaker] RFC: Better error reporting for RAs.

Mon Aug 3 09:55:57 UTC 2009

On Mon, Aug 3, 2009 at 11:27 AM, Lars Marowsky-Bree<lmb at suse.de> wrote:
> On 2009-08-02T11:14:35, Andrew Beekhof <andrew at beekhof.net> wrote:
>
>>> The transition graph already includes a unique identifier for each
>>> action. If this was made, maybe, a bit shorter, and provided to the RA
>>> as part of the environment, the RA could include this as part of each
>>> log message - and if then this was also included in the CIB,
>>> crm_mon/pengine could provide the key which users could feed to grep and
>>> much more quickly find out what exactly has been going wrong.
>> That would be transition_key, which tells you which crmd instance, graph,
>> action number, and expected result every action has.
>> Just log it at the various places you want.
>
> Doesn't get passed to the RA though, or am I missing something? It's not
> in the environment.

Easily changed though.
Inject it after:
   op->params = xml2list(rsc_op);
in construct_op().
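
Once the key is in op->params it would reach the RA through its
environment like any other parameter. A minimal sketch of the RA side,
assuming the key arrives under the name below (the variable name is
hypothetical, nothing exports it today):

```python
import os

# Hypothetical variable name; it assumes the key is injected into
# op->params and exported to the RA with the usual OCF_RESKEY_ prefix.
KEY_ENV = "OCF_RESKEY_CRM_meta_transition_key"


def tag(message, environ=os.environ):
    """Prefix a log message with the transition key, if one was passed."""
    key = environ.get(KEY_ENV)
    return f"[{key}] {message}" if key else message


if __name__ == "__main__":
    # Made-up key value, purely for illustration.
    print(tag("binary not found", {KEY_ENV: "12:7:0:3f2a9c1e"}))
    print(tag("binary not found", {}))
```

An RA that tags every message this way degrades gracefully: without the
key in the environment the message is logged unchanged.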

>> Though I don't see the point, grepping for the resource id is usually just
>> as effective.
>
> The problem is that it isn't. That brings up all the PE messages, too much
> of the TE etc., and the lines which tend to stand out - because they are
> actually tagged with "ERROR:" and repeat several times - are the PE ones,
> which confuses users.

| grep -v pengine:
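
Spelled out, that is "grep for the resource id, then drop everything
pengine logged". A rough Python sketch of the same filter; the resource
id and the sample lines are made up for illustration and are not real
Pacemaker output:

```python
def relevant_lines(log_lines, resource_id):
    """Keep lines mentioning resource_id, dropping those logged by pengine."""
    return [
        line
        for line in log_lines
        if resource_id in line and "pengine:" not in line
    ]


# Invented, simplified log lines (not actual cluster output).
sample = [
    "node1 pengine: ERROR: rsc_db failed to start",
    "node1 crmd: info: performing rsc_db start",
    "node1 lrmd: info: rsc_db stderr: binary missing",
    "node1 crmd: info: performing rsc_web monitor",
]

if __name__ == "__main__":
    for line in relevant_lines(sample, "rsc_db"):
        print(line)
```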

>
> We need a better way to backtrack from the log message to the actual
> invocation which caused the error to be recorded.
>
> Actually the transition key _is_ already in the CIB

That would be why I suggested it.

> and at least the
> start is recorded in the logs (but not the completion, which would
> presumably be cheap to add) - except that it is a bit longish.

All completions are logged... crmd/lrm.c:1861

> Would you
> object to logging the completion event on the node where it was run too,
> and possibly including it in crm_mon/pengine logs when they log an ERROR
> during unpack_rsc?

That's what I was suggesting when I wrote "Just log it at the various
places you want."
I'd possibly even only log the first segment of the transitioner UUID.
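
A sketch of what that shortening could look like. It assumes the key is
a colon-separated string of action number, graph number, expected result
and transitioner UUID, in that order; the field order and the example
value are assumptions for illustration, so check them against a real key
before relying on this:

```python
from typing import NamedTuple


class TransitionKey(NamedTuple):
    action_id: int    # action number within the graph
    graph_id: int     # transition graph number
    expected_rc: int  # expected result of the action
    te_uuid: str      # UUID of the transitioner (crmd instance)


def parse_key(key: str) -> TransitionKey:
    """Split a key assumed to look like 'action:graph:rc:uuid'."""
    action, graph, rc, uuid = key.split(":", 3)
    return TransitionKey(int(action), int(graph), int(rc), uuid)


def short_key(key: str) -> str:
    """Same key, but keeping only the first segment of the UUID."""
    k = parse_key(key)
    first_segment = k.te_uuid.split("-")[0]
    return f"{k.action_id}:{k.graph_id}:{k.expected_rc}:{first_segment}"


if __name__ == "__main__":
    # Made-up key value.
    print(short_key("12:7:0:3f2a9c1e-5b7d-4e21-9a64-0c8d1f2e3a4b"))
    # prints 12:7:0:3f2a9c1e
```

The short form is still something you can grep for, at the cost of a
small chance of collision between transitioner instances.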

> (I'd suggest the lrm_rsc_op id attribute, but that is not unique within
> the cluster, neither on the time nor node axis.)
>
>>> 2. Verbose error reporting
>>>
>>> The PE et al only care about and interpret the exit code. While the exit
>>> code is differentiated enough to categorize the error and allows the
>>> cluster to figure out how to respond, it is not sufficient for users to
>>> figure out what is wrong. Case in point: "not installed" - what, exactly,
>>> is not installed?
>>
>> Entirely dependent on the RA as you well know.
>
> Exactly, and that is why better/more verbose reporting would be
> welcome - right now, all the RA can do is log, which sucks for users,
> because they get lost in the multitude of logs we spew.
>
>>> A possible thought would be for the RA to print a one-line summary to
>>> stderr, and record this in the CIB along with the machine-readable
>>> encoded error. This would only be used for reporting to users.
>> No.
>> We already log error output when an action fails.
>
> That is not helpful enough for users. If you doubt that, read some bug
> reports ;-)

Because I've never seen one of those?

>> I'd suggest focusing on improving the error logging that most RAs have
>> rather than adding yet more mechanisms for achieving the same thing.
>
> It is not the same thing. It would allow crm_mon or the GUI to display
> something more verbose and thus useful to users, and reduce the work
> load for the poor souls having to analyse the bug reports.

The GUI already knows how to trawl the logs for PE inputs.
If we add the key like you're suggesting, it should be perfectly
capable of pulling this up too.