[Pacemaker] RFC: Better error reporting for RAs.

Sun Aug 2 05:14:35 EDT 2009

On Aug 1, 2009, at 12:56 PM, Lars Marowsky-Bree wrote:

> Hi,
>
> so we all have seen plenty of cases where people, when asked what is
> wrong with their resources, report the pengine logfile
> (unpack_rsc_op).
>
> Now, we all know that this is not the real error message, but just the
> PE analyzing the state of the cluster, based on the error to exit code
> mapping by the RA.
>
> Apparently, this is extremely hard to understand, and it seems very
> hard for people to find the "real" error. Which in turn makes it very
> hard for them to fix their clusters. This is a real problem, one we've
> seen on the mailing lists, IRC, and quite a few customer incidents.
>
> Thinking about this, I have two suggestions.
>
>
> 1. Unique operation id
>
> The transition graph already includes a unique identifier for each
> action. If this was made, maybe, a bit shorter, and provided to the RA
> as part of the environment, the RA could include this as part of each
> log message - and if then this was also included in the CIB,
> crm_mon/pengine could provide the key which users could feed to grep
> and much more quickly find out what exactly has been going wrong.

That would be transition_key, which tells you which crmd instance,
graph, action number, and expected result every action has.
Just log it at the various places you want.

Though I don't see the point; grepping for the resource id is usually
just as effective.
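
For anyone who wants to log or search on the individual fields, here is a
minimal Python sketch of splitting such a key apart. It assumes a
colon-separated layout of action number, graph number, expected result and
crmd instance UUID; that field order, and the names ActionKey and
parse_action_key, are illustrative assumptions, so check the key format of
your own version before relying on it.

```python
from typing import NamedTuple


class ActionKey(NamedTuple):
    """Fields the key is said to carry (layout assumed, see above)."""
    action: int       # action number within the graph
    graph: int        # transition graph number
    expected_rc: int  # result the cluster expected from the action
    crmd_uuid: str    # identifies the crmd instance


def parse_action_key(key: str) -> ActionKey:
    # Split into at most four parts so a UUID containing ':' stays intact.
    parts = key.strip().split(":", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 colon-separated fields, got: {key!r}")
    action, graph, expected_rc, crmd_uuid = parts
    return ActionKey(int(action), int(graph), int(expected_rc), crmd_uuid)
```

With the fields separated, a log line can carry them in a fixed, greppable
form next to the resource id.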

>
> The LRM could log this "Operation <id> start" .. "Operation <id> end",
> and then a simple grep would suffice to grab everything in-between,
> narrowing down the log section considerably. This would enhance even
> RAs which were not modified to include the op key in their logging.
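
The "grab everything in-between" step could look roughly like the Python
sketch below. It assumes the LRM really did emit the two marker strings
exactly as proposed above; no such markers exist today, and the function name
lines_for_operation is made up for illustration.

```python
from typing import Iterable, List


def lines_for_operation(log_lines: Iterable[str], op_id: str) -> List[str]:
    """Return the log lines from the start marker through the end marker."""
    start = f"Operation {op_id} start"  # proposed marker, hypothetical
    end = f"Operation {op_id} end"      # proposed marker, hypothetical
    collected: List[str] = []
    inside = False
    for line in log_lines:
        if not inside and start in line:
            inside = True
        if inside:
            collected.append(line)
            if end in line:
                break
    return collected
```

Everything an unmodified RA logs between the two markers is captured, at the
cost of also picking up unrelated messages logged in the same time window.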
>
>
> 2. Verbose error reporting
>
> The PE et al only care about and interpret the exit code. While the
> exit code is differentiated enough to categorize the error and allows
> the cluster to figure out how to respond, it is not sufficient for
> users to figure out what is wrong. Case in point: "not installed" -
> what, exactly, is not installed?

Entirely dependent on the RA as you well know.

> A possible thought would be for the RA to print a one-line summary to
> stderr, and record this in the CIB along with the machine-readable
> encoded error. This would only be used for reporting to users.

No.
We already log error output when an action fails.
Again, easily found by grepping for the resource ID.

I'd suggest focusing on improving the error logging that most RAs have
rather than adding yet more mechanisms for achieving the same thing.
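
As a rough illustration of what better RA-side logging could mean for the
"not installed" case, here is a minimal Python sketch. Most RAs are shell
scripts, so treat this as pseudocode for the idea rather than a drop-in
helper. The exit code values and the OCF_RESOURCE_INSTANCE variable come from
the OCF RA API; the function name check_binary and the message wording are
assumptions.

```python
import os
import shutil
import sys

OCF_SUCCESS = 0
OCF_ERR_INSTALLED = 5


def check_binary(name, stream=sys.stderr):
    """Return an OCF exit code and say exactly what is missing."""
    if shutil.which(name) is not None:
        return OCF_SUCCESS
    # Include the resource id so a grep for it finds this message.
    rsc = os.environ.get("OCF_RESOURCE_INSTANCE", "unknown-resource")
    print(f"ERROR: {rsc}: required binary '{name}' not found in PATH",
          file=stream)
    return OCF_ERR_INSTALLED
```

The exit code still tells the cluster how to react, while the message names
the missing piece and carries the resource id.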

> I think 1) is fairly easily implemented, and would be a big step
> forward. 2) is more complicated, but would make reporting via the GUI
> etc much more helpful.
>
> Comments?
>
>
> Regards,
>    Lars
>
> -- 
> Architect Storage/HA, OPS Engineering, Novell, Inc.
> SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde

-- Andrew