[Pacemaker] RFC: Better error reporting for RAs.

Lars Marowsky-Bree lmb at suse.de
Sat Aug 1 10:56:47 UTC 2009


Hi,

so we all have seen plenty of cases where people, when asked what is
wrong with their resources, report the pengine logfile (unpack_rsc_op).

Now, we all know that this is not the real error message, but just the
PE analyzing the state of the cluster, based on the error-to-exit-code
mapping done by the RA.

Apparently, this is extremely hard to understand, and people struggle to
find the "real" error, which in turn makes it very hard for them to fix
their clusters. This is a real problem, one we've seen on the mailing
lists, IRC, and in quite a few customer incidents.

Thinking about this, I have two suggestions.


1. Unique operation id

The transition graph already includes a unique identifier for each
action. If this were made a bit shorter and provided to the RA as part
of its environment, the RA could include it in each log message. If it
were then also recorded in the CIB, crm_mon/pengine could display the
key, which users could feed to grep to find out much more quickly what
exactly has been going wrong.
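
A minimal sketch of the RA side, in Python for brevity. The environment
variable name OCF_OP_ID and the "[op <id>]" prefix are hypothetical;
nothing like this exists today, it only illustrates tagging each log
message with the key:

```python
import os

def format_log(message, environ=os.environ):
    """Prefix a log message with the operation id taken from the environment."""
    # OCF_OP_ID is a hypothetical variable name used only for this sketch.
    op_id = environ.get("OCF_OP_ID", "unknown")
    return f"[op {op_id}] {message}"
```

With such a prefix on every line, "grep 'op 42'" would pull out all
messages belonging to that one operation.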

The LRM could also log "Operation <id> start" ... "Operation <id> end"
around each operation, and then a simple grep for the two markers would
suffice to locate everything in between, narrowing down the log section
considerably. This would help even with RAs which have not been modified
to include the op key in their logging.
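
To illustrate the start/end idea, here is a rough sketch that cuts the
section for one operation out of a log. The exact marker wording and the
function name are assumptions for illustration, not an existing LRM
format:

```python
def extract_operation_log(lines, op_id):
    """Return the lines from 'Operation <id> start' through 'Operation <id> end'."""
    # Marker wording is assumed; a real LRM might phrase it differently.
    start_marker = f"Operation {op_id} start"
    end_marker = f"Operation {op_id} end"
    section = []
    inside = False
    for line in lines:
        if not inside and start_marker in line:
            inside = True
        if inside:
            section.append(line)
            if end_marker in line:
                break
    return section
```

Everything the RA (or anything else) logged between the two markers ends
up in the result, whether or not the RA itself knows about the op key.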


2. Verbose error reporting

The PE et al. only care about and interpret the exit code. While the
exit code is differentiated enough to categorize the error and lets the
cluster figure out how to respond, it is not sufficient for users to
figure out what is wrong. Case in point: "not installed" - what,
exactly, is not installed?

One possibility would be for the RA to print a one-line summary to
stderr, which would then be recorded in the CIB along with the
machine-readable error code. This would only be used for reporting to
users.
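
A rough sketch of what the caller could do, again in Python. The record
layout ("rc", "summary") and the choice of the last non-empty stderr line
as the summary are assumptions, and the "agent" is just a stand-in
command that fails with a made-up message; 5 is the OCF "not installed"
exit code:

```python
import subprocess
import sys

OCF_ERR_INSTALLED = 5  # OCF exit code for "not installed"

def run_agent(command):
    """Run an agent command; return its exit code plus a one-line summary."""
    result = subprocess.run(command, capture_output=True, text=True)
    stderr_lines = [l.strip() for l in result.stderr.splitlines() if l.strip()]
    # Assumption: the last non-empty stderr line is the human-readable summary.
    summary = stderr_lines[-1] if stderr_lines else ""
    return {"rc": result.returncode, "summary": summary}

# Stand-in for a real RA: prints a hypothetical message and exits with 5.
fake_agent = [
    sys.executable, "-c",
    "import sys; print('mysqld binary not found', file=sys.stderr); sys.exit(5)",
]
record = run_agent(fake_agent)
```

The cluster would keep acting on "rc" alone; "summary" would only be
shown to the user next to it.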


I think 1) is fairly easy to implement, and would be a big step
forward. 2) is more complicated, but would make reporting via the GUI
etc. much more helpful.

Comments?


Regards,
    Lars

-- 
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde




