www.morling.dev Open in urlscan Pro
2606:50c0:8002::153  Public Scan

URL: https://www.morling.dev/blog/whats-in-a-good-error-message/
Submission: On April 02 via manual from US — Scanned from DE

Form analysis 2 forms found in the DOM

<form id="myForm">
  <input type="text" id="inputSearch" name="q" placeholder="Search..." onfocus="warmUp(this)">
  <button type="submit" id="buttonSubmitSearch" style="line-height: normal;"><i id="iconSearch" class="fa fa-search"></i></button>
</form>

<form id="myFormMobile">
  <input type="text" id="inputSearchMobile" name="q" placeholder="Search..." onfocus="warmUp(this)">
  <button type="submit" id="buttonSubmitSearchMobile" style="line-height: normal;"><i id="iconSearchMobile" class="fa fa-search"></i></button>
</form>

Text Content

GUNNAR MORLING


RANDOM MUSINGS ON ALL THINGS SOFTWARE ENGINEERING

 * 
 * 
 * 

 * Blog
 * Projects
 * Conferences
 * Podcasts
 * About




GUNNAR MORLING


RANDOM MUSINGS ON ALL THINGS SOFTWARE ENGINEERING

 * 
 * 
 * 

 * Blog
 * Projects
 * Conferences
 * Podcasts
 * About




WHAT'S IN A GOOD ERROR MESSAGE?

Posted at Jan 12, 2022

Update Jan 13: This post is discussed on Reddit

Update Feb 7: This post is discussed on Hacker News

As software developers, we’ve all come across those annoying, not-so-useful
error messages when using some library or framework: "Couldn’t parse config
file", "Lacking permission for this operation", etc. Ok, ok, so something went
wrong apparently; but what exactly? What config file? Which permissions? And
what should you do about it? Error messages lacking this kind of information
quickly create a feeling of frustration and helplessness.

So what makes a good error message then? To me, it boils down to three pieces of
information which should be conveyed by an error message:

 * Context: What led to the error? What was the code trying to do when it
   failed?

 * The error itself: What exactly failed?

 * Mitigation: What needs to be done in order to overcome the error?

Let’s dive into these individidual aspects a bit. Before we start, let me
clarify that this is about error messages created by library or framework code,
for instance in form of an exception message, or in form of a message written to
some log file. This means the consumers of these error messages will typically
be either other software developers (encountering errors raised by 3rd party
dependencies during application development), or ops folks (encountering errors
while running an application).

That’s in contrast to user-facing error messages, for which other guidance and
rules (in particular in regards to security concerns) should be applied. For
instance, you typically should not expose any implementation details in a
user-facing message, whereas that’s not that much of a concern — or on the
contrary, it can even be desirable — for the kind of error messages discussed
here.


CONTEXT

In a way, an error message tells a story; and as with every good story, you need
to establish some context about its general settings. For an error message, this
should tell the recipient what the code in question was trying to do when it
failed. In that light, the first example above, "Couldn’t parse config file", is
addressing this aspect (and only this one) to some degree, but probably it’s not
enough. For instance, it would be very useful to know the exact name of the
file:

> Couldn’t parse config file: /etc/sample-config.properties"

Using an example from Debezium, the open-source change data capture platform I
am working on in my day job, the second message could read like so with some
context about what happened:

> Failed to create an initial snapshot of the data; lacking permission for this
> operation

Coming back to error messages related to the processing of some input or
configuration file, it can be a good idea to print the absolute path. In case
file system resources are provided as relative paths, this can help to identify
wrong assumptions around the current working directory, or whatever else is used
as the root for resolving relative paths. On the other hand, in particular in
case of multi-tenant or SaaS scenarios, you may consider filesystem layouts as a
confidential implementation detail, which you may prefer to not reveal to
unknown code you run. What’s best here depends on your specific situation.

If some framework supports different kinds of files, the specific kind of the
problematic file in question should be part of the message as well: "Couldn’t
parse entity mapping file… ". If the error is about specific parts of the
contents of a file, displaying the line number and/or the line itself is a good
idea.

In terms of how to convey the context of an error, it can be part of messages
themselves, as shown above. Many logging frameworks also support the notion of a
Mapped Diagnostic Context (MDC), a map for propagating arbitrary key/value pairs
into log messages. So if your messages are meant to show up in logs, setting
contextual information to the MDC can be very useful. In Debezium this is used
for instance to propagate the name of the affected connector, allowing Kafka
Connect users to tell apart log messages originating from different connectors
deployed to the same Connect cluster.

As far as propagating contextual information via log messages is concerned (as
opposed to, say, error messages printed by a CLI tool), structured logging,
typically in form of JSON, simplifies any downstream processing. By putting
contextual information into separate attributes of a structured log entry,
consumers can easily filter messages, ingest only specific sub-sets of messages
based on their contents, etc.

In case of exceptions, the chain of exceptions leading to the root cause is an
important contextual information, too. So I’d recommend to always log the entire
exception chain, rather than catching exceptions and only logging some
substitute message instead.


THE ERROR ITSELF

On to the next part then, the description of the actual error itself. That’s
where you should describe what exactly happened in a concise way. Sticking to
the examples above, the first message, including context and error description
could read like so:

> Couldn’t parse config file: /etc/sample-config.properties; given snapshot mode
> 'nevr' isn’t valid

And for the second one:

> Failed to create an initial snapshot of the data; database user 'snapper' is
> lacking the required permissions

Other than that, there’s not too much to be said here; try to be efficient: make
messages as long as needed, and as short as possible. One idea could be to work
with different variants of messages for the same kind of error, a shorter and a
longer one. Which one is used could be controlled via log levels or some kind of
"verbose" flag. Java developers may find Cédric Champeau’s jdoctor library
useful for implementing this. Personally, I haven’t used such an approach yet,
but it may be worth the effort for specific situations.


MITIGATION

Having established the context of the failure and what went wrong exactly, the
last — and oftentimes most interesting — part is a description of how the user
can overcome the error. What’s the action they need to take in order to avoid
it? This could be as simple as telling the user about the constraints and/or
valid values in case of the config file example (i.e. akin to test failure
messages, which show both expected and actual values):

> Couldn’t parse config file: /etc/sample-config.properties; given snapshot mode
> 'nevr' isn’t valid (must be one of 'initial', 'always', 'never')

In case of the permission issue, you may clarify which ones are needed:

> Couldn’t take database snapshot: database user 'snapper' is lacking the
> required permissions 'SELECT', 'REPLICATION'

Alternatively, if longer mitigation strategies are required, you may point to a
(stable!) URL in your reference documention which provides the required
information:

> Couldn’t take database snapshot: database user 'snapper' is lacking the
> required permissions. Please see
> https://example.com/knowledge-base/snapshot-permissions/ for the complete set
> of necessary permissions

If some configuration change is required (for instance database or IAM
permissions), your users will love you even more if you share that information
in "executable" form, for instance as GRANT statements which they can simply
copy, or vendor-specific CLI invocations such as aws iam attach-role-policy
--policy-arn arn:aws:iam::aws:policy/SomePolicy --role-name SomeRole.

Speaking of external resources referenced in error messages, it’s a great idea
to have unique error codes as part of your messages (such as Oracle’s ORA codes,
or the error messages produced by WildFly and its components). Corresponding
resources (either provided by yourself, or externally, for instance in answers
on StackOverflow) will then be easy to find using your favourite search engine.
Bonus points for adding a reference to your own canonical resource right to the
error message itself:

> Couldn’t take database snapshot: database user 'snapper' is lacking the
> required permissions (DBZ-42). Please see https://dbz.codes/dbz-42/ for the
> complete set of necessary permissions

(That’s a made-up example, we don’t make use of this approach in Debezium
currently; but I probably should look into buying the dbz.codes domain 😉).

The key take-away is that you should not leave your users in the dark about what
they need to do in order to address the error they ran into. Nothing is more
frustrating than essentially being told "You did it wrong!", without getting
hinted at what’s the right thing to do instead.


GENERAL BEST PRACTICES

Lastly, some practices in regards to error messages which I try to adhere to,
and which I would generally recommend:

 * Uniform voice and style: The specific style chosen doesn’t matter too much,
   but you should settle on either active vs. passive voice ("couldn’t parse
   config file" vs. "config file couldn’t be parsed"), apply consistent casing,
   either finish or not finishes messages with a dot, etc.; not a big thing, but
   it will make your messages a bit easier to deal with

 * One concept, one term: Avoid referring to the same concept from your domain
   using different terms in different error messages; similarly, avoid using the
   same term for multiple things. Use the same terms as in other places, e.g.
   your API documentation, reference guides etc.; The more consisent and
   unambiguous you are, the better

 * Don’t localize error messages: This one is not as clear cut, but I’d
   generally recommend to not translate error messages into other languages than
   English; Again, this all is not about user-facing error messages, but about
   messages geared towards software developers and ops folks, who generally
   should command reasonable English skills; depending on your audience and
   target market, translations to specific languages might make sense, in which
   case a common, unambiguous error code should definitely be part of messages,
   so as to facilitate searching for the error on the internet

 * Don’t make error messages an API contract: In case consumers of your API
   should be able to react to different kinds of errors, they should not be
   required to parse any error messages in order to do so. Instead, raise an
   exception type which exposes a machine-processable error code, or raise
   specific exception types which can be caught separately by the caller

 * Be cautious about exposing sensitive data: if your library is in the business
   of handling and processing sensitive user data, make sure to to not create
   any privacy concerns; for instance, "show actual vs. expected value" may not
   pose a problem for values provided by an application developer or
   administrator; but it can pose a problem if the actual value is GDPR
   protected user data

 * Either raise an exception OR log an error, but not both: A given error should
   either be communicated by raising an exception or by logging an error.
   Otherwise, when doing both, as the exception will typically end up being
   logged via some kind of generic handler anyways, the user would see
   information about the same error in their logs twice, which only adds
   confusion

 * Fail early: This one is not so much about how to express error messages, but
   when to raise them; in general, the earlier, the better; a message at
   application start-up beats one later at runtime; a message at build time
   beats one at start-up, etc. Quicker feedback makes for shorter turn-around
   times for fixes and also helps to provide the context of any failures

With that all being said, what’s your take on the matter? Any best practices you
would recommend? Do you have any examples for particularly well (or poorly)
crafted messages? Let me know in the comments below!



Please enable JavaScript, or join the discussion on GitHub.
© 2021 Gunnar Morling | Licensed Under Creative Commons BY-SA 4.0