Excelling at Exception Notification

Fighting inadequate exception notification: do your background queues, assistive servers, and OS report issues?

This article is part of my Confessions of an Unintentional CTO book, which is currently available to read for free online.

An exception notifier does exactly what its name suggests: It relays reports about exceptions occurring on your production software. Not only do these reports enhance your awareness of active bugs, but they also ease debugging thanks to the availability of detailed information about application state at the moment a bug strikes.

Of course, as with many programmer tools, there is a knack to using exception notifiers. As such, I’ve come to the following set of best practice guidelines for squeezing the most from my exception notifiers:

1. Watch Out for General Exception Swallowing Mechanisms

By default, exception notifiers send their alerts only when an exception percolates all the way up through the application without getting rescued.

This is inadequate and here’s why: Assuming you care about your web application’s user experience, you won’t want your users to be confronted by ugly, unpredictable software exceptions. As a means of shielding users, most developers rescue exceptions late in the request cycle and then gracefully redirect their users to friendly-looking error pages.

The problem with this design, from an exception notification point of view, is that you won’t receive notifications about any of these swallowed exceptions. What’s more, your lack of awareness doesn’t stop these from continuing to negatively affect users. Thus you remain in the dark until someone writes to complain.

The trick to avoiding this informational lacuna is simple, so long as you remember to do it: Rewrite the code that redirects to your error page so that it explicitly sends an exception notification first. Here’s an example in Ruby:

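rescue => e
  # Report the swallowed exception before showing the friendly error page
  ExceptionNotifier.notify_exception(e)
  redirect_to :homepage, flash: {alert: "We encountered an error. We’re on it."}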

2. Look Out for Errors in Background Jobs

Modern web applications conduct much of their business behind closed doors. Is your exception notifier privy to these goings-on? Will you get alerts when an exception occurs in one of your background queues? By default, you won’t, unless you patch your background worker library so that it sends you an alert once a job has failed for the maximum number of attempts.

And in case you were wondering: Why wait until the job has failed for the maximum number of attempts instead of sending exception reports after the first failure? This is to allow for transient errors (e.g. momentary network connectivity problems). There’s likely no point in sending an alert message for a temporary failure.
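
As a rough illustration of the "alert only after the final attempt" idea, here is a minimal plain-Ruby sketch. The method name `run_with_final_alert` and the `notifier` callable are hypothetical stand-ins; a real queue library has its own retry settings and hooks, and would normally wait between attempts.

```ruby
# Sketch only: `notifier` stands in for whatever your exception notifier exposes
# (for example, a lambda wrapping a call to your notification library).
def run_with_final_alert(notifier, max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError => e
    # Transient failures are retried silently...
    retry if attempts < max_attempts
    # ...and only the final failure produces an alert.
    notifier.call(e, attempts)
    raise
  end
end

alerts = []
notifier = ->(error, attempts) { alerts << "#{error.class} after #{attempts} attempts" }

# A job that fails once and then succeeds sends no alert.
run_with_final_alert(notifier) { |attempt| raise IOError, "network blip" if attempt == 1 }
```

A job that fails on every attempt would trigger exactly one alert and then re-raise, so the queue can still mark it as failed.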

3. Capture Client-Side JavaScript Exceptions

Contemporary websites rely increasingly on client-side JavaScript to do their job. Absent any express efforts to broadcast JavaScript failures, any hiccups occurring in your client-side code will show as exceptions in your users’ browsers but won’t ever enter your team’s awareness.

At the time of writing, the solutions for JavaScript exception notification are mostly paid services, such as Sentry, Airbrake, and New Relic. That said, I have no doubt that good open source alternatives will soon be available.

4. Remember to Provision All Your Assistive Servers with Their Own Exception Notification Systems

A hot (IMO too hot) theme in web development circles is the splitting up of monolithic applications into suites of microservices that provide specialised, narrow bits of functionality.

These assistive servers are often woven from a lighter fabric than your primary server: Instead of being based on full-blown web frameworks with all their bells and whistles, microservices are usually pared-down, highly focused programs. And herein lies the problem: these little slices of functionality often exist without any machinery to detect and communicate errors. Don’t make this mistake… be sure to outfit these services with exception notification capabilities.
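
For a pared-down Ruby service, one lightweight option is a small Rack-style middleware. The sketch below is an assumption-laden illustration rather than any library’s API: the class name `ExceptionReporting` and the `notifier` callable are hypothetical, and the code relies only on the Rack convention that an app responds to `call(env)` and returns a status, headers, and body.

```ruby
# Minimal Rack-style middleware: reports any unhandled exception, then
# returns a plain 500 response. `notifier` is a hypothetical callable; in a
# real service it would wrap your notifier of choice.
class ExceptionReporting
  def initialize(app, notifier)
    @app = app
    @notifier = notifier
  end

  def call(env)
    @app.call(env)
  rescue StandardError => e
    # Pass along a little request context to ease debugging.
    @notifier.call(e, { path: env["PATH_INFO"] })
    [500, { "content-type" => "text/plain" }, ["Internal Server Error"]]
  end
end
```

Wrapping the service’s app object in this class means even a tiny, framework-free service reports its failures instead of failing silently.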

5. Comprehensively Capture Errors Occurring at the OS Level

Exception notifiers are usually installed as part of the main web application, which will be built in Ruby, PHP, or whatever else happens to be the flavour of the day. But it’s quite likely there’s more to your stack than just your primary web application. Indeed it would be patently absurd not to rely on external pre-built programs for certain tasks, such as rendering movies (with ffmpeg). As such, it’s quite common for web applications to “shell out” to other programs at just the right time in the application flow:

generated_movie_response = `ffmpeg options…`
if invalid?(generated_movie_response)
  # Wrap the shell output in an exception and report it explicitly
  error = RuntimeError.new("ffmpeg failed")
  ExceptionNotifier.notify_exception(error, data: {shell_error_message: generated_movie_response})
end

Now notice that the call to ffmpeg could fail, for example because you mishandle its API, or because you don’t have enough RAM, or because the planets don’t align for some other reason. The above code goes some of the way toward alerting you to these errors, thanks to the explicit use of the exception notifier. But if you want an optimal debugging experience, you ought to go further.

The problem is that, in Unix at least, a command’s output gets split across two separate streams: STDOUT (file descriptor 1) and STDERR (file descriptor 2). Ruby, along with many other programming languages, has various means of “shelling out”, including the backticks invocation depicted above. You need to be careful here, because certain mechanisms drop certain parts of the output. In particular, the backticks method shown above captures only STDOUT, meaning that any ffmpeg error messages sent to STDERR are lost. Because of this blind spot, debugging becomes more cryptic than it ought to be.

You can improve the above snippet either by using a different Ruby method for shelling out or by explicitly redirecting STDERR to STDOUT.
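
One way to apply the first option is `Open3.capture3` from Ruby’s standard library, which returns STDOUT, STDERR, and the exit status separately. The sketch below substitutes a small `sh` command for ffmpeg so that it runs anywhere with a Unix shell; the helper name `run_command` is made up for illustration.

```ruby
require "open3"

# Runs a command and returns its stdout, stderr, and exit status separately.
def run_command(*command)
  stdout, stderr, status = Open3.capture3(*command)
  { stdout: stdout, stderr: stderr, success: status.success?, exit_code: status.exitstatus }
end

# A stand-in for a failing ffmpeg call: writes to both streams, exits non-zero.
result = run_command("sh", "-c", "echo progress; echo 'codec not found' >&2; exit 1")

unless result[:success]
  # With backticks, only the "progress" line would have been captured.
  warn "command failed (#{result[:exit_code]}): #{result[:stderr]}"
end
```

For the second option, appending `2>&1` to the command inside the backticks merges STDERR into the captured output, and `Open3.capture2e` does the same without involving shell redirection.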

6. Package Bonus Information into Your Exception Notifications

Good exception notifications don’t just alert you about the existence of bugs; they also help you figure out what caused them and perform damage control. As such, you should package bonus information into your exception reports. One low-effort way to achieve this is to interpolate extra info into your exception message. For example, compare the following variants:

   raise SellerCurrencyDoesNotMatchCurrencyOfSales, "Seller currency does not match their products’ currency"
   raise SellerCurrencyDoesNotMatchCurrencyOfSales, "Seller with (id: #{id}, email: #{email}) has currency set to #{currency} whereas their products are in these currencies: #{products.map(&:currency)}"

Notice how the second exception message includes a payload of identifying information (seller ID, seller email, seller currency, and the currencies of each of the seller’s products). Because exception notifiers forward the original exception messages, I’m sure to receive all this tasty information within my alerts. With this I’m now able to (1) identify the bug’s cause; (2) repair any malformed records in the database; (3) contact affected users to explain we are aware of the issue.

7. Capture Exceptions Occurring before Exception Notification Systems Even Boot

Inherent to exception notification libraries is a shortcoming: They cannot detect errors occurring before the library boots up. Because of this, any bugs that cause your exception notification library to fail to start up will not trigger any alerts and will therefore escape your awareness.

For the perfectionistic readers, it’s possible to prevent this class of error by introducing another error-detection system sitting outside the process housing the exception notification code. This outside observer monitors whether or not the observed process is running, thereby checking for errors at a conceptually simpler level.

For those serving from their own Unix boxes, the open source Monit tool has long been a favourite for external observation. As you might guess from its name, Monit monitors processes on your server and whenever it senses trouble, it takes corrective actions (such as restarting a process) and sends you an alert.
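
Monit is configured through its own control file, so the following is not Monit itself. It is only a minimal Ruby sketch of the underlying idea: an outside observer checking whether another process is still alive. The method names and the pidfile convention are assumptions for illustration.

```ruby
# Returns true if a process with the given pid exists.
# Signal 0 performs the existence check without delivering a signal.
def process_alive?(pid)
  Process.kill(0, pid)
  true
rescue Errno::ESRCH
  false # no such process
rescue Errno::EPERM
  true  # the process exists but belongs to another user
end

# Reads a pid from a pidfile and alerts if the process is gone.
# `notifier` is a hypothetical callable (email, chat message, etc.).
def check_pidfile(pidfile, notifier)
  pid = Integer(File.read(pidfile).strip)
  notifier.call("process #{pid} from #{pidfile} is not running") unless process_alive?(pid)
rescue Errno::ENOENT, ArgumentError
  notifier.call("pidfile #{pidfile} is missing or unreadable")
end
```

Run from cron or a separate long-lived process, a check like this keeps working even when the application, and the exception notifier inside it, never managed to boot.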

8. Avoid Overly Vocal Notifications

Overly chirpy exception notifiers are annoying. This is especially true when the alerts happen to be near clones that are traceable to the same underlying issue.

Imagine how jaded you’d feel if you woke up to an inbox saturated with identical exception reports. You’d lose patience with your alerting system and you’d probably be tempted to deal with the deluge by assuming that all the innumerable exception reports clogging your inbox are one and the same. If this presumption of identity turned out to be misguided, you’d have a problem: you’d miss the exception reports announcing important, previously unseen bugs.

How can one get around this conundrum?

  1. Configure your software such that it stops triggering exceptions in routine, benign cases. This approach embodies the advice that exceptions are supposed to be reserved for behaviour that’s genuinely exceptional.

  2. Configure exception notifiers to ignore certain classes of error. For example, it’s probably unimportant to you if a search engine bot tries to access a bazillion URLs that don’t exist. It would probably be wiser to turn off the exception notifications on this one and instead leave some warnings in the logs.

  3. Direct different classes of exception notifications into graded categories of alerting channels (e.g. over Slack). The Slack channel named “fatal_exceptions” will, obviously, be a high priority: the kind of bombshell you’d wake up at 4am for. By contrast, the “warnings_and_oddities” channel will be unimportant and you’ll probably only eyeball it once every few weeks.

  4. Use a paid solution like Rollbar, which parses exception reports and then clusters similar exception types together for you, thereby helping you keep abreast of the novelty that matters.
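
To make the ignoring and clustering ideas from the list above concrete, here is a minimal in-process sketch. The class name `DedupingNotifier` and its fingerprint (exception class plus first backtrace line) are simplifying assumptions; hosted services group errors far more carefully, and this in-memory version forgets everything on restart.

```ruby
require "set"

# Suppresses repeat alerts for exceptions that share a fingerprint, and
# skips exception classes you have chosen to ignore.
class DedupingNotifier
  def initialize(notifier, ignored: [])
    @notifier = notifier # hypothetical callable that sends the real alert
    @ignored = ignored
    @seen = Set.new
  end

  # Returns true if an alert was sent, false if ignored or a duplicate.
  def notify(error)
    return false if @ignored.any? { |klass| error.is_a?(klass) }

    fingerprint = [error.class.name, error.backtrace&.first]
    # Set#add? returns nil when the fingerprint was already present.
    return false unless @seen.add?(fingerprint)

    @notifier.call(error)
    true
  end
end
```

With this in place, a thousand copies of the same failure yield a single alert, while a new exception raised from a different line still gets through.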
