Silent Failures Must Be Given a Voice

The most dangerous bugs are those you remain chronically unaware of, as all too often occurs with failing scheduled scripts.

This article is part of my Confessions of an Unintentional CTO book, which is currently available to read for free online.

Most legal systems around the world distinguish between actions and omissions (i.e. failures to act). A person will be convicted of homicide after taking some deadly action, like shooting, poisoning, or bludgeoning. But many legal systems also penalise deadly omissions, like the failure to feed a young child under one’s guardianship, or the refusal to throw out a life buoy to a struggling swimmer.

In a similar vein, software bugs are rooted not only in erroneous and exception-causing acts, but also in problematic omissions. In my experience, these bugs-by-omission are more dangerous than bugs due to erroneous action, because they are far less likely to come to a team’s attention. When nothing concrete crashes or throws an error, there are typically no exception reports and no notification messages. The tech team might only become aware of these bugs-by-omission after a customer or administrator complains that they never received this email or that download, or that a certain product doesn’t appear in the on-site search.

What causes bugs-by-omission over the lifespan of a web app, whether in Rails or some other framework? Typically the culprits are derailed scheduled scripts, such as cron jobs gone AWOL. By way of example, features like the following could be affected:

  • Drip feed (or otherwise automated) email campaigns. No one will tell you if these break. It’s not like your customers will knock on your door and say, “Hey Mr./Ms. CEO. I never received the reactivation email your system usually shoots off after I’ve been inactive for two weeks. What’s up with that?” As such, you will be left in the dark if your email campaigns break down, and this will insidiously dampen your revenue.

  • Cleanup of bulky old log files or temporary zip files. Without regular cleanup, data accumulates on your hard drive and eventually you’ll run out of disk space, which will cripple or crash your server. Not good.

  • Daily automated backups. Without up-to-date backups, your company is open to catastrophe should a database failure occur. How do you know that your backup scripts are continuing to run as they’re supposed to?

  • Monthly royalty payments to contributors or affiliates. If these payments don’t get sent out at the right time, your valued suppliers will have a bitter and amateur impression of your company’s dealings. Worse yet, people who are dependent on the timely receipt of income from your company could face economic hardship, say, when they can’t pay their rent.

So what can we programmers do to make bugs-by-omission more salient? Are there any precautions we can take? I’d like to suggest three:

  1. Bind the performance of scheduled activities to the sending out of an alert to the website administrators. This could be done, for example, by emails or Slack notifications to an appropriate channel. Human readers of those Slack channels will come to expect these activity notifications at regular intervals, and should an expected notification stop showing up, your staff will become suspicious that something has gone wrong behind the scenes. The downside to this approach is that these notification channels can become quite noisy, perhaps so much so that you lose patience and stop wanting to check the notifications at all!

  2. Build the website administrator a health report page that summarises recent activity in various sectors—showing, for example, how many marketing emails were sent, how old the most recent backup is, and how much hard drive space is free. Because this health report only works if it is read, website owners with less than iron self-discipline should opt to have these reports delivered to their attention via email/Slack. Using all-encompassing health reports has the advantage over the previous technique in that there is now a controlled number of reports—say one per day—meaning there is much less annoying notification noise. This option is my preferred solution.

  3. Schedule scans that will alert you if and only if something seems awry. For example, the scan will alert you if the last backup happened more than a week ago, or if no marketing emails were sent at all in the last twenty-four hours. This approach has the advantage of minimising notifications to just the most informative. But its disadvantage is that it’s afflicted by the very same weakness as the problem it’s intended to solve: if this meta-alerter fails to run, then you’ll never know when something it’s supposed to monitor is broken, defeating the purpose of its addition. One hackish, imperfect way around this shortcoming would be to piggyback this meta-alerter onto external software layers that you believe to be more reliable than your own. For example, you might program Google Analytics to alert you if no one has visited a certain page in the last twenty-four hours, where this page is one only visited by users responding to some scheduled action.
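To make options 2 and 3 a little more concrete, here is a minimal Python sketch. It is only an illustration, not a finished implementation: the `HealthSnapshot` class, its field names, and the two functions are hypothetical, and it assumes your own code can already look up when the last backup ran, how many marketing emails went out, and how much disk space is free. The thresholds are the ones used as examples above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class HealthSnapshot:
    """Hypothetical measurements that your own app would collect."""
    last_backup_at: datetime
    marketing_emails_last_24h: int
    free_disk_bytes: int


def build_report(snapshot: HealthSnapshot, now: datetime) -> str:
    """Option 2: a short summary to email or post to Slack once a day."""
    backup_age_hours = (now - snapshot.last_backup_at).total_seconds() / 3600
    free_gib = snapshot.free_disk_bytes / 1024 ** 3
    return "\n".join([
        f"Most recent backup: {backup_age_hours:.1f} hours ago",
        f"Marketing emails sent (24h): {snapshot.marketing_emails_last_24h}",
        f"Free disk space: {free_gib:.1f} GiB",
    ])


def find_problems(
    snapshot: HealthSnapshot,
    now: datetime,
    max_backup_age: timedelta = timedelta(days=7),
) -> List[str]:
    """Option 3: return warnings only when something seems awry."""
    problems = []
    if now - snapshot.last_backup_at > max_backup_age:
        problems.append("Last backup is more than a week old")
    if snapshot.marketing_emails_last_24h == 0:
        problems.append("No marketing emails sent in the last 24 hours")
    return problems


if __name__ == "__main__":
    # Made-up sample values, purely for demonstration.
    sample = HealthSnapshot(
        last_backup_at=datetime(2024, 1, 1, 9, 0),
        marketing_emails_last_24h=0,
        free_disk_bytes=5 * 1024 ** 3,
    )
    current_time = datetime(2024, 1, 10, 9, 0)
    print(build_report(sample, current_time))
    for problem in find_problems(sample, current_time):
        print("WARNING:", problem)
```

The report is sent every day regardless of its contents, so its absence is itself a signal. The scan only speaks up when a threshold is crossed, which is why it still needs something outside itself to confirm that it ran.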

Update: Since publishing, I have learned that there is an amusingly named service that checks up on your cron jobs: Dead Man’s Snitch
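The general idea behind this kind of monitoring, as I understand it, is a dead man's switch: the scheduled job checks in with an outside service each time it succeeds, and the service raises the alarm when a check-in fails to arrive on time. Below is a minimal Python sketch of the job's side of that arrangement. The URL and the function names are hypothetical placeholders, not the API of any particular service, and it assumes the check-in is a plain HTTP GET.

```python
import urllib.request
from typing import Callable

# Hypothetical check-in URL; substitute whatever your monitoring service gives you.
CHECK_IN_URL = "https://monitoring.example.com/check-in/nightly-backup"


def ping(url: str, timeout: float = 10.0) -> None:
    """Report a successful run by making a plain HTTP GET request."""
    with urllib.request.urlopen(url, timeout=timeout):
        pass


def run_with_check_in(job: Callable[[], None],
                      check_in: Callable[[], None]) -> None:
    """Run the job, then check in only if it finished without raising."""
    job()       # an exception here propagates, so no check-in is sent
    check_in()


# Example usage inside a scheduled script (run_backup is your own function):
#     run_with_check_in(run_backup, lambda: ping(CHECK_IN_URL))
```

Because the check-in happens only after the job returns, a crashed job and a job that never started look the same to the outside monitor: silence, which is exactly what it is watching for.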
