The Bug Slip

A bug hunting ritual that makes things a little easier

This article is part of my Confessions of an Unintentional CTO book, which is currently available to read for free online.

Do you have a bug-hunting ritual? Depending on who in the office you ask, a bug hunt might start with a fresh cup of coffee, or a set of pull-ups to get the blood going, or with the marshalling of your trusty rubber duck collection.

We build these rituals because bug hunts are tough; indeed, the rougher hunts can easily lay waste to your entire week’s schedule. All of this means it’s not at all unreasonable to divert a sizeable chunk of our mental energies to designing strategies which will make code repair that little bit more structured, reliable, and ultimately faster.

Let’s begin this journey by thinking about what it is, exactly, that slows down, muddies, or upsets a bug hunt. Once we’ve articulated these aggravating factors, we will be in a strong position to design a bug-hunting ritual that serves us better.

Circumstances That Exasperate Code Repair

As a self-taught programmer, someone who spent the first four years of his career flailing about without any mentors or experienced engineers to reach out to, I reckon I’ve dealt with more bugs than the lead animator at Pixar. Looking back over these bug-fraught years, I’m able to single out a handful of factors that took an outsized toll on my well-being.

1. Bugs spanning multiple sittings (or standings)

Whenever I failed to solve a bug in a single programming session, I noticed that, upon resuming the next day, I no longer held most of the bug’s context in my head. Not only that, but I had forgotten many of the clues as to what the bug was or wasn’t. And I no longer recalled which avenues I had ruled out and which ones I hadn’t.

Before you accuse me of having a terrible memory—a charge I will not wholly deny—I’d ask you to remember that programming is a technical field, one that demands its practitioners recall details with precision rather than mere approximation. It’s not good enough to remember that your CouponCalculator’s result is “off”—as a programmer you need to know exactly how much it is off by, and in exactly what circumstances this is so.

The cost of rebooting a debugging session is exacerbated whenever the gap between sessions extends beyond a day or two. As such, those who are working on side projects in their precious spare hours will be stung most severely.

2. Brain-crunching complexity

Our human brains are only so big and, unfortunately for us, software bugs don’t respect these biological hardware constraints. Often, when debugging, we must hold reams of details in mind. This sort of mental juggling degrades our creativity and diminishes our ability to reason analytically. If you’d like to confirm the existence of this performance-marring effect in other areas of your life, try the following: Do a crossword puzzle (or construct some IKEA furniture) while simultaneously holding an arbitrary seven-digit figure at the forefront of your mind.

The taxing effects of complexity will be particularly strong in junior programmers: Without years—or decades—of experience, they lack the mental models to concisely represent and compress the myriad data showing up in a bug hunt.

3. “Oh let me fix this tiny thing”

Fixing bugs means revisiting old code. Oftentimes this means opening up files that haven’t seen the light of day for an awfully long time. Many a time when I’ve done this, the contents of these rarely visited modules would reveal a horror house of issues which were unrelated to the bug I set out to solve. I’d see method signatures with the wrong semantics, out-of-sync comments, typos in consumer-facing website copy, and so on and so forth.

Believing (often with unfounded confidence) that fixing up these problems would be quick and painless, I would carry out the repairs as I went along with the main bug hunt. As it transpired, though, the needed fix-ups were all too often far more extensive than I had anticipated. Consequently, I would find myself diving too deeply into tangential rabbit holes, all the while losing the precious mental context I had already formed of the original bug. What’s more, the constant, confusing context switches involved in such a discursive style of programming would precipitate other problems, such as my accidentally leaving the code in a “debugging” state wherein log statements or temporary logic modifications would persist into production code. This is the engineering equivalent of a surgeon who sews their patient back up, only to realise the scalpel is still sitting inside the patient’s chest.

4. Packing up and going immediately after the code is fixed

You’ve found the bug, written a patch to fix it, and deployed it to production. Does this mean you can call it a day? Not by a long shot.

Bugs leave behind a path of destruction—corrupt database records, confused customers, lost leads, broken online advertising campaigns, bogus cached records, incorrect financial statements, and dysfunctional jobs in the background queue. As such, fixing the code itself is but one facet of the clean-up operation.

To give an obvious example: Imagine your website pays suppliers royalties. A bug causing a payment to go to the wrong person is going to require a raft of remedial measures, such as recovering the wrongly paid cash, transferring it to its rightful owner, updating the company’s financial reports, and communicating with all the parties involved.

5. Closing the doors to collaboration

Bug hunts are typically thought of as solitary affairs. One programmer locks down an issue and then sets themself to work on it. Typically, under this model, the debugging programmer’s interim activities (and increments of progress) are invisible to other members of their team; no one else knows the roads they are travelling.

This state of affairs isn’t ideal though. When a bug impacts one of the software’s busier routes, all hands must be called on deck in a way that doesn’t duplicate work. These emergencies aside, even run-of-the-mill bugs can often benefit from collaboration. This is especially true when one team member has exhausted their ability to solve the bug and needs to pass the mantle on to someone else.

Bug Slips: A Ritual to Systematically Address Failings in Code Repair

After spending time thinking about how to alleviate the aforementioned complicating factors, I arrived at a decidedly lo-fi solution that I call “Bug Slips”. The core principles underlying this approach are:

  1. to approach the debugging session in a standardised, battle-tested manner, ensuring I don’t forget a step that experience has shown I shouldn’t. This aspect of the approach should remind you of the arguments in The Checklist Manifesto. In addition to the thoroughness this systematic approach brings, it also facilitates collaboration, in that different people can work in parallel on different aspects of the problem.

  2. to incrementally offload information to the written word, thereby a) freeing up my mental capacity to reason; b) making it easier to resume a debugging session the next day; and c) making it easier to share my progress with other team members.

  3. to leave a permanent written record either for future reference or for supervisory review.

  4. to catch the bug more quickly through improving the quality of my thought. This rests on the idea that writing is the clearest form of thinking.

The first step in debugging with Bug Slips is to create a new, shared, text document related to the bug in question. You’ll then divide this document up into various sections and fill them in, flitting back and forth between them if need be. The totality of all the sections may well be overkill for small, quick-fix bugs; nevertheless it proves its mettle when battling substantial, Armageddon-class problems.
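
As a rough sketch of how such a document might be scaffolded, the Ruby snippet below prints an empty slip containing the six sections described in this article. The helper name bug_slip_template, the header format, and the placeholder text are illustrative assumptions rather than part of any existing tool.

```ruby
# Section titles come from this article; everything else here is an
# illustrative assumption.
BUG_SLIP_SECTIONS = [
  "What Happened",
  "Hypotheses",
  "Data",
  "Potential Solutions",
  "Related/Other Issues",
  "Cleanup"
].freeze

# Builds the text of an empty bug slip for the given bug title.
def bug_slip_template(title, date: Time.now.strftime("%Y-%m-%d"))
  header = "Bug Slip: #{title} (opened #{date})"
  body = BUG_SLIP_SECTIONS.each_with_index.map do |section, index|
    "#{index + 1}. #{section}\n\n(nothing recorded yet)\n"
  end
  ([header, ""] + body).join("\n")
end

puts bug_slip_template("Missing 2014 files in Aztec Notes zip")
```

Pasting the output into whatever shared editor your team already uses is enough; the point is only that every slip starts from the same skeleton.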

Bug Slip Section 1: What Happened

The first section of your bug slip should contain a short, high-level description of the bug. I usually include the original complaint received from a customer or the first line from the exception report.

Example 1:

A customer who bought the Aztec Notes product complained that the files from the 2014 author were missing in the order zip file they downloaded.

Example 2:

'No Country Found' error thrown from /app/app/models/accounting/vat_calculator.rb:23:in `vat_required?'

Bug Slip Section 2: Hypotheses

In any bug hunt, I believe it’s crucial to keep a running list of hypotheses as to what’s causing the issue at hand. Simply by forming and articulating your ideas, you prime your brain to seek out evidence that confirms or disaffirms your suspicions, thereby lending structure to what could otherwise be a disorderly and unfocused search.

In this context, a well-written hypothesis should contain two elements:

  • A guess about the cause of the problem

  • Some expectation(s) about what you would and wouldn’t observe if that particular guess turned out to be true.

Armed with a list of hypotheses in this form, you will tackle the bug by attempting to rule out each potential cause. The order in which you do this will generally be one of (or some mix of) the following: “hypotheses most likely to be the cause” and “hypotheses that are quickest to be ruled out”.

An added bonus of keeping a running list of hypotheses is that it lets programmers on your team debug in parallel.
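
To make the "guess plus expectation" structure and the ordering concrete, here is a minimal Ruby sketch. The likelihood and minutes_to_check fields, and the idea of scoring by likelihood per minute of checking, are my own assumptions about one way to mix the two orderings; the guesses are borrowed from the worked example in this section.

```ruby
# A hypothesis pairs a guess with what we would expect to observe if it were true.
# likelihood and minutes_to_check are rough, subjective estimates (assumptions
# of this sketch).
Hypothesis = Struct.new(:guess, :expectation, :likelihood, :minutes_to_check,
                        keyword_init: true)

# Orders hypotheses so that likely, quick-to-check ones come first.
def investigation_order(hypotheses)
  hypotheses.sort_by { |h| -(h.likelihood / h.minutes_to_check.to_f) }
end

hypotheses = [
  Hypothesis.new(
    guess: "The 2014 files were never zipped",
    expectation: "No customers have access to the 2014 files",
    likelihood: 0.5,
    minutes_to_check: 15
  ),
  Hypothesis.new(
    guess: "Customer was not entitled to the 2014 files",
    expectation: "Order completed before 1 Jan 2014",
    likelihood: 0.2,
    minutes_to_check: 2
  )
]

investigation_order(hypotheses).each_with_index do |h, i|
  puts "#{i + 1}. #{h.guess} (expect: #{h.expectation})"
end
```

With these made-up estimates, the quick entitlement check is scheduled first even though it is the less likely cause, because it can be ruled out in a couple of minutes.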

Examples:

I have written these out in wordy form here, though sometimes I use sentence fragments and shorthand in my actual debugging work.

–Hypothesis 1–
The customer was mistaken about their entitlement to download the 2014 files from the Aztec Notes product and there is, in fact, no problem with our zip files.
Expectation: This customer’s order would have been completed before 1 Jan 2014.
Investigation: I checked the DB and order #12345 had a completed_at attribute set to “24 June 2014”, meaning the customer was entitled to these upgrades.
Result: FALSE

–Hypothesis 2–
The 2014 files for the update were never placed in a zip file—specifically, the ZipFileCreator#zip_files method might not have been called when the 2014 files were first added to the system, contrary to normal and expected behaviour.
Expectation: No customers have access to the 2014 files.
Investigation: After inspecting the database, I found exactly this. No customers had access to these files. On a hunch, I then checked the logic that was supposed to carry out the zipping of updates (by calling ZipFileCreator#zip_files). I discovered this logic was contained within an #after_create callback, which, according to the Rails API docs, does not get run during saves of existing products—of which the Aztec Notes was one.
Result: Bingo!
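
The callback behaviour behind this finding can be illustrated without Rails. The class below is a toy stand-in that I wrote for this purpose, not ActiveRecord code: it only mimics the rule that after_create fires on the first save of a new record, whereas after_save fires on every save. The file names are hypothetical.

```ruby
# A toy model that imitates the create/save callback distinction.
class ToyProduct
  attr_reader :files, :events

  def initialize
    @persisted = false
    @files = []
    @events = []
  end

  def add_file(name)
    @files << name
    self
  end

  def save
    newly_created = !@persisted
    @persisted = true
    after_create if newly_created # only on the very first save
    after_save                    # on every save
    true
  end

  private

  # In the buggy app, the zipping lived in the equivalent of this hook.
  def after_create
    @events << "after_create (files: #{@files.join(', ')})"
  end

  def after_save
    @events << "after_save (files: #{@files.join(', ')})"
  end
end

product = ToyProduct.new.add_file("2013_notes.pdf")
product.save                        # first save: both hooks run
product.add_file("2014_notes.pdf")
product.save                        # later save: only after_save runs

puts product.events
```

The 2014 file only ever shows up in an after_save event, which mirrors why zipping logic placed in an after_create callback never saw the files added to an existing product.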

Bug Slip Section 3: Data

Bug hunts within production software often require the gathering and analysis of data, be that from logs, the database, custom queries run in the console, or the return values of hard-to-reach functions. It’s good to keep these results around, especially if the data is too complex to yield answers after an initial cursory glance.

Pro tip: Be sure to also copy over any console code you used to generate the data. You never know when you’ll need to generate it again in the future, perhaps with modifications.

Examples:

Data from the LineItem that experienced the bug:

{"id"=>5569, "order_id"=>6895, "product_id"=>1013589462, "price"=>#<BigDecimal:7fb35325f798,'0.9999E2',18(18)>, "created_at"=>Thu, 17 Jan 2013 19:14:38 UTC +00:00, "updated_at"=>Thu, 17 Jun 2014 19:14:38 UTC +00:00, "currency"=>"GBP"}

Generated with: LineItem.joins(:order).merge(Order.complete.today).find(5569)
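
One way to make sure the generating code never gets separated from its data is to record both in a single step. The sketch below assumes a hypothetical helper, record_data, that evaluates a snippet you typed yourself and returns an entry ready to paste into the slip. LINE_ITEMS is stand-in data so the example runs outside a Rails console; in a real session the string would hold your actual query. Because it uses eval, it should only ever be fed code you wrote.

```ruby
require "pp"

# Evaluates a snippet of console code and returns a bug-slip entry that keeps
# the code next to the data it produced. Hypothetical helper, for trusted
# input only.
def record_data(label, code, context = TOPLEVEL_BINDING)
  result = context.eval(code)
  "#{label}\nGenerated with: #{code}\nResult:\n#{result.pretty_inspect}"
end

# Stand-in data (an assumption of this sketch, not real application data).
LINE_ITEMS = {
  5569 => { "id" => 5569, "order_id" => 6895, "currency" => "GBP" }
}.freeze

puts record_data(
  "Data from the LineItem that experienced the bug:",
  "LINE_ITEMS.fetch(5569)"
)
```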

Bug Slip Section 4: Potential Solutions

Above we had a section listing potential causes of the bug. In this section we introduce a place for listing potential solutions. The reason for this is that there is nearly always more than one way to solve a particular bug, although each solution comes with its own matrix of tradeoffs. Instead of simply accepting the first fix that springs to mind, it’s good practice to spend a few minutes brainstorming other ideas. Especially when dealing with hairier problems, this exercise in lateral thinking helps you arrive at ever more elegant and less invasive solutions that may potentially shave days off the time needed to program.

Examples:

Solution 1: Change the after_create callback to an after_save. This is the most minimal and least invasive change.

Solution 2: Do away with this callback madness altogether and explicitly call ZipFileCreator#zip_files after adding updated files in the ProductsController.

Solution 3: Build a cronjob that scans for missing ZipFiles hourly and creates whatever seems to be missing.

Solution 4: Rewrite so that ZipFiles get created dynamically by the customer instead of in advance. The advantage of this is that it saves computing resources.

Bug Slip Section 5: Related/Other Issues

Every time you encounter issues other than the primary one under investigation, make a quick entry into this section so as to get it off your mind and ensure you don’t forget about it.

Examples:

The way ZipFileCreator#proposed_name is written indicates it will probably have problems when there is a space in the product name.

There is an unused method in the ProductsController.

Bug Slip Section 6: Cleanup

As mentioned in the introduction, bugs in production software often necessitate treatment that extends far beyond mere patches to the code. That’s why we need this section to take a note of each required repair as it occurs to us.

A warning: In the cruellest of ironies, I’ve noticed that the cleanup operation following a successful bug squashing is itself extraordinarily bug-ridden. One reason for this, I believe, is that cleanups tend to interact with our systems in a totally non-standard way, for example by modifying the database directly instead of effecting the changes through the normal APIs. In doing so, we run the risk of forgetting some crucial step that the normal API would ordinarily carry out. What’s more, the one-off scripts used for repair are often no better than first drafts of code, and first drafts are the most likely to contain as-yet unexposed bugs. The whole situation is exacerbated by the fact that the person doing the cleanup is often in a terrible rush to complete the repair and restore normality. So how do we shield ourselves from this damage as best we can? For one, we should interact with our data exclusively through the standard, tested interfaces, creating new ones if the situation demands it. Secondly, we should, when warranted, treat our repair scripts as serious code and consider writing some tests and sanity checks for them.

Examples:

All the product records in the last month are missing zip files; therefore, at least for some of the proposed solutions, we ought to recreate them all post-deploy.

Every customer who bought a product with updates in the last month would not have received the 2014 zip file. After the fix is deployed, we ought to mass email these customers to inform them they have additional files waiting for them in their download section.
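
To show what treating a repair script as serious code might look like, here is a minimal sketch of the first cleanup item, with a dry-run mode and a sanity check. The Product struct and the zip_files method are hypothetical stand-ins for the real models and for the normal, tested zipping path (ZipFileCreator#zip_files in the examples above); the second product name is made up.

```ruby
# Hypothetical stand-ins so the sketch runs on its own.
Product = Struct.new(:id, :name, :zipped)

# Stands in for the standard, tested interface that creates a zip file.
def zip_files(product)
  product.zipped = true
end

# Recreates missing zip files. With dry_run: true (the default) it only
# reports which products it would touch and changes nothing.
def repair_missing_zips(products, dry_run: true)
  targets = products.reject(&:zipped)
  return { repaired: 0, would_repair: targets.map(&:id) } if dry_run

  targets.each { |product| zip_files(product) }

  # Sanity check: after the repair, nothing should still be missing.
  still_missing = products.reject(&:zipped)
  raise "Repair incomplete for: #{still_missing.map(&:id)}" unless still_missing.empty?

  { repaired: targets.size, would_repair: [] }
end

products = [
  Product.new(1, "Aztec Notes", false),
  Product.new(2, "Other Notes", true)
]

p repair_missing_zips(products)                 # dry run: lists product 1 only
p repair_missing_zips(products, dry_run: false) # real run: zips product 1
```

Defaulting to a dry run is a deliberate choice in this sketch: someone in a rush has to opt in to making changes, and gets to eyeball the list of affected records first.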

