Why your code should be capable of remembering and replaying inputs in the event that something goes wrong

This article is part of my Confessions of an Unintentional CTO book, which is currently available to read for free online.

Over the years, I’ve noticed a particular property of software systems that eases damage control if and when something goes wrong. I haven’t encountered a name for this property elsewhere, so for the sake of having a conceptual token to refer to, I’m going to dub this trait replayability. Broadly speaking, a software system exhibits replayability if it is capable of remembering and replaying inputs in the event that something goes wrong. Once the bug or outage afflicting the system has been fixed, the history of inputs can be replayed on command, instead of requiring the programmers to piece together disparate trails from logs (or losing requests altogether). This enables the software system to bounce back and resume service without missing anything important. The package arrives late instead of not at all.

A web application can be conceptualised as a program that first accepts inputs in the form of HTTP requests, then proceeds to process them, and finally returns outputs (usually as HTML, but sometimes as JSON, text, video, or more exotic formats). In a typical RESTful design, POST requests are designated as the carriers of detailed user inputs. These inputs usually correspond to the contents of web forms. Losing these inputs to a website failure is costly: for example, a visitor to your find-a-mortgage website might have invested forty grueling minutes filling out an application form. If their data is lost, it’s unlikely they’ll have the patience to try again with your web application and, as such, you’ll lose them (and their commissions) to your competitors. In a similar vein, imagine you have an administrative employee who spends half a day writing SEO descriptions on your system, only to see them irrevocably gobbled up by a system hiccup. Not only does the lost input cost your company half a day of wasted wages, but it also damages the morale of the hapless administrator, who is now faced with half a day of tedious repetition.

There is an argument that, thanks to the presence of log files, input information is never truly lost in a web application. But anyone who’s maintained a site for more than a few weeks knows this isn’t quite true: for example, user-uploaded binary files leave only their filenames as remnants within logs; their innards are forever lost as temporary files. Regular inputs go missing too: any website practicing even a modicum of security will strip sensitive inputs from the logs, protecting passwords, credit card numbers, and medical histories from prying eyes. This information cannot be recovered through logs. And even if, hypothetically speaking, the full parcel of user inputs could be reconstructed from logs, it would be painfully tedious and error-prone, especially if the bug affected hundreds or thousands of requests.

So, wouldn’t it be great if there were an easier way to replay these requests once the problem gets fixed?

Well, luckily there is, at least for any website whose architecture employs background queues. Background queues, as part of their lifecycle, intercept inputs (e.g. the parsed payload of an HTTP POST request) and store these as jobs (units of work waiting to be done later), typically in the database. The background workers then attempt these jobs one by one, such that any particular job will either complete successfully or fail. Because a failed job remains in the background queue along with its stored inputs, the website owner can rerun it as soon as the bug is resolved. This is de facto replayability, a free side effect that comes from moving discrete globs of functionality to the background queue.
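The lifecycle described above can be sketched in a few lines. This is only a minimal illustration, assuming a single SQLite table as the queue; the table layout and the names `open_queue`, `enqueue`, `work`, `replay_failed`, `buggy_handler` and `fixed_handler` are all hypothetical and do not come from any particular queueing library.

```python
import json
import sqlite3


def open_queue(path=":memory:"):
    """Open (or create) a hypothetical jobs table that doubles as the queue."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " error TEXT)"
    )
    return db


def enqueue(db, payload):
    """Store the parsed request payload as a job before any processing happens."""
    cur = db.execute("INSERT INTO jobs (payload) VALUES (?)", (json.dumps(payload),))
    db.commit()
    return cur.lastrowid


def work(db, handler):
    """Attempt every pending job once; a failed job keeps its payload."""
    rows = db.execute(
        "SELECT id, payload FROM jobs WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for job_id, payload in rows:
        try:
            handler(json.loads(payload))
        except Exception as exc:
            db.execute(
                "UPDATE jobs SET status = 'failed', error = ? WHERE id = ?",
                (repr(exc), job_id),
            )
        else:
            db.execute(
                "UPDATE jobs SET status = 'done', error = NULL WHERE id = ?",
                (job_id,),
            )
    db.commit()


def replay_failed(db, handler):
    """Once the bug is fixed, mark failed jobs as pending and run them again."""
    db.execute("UPDATE jobs SET status = 'pending' WHERE status = 'failed'")
    db.commit()
    work(db, handler)


def count(db, status):
    """Number of jobs currently in the given status."""
    return db.execute(
        "SELECT COUNT(*) FROM jobs WHERE status = ?", (status,)
    ).fetchone()[0]


def buggy_handler(form):
    # Simulated bug: crashes whenever the optional "income" field is missing.
    return int(form["income"])


def fixed_handler(form):
    # The repaired version tolerates the missing field.
    return int(form.get("income", 0))


if __name__ == "__main__":
    db = open_queue()
    enqueue(db, {"name": "Ada", "income": "50000"})
    enqueue(db, {"name": "Bob"})        # this one trips the bug
    work(db, buggy_handler)             # Bob's job fails but its input is kept
    replay_failed(db, fixed_handler)    # after the fix, replay on command
    print(count(db, "done"), count(db, "failed"))
```

The point to notice is that the payload is written down before the handler ever runs, so a crash in the handler costs time but not data.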

Knowing that background queues give free replayability suggests that you could increase overall replayability by reshaping even more website features into the mould of background tasks. This holds even if the feature is not of the kind that one would normally delegate to the background. Some context first: as many readers probably already know, one of the usual motivations for background jobs is to handle slow-running tasks. When your server needs to send an email, this can take five seconds. But you don’t want to leave the visitor sitting in suspenseful limbo all that time. To avoid this state of affairs, web developers use background queues, such that the time-intensive emailing part gets queued for later whereas the user promptly sees a (quick-to-generate) HTML response saying that the email was queued. This setup allows the user to quickly take the reins again and continue interacting with your website.

Now the new idea I am suggesting here is that you also wrap in background jobs those tasks that would otherwise execute (near) instantaneously, such as run-of-the-mill form submissions. The basis for this suggestion is that the advantages of replayability outweigh the inconveniences of background tasks (mainly the hassle of communicating outputs to users upon job completion, as done today through mechanisms like JavaScript updates saying “your email was successfully sent” or through auto refreshes of the page once something is ready).
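To make the trade-off concrete, here is a rough sketch of an ordinary form submission handled as a job, with a status lookup that front-end JavaScript (or an auto-refreshing page) could poll. It keeps everything in memory for brevity, and the names `FormJobs`, `submit`, `run`, `poll` and `save_description` are hypothetical; a real setup would persist the jobs and expose `submit` and `poll` through your web framework's routes.

```python
import itertools
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class Job:
    form: Dict[str, Any]           # the user's input, kept until the job succeeds
    status: str = "pending"        # pending -> done | failed
    result: Optional[Any] = None
    error: Optional[str] = None


class FormJobs:
    """Hypothetical in-memory stand-in for a persistent background queue."""

    def __init__(self) -> None:
        self._jobs: Dict[int, Job] = {}
        self._ids = itertools.count(1)

    def submit(self, form: Dict[str, Any]) -> Dict[str, Any]:
        """What the POST handler would do: record the input and return at once."""
        job_id = next(self._ids)
        self._jobs[job_id] = Job(form=dict(form))
        return {"job_id": job_id, "status": "pending"}

    def run(self, job_id: int, handler: Callable[[Dict[str, Any]], Any]) -> None:
        """What a worker would do: attempt the job and record the outcome."""
        job = self._jobs[job_id]
        try:
            job.result = handler(job.form)
        except Exception as exc:
            job.status, job.error = "failed", repr(exc)
        else:
            job.status, job.error = "done", None

    def poll(self, job_id: int) -> Dict[str, Any]:
        """What a status endpoint would return to the polling page."""
        job = self._jobs[job_id]
        return {"status": job.status, "result": job.result}


def save_description(form: Dict[str, Any]) -> str:
    # Stand-in for the (near) instantaneous work, e.g. a database write.
    return "Saved description for " + form["page"]


if __name__ == "__main__":
    jobs = FormJobs()
    receipt = jobs.submit({"page": "/pricing", "description": "Plans and prices"})
    print(receipt)                            # the user gets this immediately
    jobs.run(receipt["job_id"], save_description)
    print(jobs.poll(receipt["job_id"]))       # the page learns the job is done
```

The extra moving part is the `poll` step: the user no longer gets the result in the same response, which is exactly the inconvenience being traded for the ability to call `run` again on a job that failed.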

Obviously, more sophisticated techniques for replayability can be conceived, and I invite you to explore them. I only wished to share my duct tape version for the benefit of other hackers who don’t have access to unlimited programming resources but do want more bang for their buck.
