One of the recurring themes in this series is the idea of programming on a shoestring—the heroic vision of a one- or two-person tech “team” cobbling together a super low-cost solution that, despite spluttering about for a spell, still manages to pull through at the end of the day and get the job done.
Given a situation characterised by a dearth of programming resources, such a team obviously has less scope for careful engineering. As such, it’s fair to say that the biggest technical risk isn’t Russian hackers or power outages on AWS; rather it’s the incompetence of the team itself, particularly when the programmers are stressed or overworked. Recognising the constant presence of this danger—however unpleasant—is the first step to mitigating it. This article is here to help guide you along the way.
The Doomsday Scenario of Not Having Backups
Data loss has a near unmatched potential to annihilate your business, sully your professional reputation, and deplete your asset reserves. If you’re not already convinced of the reality of this danger, let me spell out a doomsday scenario: Say you sell digital downloads and your contractual trading terms promise customers that their downloads will be available “for life”. One day you go to update your production database with a command line program. You stupidly mix up the source and destination parameters fed to this program, and a few confused keystrokes irreversibly wipe out your digital download repository; now you’re unable to honour your contractual terms to deliver the downloads already paid for. As a result, you are legally liable to refund every customer’s money, a liability that may extend back many years. These damages could amount to many times your business’s lifetime profits, since the refunds will be for the full sale prices, whereas your take-home profits were only a fraction of that amount, reduced by expenses and taxes, both of which may be impossible to claim back. This problem becomes life ruining if you happen to be operating without the protections of limited liability; in that case you stand to lose not just everything in your business capacity but also everything you personally own.
The incidental lesson here is to never promise anything indefinitely (or for extended periods). Don’t ever promise “free updates for life” or that the “monthly price will never rise”. That’s a terrible, terrible idea.
The main lesson of all this is, as the article title suggests, that you really, really, really should back up your data.
Part A: So What Exactly Needs to Be Backed Up?
1. SQL database dumps
This one is so obvious it’s barely worth mentioning: Back up your SQL database. Indeed, this should be your highest priority, since even to this day, most web applications rely on SQL databases to store their most important user and business data. Typically SQL databases have built-in tools to dump the entire database into a file that you can then chaperone off to somewhere safe.
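As a minimal sketch of what such a dump looks like in practice, here’s a Python function that uses SQLite’s built-in dump facility. SQLite is just a convenient stand-in: for PostgreSQL or MySQL you’d shell out to their equivalent tools (pg_dump and mysqldump respectively), but the principle of “serialise the whole database to a file you can carry off-server” is the same.

```python
import sqlite3

def dump_database(db_path: str, dump_path: str) -> None:
    """Write a complete SQL dump of an SQLite database to a file."""
    conn = sqlite3.connect(db_path)
    try:
        with open(dump_path, "w") as f:
            # iterdump() yields the SQL statements needed to rebuild
            # the entire database from scratch
            for statement in conn.iterdump():
                f.write(statement + "\n")
    finally:
        conn.close()
```

The resulting file is plain SQL, which has the pleasant side effect of being easy to inspect by eye before you trust it.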
2. Alternative permanent data stores
We live in the age of multiple data stores. While an SQL database might hold your relational data (such as the order histories of each customer), a graph database might be commissioned to store data about a social network, and an Amazon S3 bucket might be used to store large binary files such as uploaded images, movies, and PDFs. Any complete backup solution must also create copies of each and every one of the alternative data stores you use in your web app.
3. Data stored with third-party services
Just because you don’t personally gather and store certain kinds of data on your own servers doesn’t make this data any less important. As I’ve belaboured elsewhere, Google Analytics can become your oracle for data-driven marketing decisions when configured to track your online advertising spending and ecommerce revenue. But if you’ve got a bug in your implementation, then its analytical power gets spoiled. Such deleterious effects aren’t just local to the time when the bug was active; they also extend to any intervals containing this compromised time period. For example, any analysis of the “full year” or, indeed, “all time” will no longer be accurate. This loss of analytical ability could have been prevented by having a backup of your third-party data, for example by downloading a dump of this partner’s data. In situations when the third-party service in question has no functionality for downloading or restoring data, you’ll have to resort to decidedly more hackish solutions. I’ve met teams who create multiple tracking profiles in Google Analytics on day one, designating one as master and the others as backups. Before any deploy that has the potential to adversely affect its Analytics data, the team switches over just one of these profiles, but not the others. Only when they ascertain safety do they move the other profiles over. This approach, while admittedly clunky and far from generally applicable, does have the advantage that damage is limited to one profile at a time, effectively giving the team backups.
4. Code and its revision history
So far we’ve talked about backing up data; now let’s talk about backing up code. Luckily this is effortless with modern source control techniques, like Git combined with GitHub. The only trick is to ensure that you regularly push all of your branches—even work-in-progress branches—to the central repository. Case in point: A forlorn programmer recently confided that he was working for three weeks on a feature branch that he didn’t push to GitHub since it wasn’t “finished”. It transpired that his laptop got stolen on a train and three weeks of code was lost with it. This loss could have easily been avoided had he just pushed his code every evening. Needless to say, this habit should be adhered to regardless of whether or not there are any other programmers collaborating on the project.
5. Server images, server configuration files
Building and correctly configuring your production servers is both messy and prone to error. This is especially true when your web application requires a specially provisioned and tweaked operating system. For this reason you ought to have copies of your server configuration files (plus ENV variables, daemon configurations, and so on) stored somewhere other than on your server. You won’t go wrong by storing this information within your server provisioning scripts, within special files in your source control system, or within operating system images you store on Amazon S3.
6. Local development machine
Over the years, we programmers amass reams of development tools, configuration files, custom aliases, keyboard remappings, dot files, text editor plugins, four-hours-to-install compilers, and yada yada yada. As such, a stolen or broken laptop could mean days of work in replacing all this custom functionality. In fact, you might never realistically manage to replicate your old setup, and this loss can be quite the hit to both your productivity and your morale. Luckily you can completely eliminate this risk by installing software on your development machine that regularly and automatically backs up your hard drive to the cloud. I can highly recommend a piece of software called Arq. It creates daily snapshots of your hard drive and uploads them to Amazon S3. It’s saved my ass once.
7. Company training manuals and documentation
In operating your business, your team is likely to follow certain defined processes, be that in performing taxation duties or in answering customer service emails. Eventually you’ll document these processes, usually with the intention of removing yourself from day-to-day operations. It takes rather significant labour to write these documents, so it’s important to back them up so as to protect against their loss. A good general solution here is to adopt a company-wide policy that documents be exclusively created and stored in the cloud (with Google Docs or some such). This approach could then be supplemented with automatic fetching and mirroring of documents on your local hard drive (which itself ought to be backed up, as mentioned above).
8. Email and customer service history
Reference to previous communications is essential for understanding your customers’ needs, for researching previous bugs affecting them, and plenty more. What’s more, in many legal jurisdictions today, contracts can be concluded over email. Losing these essentially amounts to losing your power to enforce the contract should the other party breach its obligations. The best way to preserve old emails is—drumroll—not to delete them. Pragmatically speaking, this means you ought to archive conversations once you’re done with them instead of foolhardily deleting them.
9. Login credentials and MFA
Have you ever met someone who’s gotten permanently locked out of their primary Gmail account? I have—all it took was for them to forget their password and not have a current phone number on file with Google. What kind of fool wouldn’t even remember their passwords, you ask? The kind who relies on technology to remember passwords for them, meaning that their memories gradually fade due to never having to type them in again.
But that’s all just amateur hour compared to the modern risks in our age of multi-factor authentication (MFA): What happens if you lose or break the mobile phone containing your MFA app? This could very well mean getting locked out of your accounts for some critical suppliers. Imagine, for example, being shut out of your Heroku server, your Amazon AWS infrastructure, and your payments system. Disaster.
Some suppliers offer special restore codes you can download upon initial setup of MFA authentication on your phone. If you happen to later lose your phone, you can recover your MFA details by typing in one of these codes. Naturally, your ability to recover through this route depends on your having already taken down and stashed these codes away in a safe place.
But what if some of your suppliers don’t provide this backup option? How is this risk mitigated? On the more paranoid end of the spectrum is my security-conscious friend who directs a pretty serious tech company in the automotive industry. He bought a separate mobile phone solely for the purposes of logging in with MFA, and he stores said phone within his office’s safe. I also presume that he has rabies-ridden dogs guarding this safe, but as of the time of writing I haven’t been able to confirm this. Less extreme is another programmer friend who jailbroke his phone and copied its operating system image—including the MFA app and credentials—to a backup location in the cloud.
Part B: Failure Modes of Backups
So you’re now fairly certain that you’ve got everything important backed up. What could possibly go wrong? Here are the failure modes of well-intentioned backup systems:
1. Bugs lurking in rarely run restore code
Let me share a surprising bit of bug trivia: Fatal exceptions—ones that bring software to a grinding, abortive halt—most frequently occur in sections of code that are supposed to handle rare errors. This is cruelly ironic yet hardly surprising given that, by definition, these sections are rarely—if ever—executed by the program. More than anything, regular use of code flushes out bugs. Conversely, code that is rarely used is the most likely to contain errors.
This factoid about where hidden bugs lurk suggests a parallel take-home lesson about backup restoration code. This code is, you’d hope, rarely used in day-to-day situations. By similar reasoning to the above, backup restoration code is likely to harbour an abnormally high number of bugs. This is hardly reassuring, since the last thing you want to discover, post-disaster, is that your backup recovery system is bust.
Don’t just take my word on this. Search through Hacker News and you’ll find stories aplenty about businesses that discovered, too late, that their backup files were corrupted and therefore impossible to restore from. In these cases, prepare to say goodbye to your life.
Given these risks, it’s absolutely crucial to test both your backup and your backup recovery systems. You need to ensure not only that backups get regularly taken but also that you can restore from them. Indeed, this is a great place to bring in automated software tests.
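To make this concrete, here’s a sketch of a restore-verification check in Python, again using SQLite as a stand-in for whatever data store you actually run. The idea is to restore the backup into a throwaway database and assert that a known probe query returns the same rows there as on the live source; the function name and general shape are mine, not a standard API.

```python
import sqlite3

def verify_restore(source_db: str, backup_db: str, probe_query: str) -> bool:
    """Restore the backup into a fresh in-memory database and check
    that a known probe query returns the same rows as on the source."""
    src = sqlite3.connect(source_db)
    bak = sqlite3.connect(backup_db)
    restored = sqlite3.connect(":memory:")
    try:
        # "Restoring" here means copying the backup into a fresh DB,
        # exercising the same code path a real recovery would use.
        bak.backup(restored)
        expected = src.execute(probe_query).fetchall()
        actual = restored.execute(probe_query).fetchall()
        return expected == actual
    finally:
        src.close()
        bak.close()
        restored.close()
```

Run a check like this on a schedule, not just once, so that the restore path stays exercised as your schema evolves.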
2. Backups missing some data
I’m going to recount a series of events as a test of your ability to spot this particular failure mode of backup systems: I wanted to back up my entire Amazon S3 data store before a dramatic upcoming deploy. The backup tool available at that time was so painfully slow that the backup took six hours to finish. (The script worked by grabbing a list of all the files on my S3 bucket and then copying them to a backup bucket one by one.) Once this process had finished running, I deployed the next major version of my web application and migrated its production database. To my dismay, part of the accompanying database migration failed, so I urgently needed to restore chunks of my original S3 data from the backup I had just created. After having run the restore script, my dismay turned to horror as I realised there were still holes in the data. How was this even possible?
Here’s how: The backup had produced an exact match of the data as it was when I started the script. Any ensuing data that my web application added during the runtime of the backup process (all six hours of it) was left out.
Now we get to the underlying principle: Always ensure that data added during backup runs also gets backed up. This can be achieved rather easily by running the exact same backup script once again immediately after it completes its first run. Any data added during the six-hour window of the first run would get backed up by the second run. And given that the relative delta of new data added during those six hours would be but a small fraction of the data backed up by the first run, you can rest assured that this second run would complete quickly, leaving only a seconds-long window for unaccounted-for data.
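The two-pass idea can be sketched in a few lines of Python. Here local directories stand in for the S3 buckets, and the function only copies files not yet present in the backup, which is why the second pass is so much cheaper than the first; the function name is my own.

```python
import os
import shutil

def sync_new_files(source_dir: str, backup_dir: str) -> int:
    """Copy every file in source_dir not yet present in backup_dir.
    Returns the number of files copied."""
    os.makedirs(backup_dir, exist_ok=True)
    copied = 0
    for name in os.listdir(source_dir):
        src = os.path.join(source_dir, name)
        dst = os.path.join(backup_dir, name)
        if os.path.isfile(src) and not os.path.exists(dst):
            shutil.copy2(src, dst)  # copy2 preserves timestamps
            copied += 1
    return copied

# Pass one: long-running, while the app keeps writing new files.
# Pass two: immediately afterwards, picks up whatever arrived during
# pass one. Far fewer files, so the remaining window is tiny.
```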
(There is another option for web application teams that don’t want to run the risk of any data going missing: Switch the website into maintenance mode while the backup runs. Simply serve up a static HTML page for the whole website—that way, no database writes could possibly occur.)
3. Not regularly run
Allow me to conjure up an annoyingly didactic oxymoron: “A well-designed backup system that’s rarely run”. There’s little point in having a wonderful, error-free backup system that you only bother running once every two years. In fact, to the extent that your backup is run rarely, it fails in fulfilling its primary purpose of capturing a (near) complete copy of your data. Therefore, part of the design of any backup system should be its automation. Cronjobs and other schedulers accomplish this, albeit with the accompanying risk that when automation fails, it tends to fail silently. (See my article on silent failures for solutions.)
So what I’m saying is that a “fire-and-forget” strategy is unacceptable. It’s not enough to set up a backup system today and then assume the automation will work forevermore. You might argue that you already have excellent exception reporting in your application, but I would counter that this is irrelevant, since backup systems often run as independent programs which are outside the reach of your regular monitoring and alerting apparatus. With this in mind you will, at the very least, need to outfit your backup system with an additional exception notifier, and then perhaps supplement this with additional checks, such as a weekly script that sends you a Slack notification every Monday at 9am with the dates and file sizes of your most recent backups.
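That weekly check might look something like this sketch, which gathers the date and size of each file in a backup directory and flags staleness. The Slack delivery itself is omitted (you’d post the joined report lines to a webhook); the function name and the seven-day threshold are illustrative choices, not a standard.

```python
import os
import time

def backup_report(backup_dir: str, max_age_days: int = 7):
    """Return (report_lines, stale): a human-readable summary of each
    backup file, plus a flag set when the newest backup is too old."""
    lines = []
    newest = 0.0
    for name in sorted(os.listdir(backup_dir)):
        path = os.path.join(backup_dir, name)
        st = os.stat(path)
        newest = max(newest, st.st_mtime)
        stamp = time.strftime("%Y-%m-%d", time.localtime(st.st_mtime))
        lines.append(f"{name}: {st.st_size} bytes, modified {stamp}")
    stale = (time.time() - newest) > max_age_days * 86400
    return lines, stale
```

Note that an empty backup directory comes out as stale, which is exactly the silent failure this check exists to catch.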
Programmers who deploy often might consider an additional protection: Their deploy script ought to back up before pushing the latest changes. The point of this is to guard against the potentially fatal changes that come packaged with deploys (e.g. database migrations). Having this recent backup automatically created means that if something goes wrong during a shaky deploy you have little to fear. (BTW I am assuming you switched the application into maintenance mode before the deploy, thereby stopping any further writes from occurring.)
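The ordering described above, maintenance mode on, then backup, then deploy, can be captured in a small wrapper. This is a sketch under the assumption that your own tooling supplies each step as a callable; the function name is mine.

```python
def safe_deploy(enable_maintenance, take_backup, run_deploy,
                disable_maintenance) -> None:
    """Deploy wrapper: stop writes, snapshot the data, then deploy.
    Each argument is a callable supplied by your own tooling."""
    enable_maintenance()       # stop further writes
    try:
        take_backup()          # fresh snapshot right before the risky part
        run_deploy()           # migrations etc. now have a safety net
    finally:
        disable_maintenance()  # bring the site back even if deploy failed
```

The try/finally matters: even a failed deploy should leave the site serving pages rather than stuck in maintenance mode.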
4. Insufficient redundancy
Seemingly every hosting company will try to sell you their automatic backup services which are designed to run daily, weekly, etc. In my admittedly bitter opinion, these are barely deserving of the term “backups”: instead they are, for the most part, the illusion of backups, or commercial exploitations of fear.
I used a popular VPS provider (with automatic backups) for two years to host a chunk of my software. One day I noticed an endless stream of exceptions spewing from my server. I quickly learned that not only was one of my servers down, but it had been completely deleted! Said VPS provider mass-emailed its customers later that evening to announce it had been hacked by a disgruntled former employee, who had maliciously deleted many of their servers—as well as all their backups. Had I not had another backup stored elsewhere, I would have been snookered. While the interruption to business certainly sucked, the damage was minimal, something I cannot say for their other clients—some of whom may have lacked redundant backups.
To summarise, given the possibility of your primary backup failing, it’s prudent to back up that backup to a secondary location, for example by transferring its SQL dump to S3.
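A minimal sketch of that mirroring step, with a local directory standing in for the secondary location. In production the copy would instead be an upload to somewhere off-site such as an S3 bucket; the function name and the size check are my own additions.

```python
import os
import shutil

def mirror_backup(dump_path: str, secondary_dir: str) -> str:
    """Copy a backup file to a secondary location and return the
    destination path. A local directory stands in here for an
    off-site store such as an S3 bucket."""
    os.makedirs(secondary_dir, exist_ok=True)
    dest = os.path.join(secondary_dir, os.path.basename(dump_path))
    shutil.copy2(dump_path, dest)
    # Paranoia: confirm the mirror is the same size as the original.
    if os.path.getsize(dest) != os.path.getsize(dump_path):
        raise IOError("mirrored backup size mismatch")
    return dest
```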
5. Backups clobbering production data
As if this article isn’t gloomy enough, I’ve got another nightmare situation: You or someone you’ve hired modifies your backup script in such a way that the source parameter now appears where the destination parameter should be. Running this script will now replace your production database with your backup database, thereby deleting any data modified since your last backup.
This potential catastrophe can be avoided with about sixty seconds invested in pre-emptive measures. One such measure is to use “no delete” or “keep revisions around” settings in your data store. Another is to run your backup scripts in processes/profiles/accounts that have read-only access to your production data. And, as mentioned above, you should always have backups of your backups.
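One of those sixty-second fail-safes can be as crude as a sanity check at the top of the backup script that refuses to write into anything resembling production. This is a hypothetical sketch: the function name and the naming convention it checks for (“prod” in the destination) are assumptions you’d adapt to your own environment.

```python
def safe_backup_target(source: str, destination: str,
                       production_markers=("prod", "production")) -> None:
    """Refuse to run a backup whose destination looks like production.
    A cheap guard against swapped source/destination parameters."""
    if source == destination:
        raise ValueError("source and destination are identical")
    if any(marker in destination.lower() for marker in production_markers):
        raise ValueError(
            f"refusing to write backup into {destination!r}: "
            "destination looks like a production data store")
```

Calling this before every copy costs nothing, and it turns the exact keystroke slip from the doomsday scenario above into a loud, harmless error.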
The bigger principle here is that it’s unrealistic to expect your programmers to always be playing their A game. Likewise you cannot expect every line of code in a project to be bug free. As such, a more pragmatic approach to designing resilient software is to layer fail-safes, so that disaster can only occur if the planets align against you and multiple levels break simultaneously.
6. Being a cheapskate
Nature gave us two kidneys even though we can survive on one. In a calculus that exclusively focuses on efficiency, you might think that nature has wasted energy and resources in constructing an unnecessary extra blood filter. But an efficient organism isn’t much good at passing on its genes if it happens to be dead. Thus nature opted for redundancy.
In the same vein, so to speak, building and maintaining backups costs money, and it will no doubt multiply your storage costs. But remember: Cutting costs here is a false economy. It is the logical equivalent of selling your right kidney because you think your left one is perfectly capable on its own.