A Guide to Testing Web Applications for Developers Who Care about Overall Profitability

Presented in the form of 10 axioms

This article is part of my Confessions of an Unintentional CTO book, which is currently available to read for free online.

As a once highly engaged member of the Rails community, I was audience to a flurry of strong opinions on everything and anything related to software testing. For about five years, I indulged in every testing fad, sometimes for better (e.g. Capybara) but more often for worse (e.g. Cucumber). Fed up with drifting wherever the prevailing winds of hype took me, I decided I needed to develop a more principled approach to testing. With this in mind I sat down one evening and wrote down ten basic axioms I believe to be true about software testing. Here goes:

Axiom 1: Writing, running, and maintaining tests consumes resources

Most straightforwardly, there’s the basic cost of writing an initial battery of test cases. Less apparent is the cost of the scaffolding needed by a testing suite: this includes things like a testing environment (i.e. an isolated database, appropriate environment variables, and sandboxed connections to third-party services like payment providers). If you have a database (and almost everyone does), you’ll also need data fixtures/factories for seeding said database(s) for testing. Eventually you’ll need performance optimisations that improve test speed, as opposed to old-fashioned performance optimisations that merely enhance the speed of the consumer-facing product. Adding even more to the costs of testing, you may well need to redesign your consumer-facing product to improve testability, e.g. by adding interface hooks that tests can couple on to or by refactoring the code so that objects use dependency injection. And finally, atop all these upfront costs, there is the recurring price of change: Future modifications or deletions within the main codebase will often necessitate corresponding changes in your tests. All in all, tests don’t come cheap, so the ones you decide to keep around had better pay their way.

Axiom 2: Bugs caught early cost less to fix

In the days of hardware mainframes—before the web and long before continuous deployment was a thing—the financial cost of fixing a bug post-release was estimated to be up to a thousand times that of catching the same bug prior to release. This multiplier has, no doubt, greatly decreased. For one, we no longer need to fly engineers halfway across the globe to install new firmware at clients’ offices. Now we live in the age of continuous deployment and auto updates, with more and more software services residing in the cloud and delivered through APIs.

Despite these innovations, bugs still cost less to repair prior to deployment. This is because production bugs:

  • diminish revenue

  • damage or destroy data that would have otherwise been unaffected

  • precipitate floods of customer support requests

  • tarnish company branding

  • balloon engineering costs since your developers, working overtime, have to locate and then tear out malfunctioning components without putting their patient, a living website, in any danger.

In the most egregious of cases, production bugs might only be noticed months after release. By this point, it’s possible that your team spent the entire intervening period building and building and building atop the old, buggy implementation which must now be scrapped. This could potentially lead to a colossal and demoralising waste of effort.

Axiom 3: The main purpose of your business is to maximise profit

That’s not to say your business doesn’t serve other purposes, but I’m fairly confident that the raison d’être for most businesses is the creation of profit. If you’re running a one-man or one-woman show, define profit to include not just cash but also the other great good: time.

When this third axiom (maximising profit) is accepted in conjunction with the first (the fact that writing tests consumes resources), we see a possible source of tension: Expending resources on tests may diminish take-home profit. This danger, however, is mitigated when we insist that our tests save the business more resources in the long run than they consume; this principle is the ultimate arbitrator of whether any proposed test is worth writing—and whether any existing test is worth maintaining.

With this calculus in mind, how, roughly, does one determine which tests pay their rent and which ones are in permanent arrears? I propose tests should be in place when any of the following apply:

  • Physical or economic harm happens upon feature failure - e.g. If you have a medical records application and a bug might cause information about a patient’s anaesthetic allergies to get mixed up, thereby potentially exposing the patient to death. Or if you build pitch deck software that could potentially leave its users unable to finish their once-in-a-lifetime pitch should your slideshow feature break at an inopportune moment.

  • Legal consequences - e.g. Ensuring that the unsubscribe button works in your email marketing campaigns so as not to infringe anti-spamming laws and regulations; or ensuring that your log writer filters out data protected by privacy laws.

  • Revenue-generating flows - e.g. Obviously your ecommerce shopping cart and checkout flow merit testing, as would your subscription payment modules. You don’t want to risk losing revenue due to technical failure.

  • Landing pages - e.g. The individual product pages in an ecommerce web application or the individual questions and answers in a Quora-like content platform. Landing pages are particularly important to test whenever you use paid marketing, because if they break this would mean the cash you spent on buying traffic would go wasted. As such, prioritise testing landing pages over testing any other pages.

  • Core features that are often used by your customers (and for which you will be contractually liable for damages if you fail to supply them) - e.g. If you provide a translation API that your paid subscribers consume, it had better be working. In my notes-selling business, the provisioning and delivery of digital products are the most critical core features to test.

  • Warning systems for learning of production environment failures - e.g. The “contact us” form customers use to relay their problems and bug reports to your team; and the machinery that delivers exception reports whenever there is a software failure in the wild.

  • Features that protect against irreversible damage - e.g. Functionality that guards against inadvertent deletion or corruption of mission-critical data. In my business, suppliers delete digital download products all the time, but I need safeguards to ensure that my system never removes any product some customer already bought the rights to.

Let me continue by elucidating features I believe aren’t as important to test. Don’t get me wrong: I certainly don’t condone a blanket ban on testing these; rather, I suggest you take pause and deliberate on the cost/benefit ratio of testing them:

  • administrative backend features (e.g. dashboards for you to have an overview of your business)

  • rarely used features

  • presentational features (e.g. pretty formatting of financial figures)

  • “nice-to-have” features (e.g. full text search on a website that already has a satisfactory fallback for taxonomic browsing)

  • “unhappy paths” (e.g. whatever happens when some piece of code that calls a third-party service times out instead of completing successfully).

Axiom 4: Two tests can reveal wildly different amounts of information about the health of your system—choose the most informative

Have you ever brought your production web application grinding to a halt due to a dirt simple, entirely preventable, and wholly embarrassing bug, such as a misspelling of an initialisation constant? Errors like these can be prevented by the quickest of sanity checks: Ensuring the software is capable of booting up before deploying. In plainer words, this is equivalent to checking, “Does this thing switch on?” This check is perhaps the most useful and informationally dense one you could possibly imagine. At least in the run-time compilation madness of the Ruby world, this check will inform you about any missing constants, files, or classes within broad swathes of your codebase. Since many of these errors are fatal (enough to bring your Ruby application to its knees), they constitute a devastating threat. Yet we can become aware of their existence with this ridiculously simple single test. As such, if you’re going to run just one test, ensure that your software boots up.
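To make the “does this thing switch on?” check concrete, here is a minimal plain-Ruby sketch. The helper name boot_failures is my own invention for illustration, and it assumes that loading every file is an acceptable stand-in for booting the app. In a Rails project, the closest equivalent I know of is calling Rails.application.eager_load! from a single test.

```ruby
# Hypothetical helper: tries to load every Ruby file under `root` and
# returns a hash of { path => error } for the files that fail to load.
def boot_failures(root)
  failures = {}
  Dir.glob(File.join(root, "**", "*.rb")).sort.each do |path|
    begin
      load path
    rescue ScriptError, StandardError => e
      # ScriptError covers LoadError and SyntaxError;
      # StandardError covers NameError (e.g. a misspelled constant).
      failures[path] = e
    end
  end
  failures
end
```

An empty hash means everything loaded; anything else tells you which file fell over and why, before your customers find out for you.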

Of course you’ll probably want to test more than just this, which brings us to the next most informationally dense place for testing a web application: the various entry points, as represented by the URLs and APIs your application responds to. To test these, simply call each URL/API method and ascertain whether it also “switches on”. For GET requests, this could be implemented with something as simple as a headless browser that hits these URLs and looks for 200 responses. For POST requests (and friends), you’ll also need to pass in the minimum additional information (form parameters, etc.).
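As a rough sketch of the GET half of this idea, the snippet below swaps the headless browser for Ruby’s standard Net::HTTP client, which is enough to check status codes. The names failing_endpoints and HTTP_STATUS are hypothetical, as are any base URL and paths you feed in; the fetcher is injectable so the check can be exercised without a running server.

```ruby
require "net/http"
require "uri"

# Default fetcher: performs a real GET and returns the numeric status code.
HTTP_STATUS = ->(url) { Net::HTTP.get_response(URI(url)).code.to_i }

# Returns the paths that did not answer with a 200.
# `fetch` can be replaced with any callable that maps a URL to a status code.
def failing_endpoints(base_url, paths, fetch: HTTP_STATUS)
  paths.reject { |path| fetch.call(base_url + path) == 200 }
end
```

Note that this sketch treats redirects as failures and lets connection errors bubble up as exceptions; whether that is what you want depends on your application.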

Along with hitting the aforementioned endpoints, I’d also be quick to test the various emails generated and sent by a web application. I want to know if every one of these emails renders. The failure of any of these tests won’t necessarily pinpoint where the error resides, but it does say, loudly and clearly, “Hey, something is seriously off whenever you travel down this segment of the code”.

Similar to the above is the “one public method per object” rule. Even if an object has ten public methods, there is still a disproportionately high informational return for your first examination of this object. This is because you need to instantiate the object before running any tests, and in doing so you implicitly check whether the instantiation code works. This isn’t the only efficiency either: Often public methods call private methods within an object, and some of these private methods are shared with other public methods. As such, subsequent tests will tell you nothing new about shared initialisation and private functionality, and therefore these extra tests cannot help but yield diminishing returns.

Another class of tests that gives magnificent bang for its buck is anything that loops through entire families of interface-sharing objects in one elegant swoop. For example it’s standard practice in Rails to use test fixtures/factory objects to populate the test database with the various objects needed during testing. Since database objects in a typical Rails application all descend from the same base class, ActiveRecord::Base, they share much of the same interface. In particular, their common ancestor bequeaths the #valid? method to its many children. This little method validates the data in an object, which often entails the calling of quite a few custom functions within each object. By creating a test that instantiates all your fixture-/factory-backed objects and iteratively calls #valid? on each one, you’ll arm yourself with an inordinate amount of information about your software’s health. To requisition Martin Fowler’s Churchillian allusion, “never in the field of software development have so many owed so much to so few lines of code”.
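Here is a minimal sketch of that loop in plain Ruby. The two structs stand in for ActiveRecord models (all they need is a #valid? method), and the FACTORIES registry is a hypothetical substitute for whatever fixture or factory library you use.

```ruby
# Stand-ins for ActiveRecord models; each only needs to respond to #valid?.
Product = Struct.new(:price, keyword_init: true) do
  def valid?
    price.to_i > 0
  end
end

Customer = Struct.new(:email, keyword_init: true) do
  def valid?
    email.to_s.include?("@")
  end
end

# Hypothetical registry standing in for fixtures/factories.
FACTORIES = {
  product: -> { Product.new(price: 100) },
  customer: -> { Customer.new(email: "jack@example.com") }
}.freeze

# Builds every factory object and returns the names of the invalid ones.
def invalid_factories(factories)
  factories.reject { |_name, build| build.call.valid? }.keys
end
```

A single test asserting that invalid_factories(FACTORIES) is empty then covers every model’s validation code in one go.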

Axiom 5: Two tests, verifying the same functionality, may be more or less brittle

Test brittleness is a flaw that can exist within a suite of tests. I define it as “unexpected and unwanted inflexibility”. We all want certain kinds of tests—especially integration tests—to be resilient to changes that are orthogonal to our primary reason for testing a component in the first place. To give an example, if you had a test that’s supposed to ensure your payment processing works, you don’t want that test failing solely because your graphic designer moved the checkout button 2cm to the left.

I’ve found that the tests most likely to suffer from this brittleness are “long setup” tests—ones which elaborately interact with your API before making their final assertions. Because of the sustained nature of the interaction, there are more and more assumptions that have to remain true in order for the test to continue working. To see what I mean, check out this rather involved integration test to guarantee that it’s impossible to create a zero-priced product:

# Factory data
User.create(username: "Jack", password: "secret")

# Headless browser activities (Capybara-style DSL)
visit "/login"
fill_in "Username/Email", with: "Jack"
fill_in "Password", with: "secret"
click_button "Log in"
click_link "New product"
fill_in "Price", with: "0"
click_button "Create Product"
# The validation message must appear in the element with this ID
assert_equal "Invalid", find("#validation_result").text

Before getting to the assertion, this test carries out a hodgepodge of activities, such as clicking on things, filling out form fields, and finding on-page entities by their exact on-page texts. Specifically, these are the details that have to remain true in order for this test to continue passing:

  • A User model must exist

  • The User model must have a #create method that accepts the username and password parameters

  • The login path must be available at “/login”

  • The login page must have form fields with the exact text “Username/Email” and “Password”. It must also have a button with the exact text “Log in”

  • On the next page there must be a link containing the exact text “New product”

  • The following page must have a “Price” field and a button with the text “Create Product”

  • Pressing this button needs to bring you to another page which contains the HTML ID “#validation_result”

  • This HTML ID must contain the exact text “Invalid” whenever a product is invalid for the reason under test

  • Etc.

For comparison’s sake, here’s the same functionality tested with a unit test. You’ll notice that far fewer assumptions need to remain true for this unit test to continue working. And it will be apparent that this feature should have been tested with unit tests instead of with integration tests.

product = Product.new(price: 0)
# A zero-priced product must fail validation
assert_equal false, product.valid?

Every extra assumption made by your tests is a burden. The point of the above test is to verify that we cannot create a zero-priced product. It would be hugely frustrating if this test started failing when someone changed the button’s text from “Create Product” to “Generate Product”. Both before and after this textual change, any human tester would instantly recognise there is a button he or she must press to progress the flow. But an integration test, hardcoded to click only on the button that exactly reads “Create Product”, cannot think outside its exacting box and thus will topple following this cosmetic modification.

As your test suite grows and grows, failures due to brittleness become more frequent and more annoying. So what precautionary measures can we take to prevent these problems from cropping up?

Suggestion 1: Never change your interfaces

Unsurprisingly, we can slam the brakes on brittleness simply by leaving interfaces alone. (I’m using the term “interface” to refer to whatever the test hooks onto in order to connect with your code, i.e. button text, CSS selectors, and method signatures.)

By resisting change we harden up our tests, although this resilience may carry a price tag, which takes the form of dated and confusing class/method names, imperfectly labelled buttons, and lost conversion-optimisation opportunities (due to badly labelled buttons/links in your frontend pages).

Suggestion 2: Minimise the surface area your tests speak to

Let me begin by recounting an outrageously poor testing strategy I once had the displeasure of witnessing: A programmer is tasked with writing a test for a password reminder email. It’s late and the feature is due before he leaves work, so he decides to save himself a few minutes by writing a test that asserts equality of the email body against… the entire email body! That’s right, every single word of the email ought to be present. Hey presto! The test passes and he’s off home.

Of course this test is insanely brittle. Even the tiniest change to the email body—e.g. fixing a typo, tweaking the formatting—will cause his criminally parochial test to fail.

This programmer should have asked “What’s the smallest yet most telling aspect of the email I can base my test on?” This would have led him to doing something like asserting that the email body contains the URL for resetting a password, since this is simultaneously the USP and the MVP of this feature. Such a rewrite would make his test depend on a far smaller surface area, and as a result the test would be correspondingly better equipped to weather foreseeable future changes.
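A minimal sketch of the smaller-surface-area version follows. The mail builder, the example.com URL, and the /password_resets/ path are all hypothetical; the point is only that the check looks for the reset link and nothing else.

```ruby
# Hypothetical mail builder; the wording is free to change without
# breaking the check below.
def password_reset_email(username, token)
  reset_url = "https://example.com/password_resets/#{token}"
  <<~BODY
    Hi #{username},

    Someone asked to reset your password. If that was you, follow this link:
    #{reset_url}

    If it wasn't you, just ignore this email.
  BODY
end

# The one thing the feature cannot work without: the reset link.
def contains_reset_link?(body, token)
  body.include?("/password_resets/#{token}")
end
```

Fix a typo or rewrite the greeting and this check keeps passing; drop the link and it fails, which is exactly the failure worth hearing about.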

Like the ballet dancer whose outstretched foot just barely touches her partner in front of her, a resilient test ought to just barely touch (or “couple”) with your interface.

Suggestion 3: Add an unchanging interface for the sole purpose of testing

There is much in a website, especially in its frontend, that is liable to change. To name but one example, take the frontend text (for buttons, links, H1s, etc.) Given the remarkable effectiveness of conversion optimisation at boosting profits, it’s fair to assume that you’ll want the flexibility to freely experiment. This leads us to conclude that binding test interfaces to the shaky foundation that is on-page text is functionally equivalent to putting a massive down payment on a whopping dose of brittleness.

But there is hope. What if you were to add a set of unchanging hooks to the UI, secure iron handles that tests can hold on to while the website evolves? By way of example, you might give your “Create Product” button the HTML ID “#create_product” and then program your tests to interface with this ID instead of with the mercurial button text. As long as you promise yourself never to change these old IDs, your tests will continue passing. The difference is that now your marketing team is free to audition various calls-to-action without putting your test suite in jeopardy. You have introduced orthogonality.
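To illustrate the difference, here is a toy sketch that uses plain regular expressions as a stand-in for a real browser driver (the two HTML snippets and helper names are invented for the example). In an actual Capybara suite you would not need any of this, since click_button accepts an element’s id as well as its text.

```ruby
# Two versions of the same page: marketing changed the button copy,
# but the id="create_product" hook stayed put.
PAGE_BEFORE = '<form><button id="create_product">Create Product</button></form>'
PAGE_AFTER  = '<form><button id="create_product">Generate Product</button></form>'

# Brittle locator: depends on the exact visible text.
def button_by_text?(html, text)
  html.match?(%r{<button[^>]*>#{Regexp.escape(text)}</button>})
end

# Resilient locator: depends only on the stable id.
def button_by_id?(html, id)
  html.match?(/<button[^>]*\bid="#{Regexp.escape(id)}"/)
end
```

The text-based locator finds the button on the first page only, whereas the id-based locator finds it on both.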

Suggestion 4: Inject constants you believe will vary throughout your product’s lifetime

Let me start by immediately undoing the paradox in the title of this section. Imagine your website pays royalties to suppliers and you want to test this feature. Assume too that the royalty you pay is 50%. The naive way to test this feature would be to buy a product worth £100 and then assert that a royalty payment of £50 gets sent. The issue with this approach is as follows: Your business is likely to tinker with its royalty amounts. Today they are 50% but tomorrow they might be 70%. If your tests assume the original royalty percentage was set in stone, then you’ll be looking at broken tests in the future. It’ll be certain breakageddon.

The way to avoid this trap is to parameterise the royalty rate so that you’re able to hold it constant in your test environment while letting it vary in production. To do this, you’ll want to somehow inject the royalty percentage rate into your tests. This could be done at the object level (e.g. you pass an optional ROYALTY_RATE parameter to the royalty calculation code), or it could be done at the application level (e.g. via a configuration system that can accommodate differing royalty rates depending on whether the software is run in test or production environments). With injection systems like this in place, you’ll be able to peg the royalty rate to a convenient number (like 50%) that will remain unchanged forever more. This would insulate your tests from break-inducing fluctuations that occur in your production environment.
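Here is a minimal sketch of the object-level variant. The class name, the rate: keyword, and the choice to work in integer pence with a Rational rate are all my assumptions for the example, not a prescription.

```ruby
# Hypothetical calculator: the rate is injected rather than hardcoded.
class RoyaltyCalculator
  DEFAULT_RATE = Rational(1, 2) # assumed production default of 50%

  def initialize(rate: DEFAULT_RATE)
    @rate = rate
  end

  # Prices are integer pence to avoid floating-point rounding surprises.
  def royalty_for(price_pence)
    (price_pence * @rate).round
  end
end
```

A test would build RoyaltyCalculator.new(rate: Rational(1, 2)) explicitly and assert that a £100 sale (10,000 pence) yields 5,000 pence, no matter what the production default becomes next quarter.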

Suggestion 5: Be intentionally and artfully underspecific with your test expectations

Computers and their programmers emphasise precision in their approaches to problem solving. But sometimes this emphasis can end up misplaced and cause problems down the line.

One such example of overbearing precision is tests that expect data to be just so, even though that data is only tangentially relevant to the feature under scrutiny. Take, for example, an integration test of a feature that prints out a CSV of royalty calculations. (Assume that elsewhere there are unit tests which verify that the royalty calculations are accurate.) If your CSV integration test expected the first row of the output file to contain the figure £45.15, then this test is too strongly coupled to the royalty size. Given that the mathematical correctness of the royalty calculation is already verified in your unit tests, it’s wasteful to repeat this specific expectation here. You’ve created an unnecessary overlap, a wasteful injection of brittleness. It would have been more than sufficient for your integration test to merely verify that there was a single row in the CSV file output, leaving out unnecessary specifics about the royalty size.

This pattern of underspecifying can apply in all sorts of ways. Instead of asserting equality of numerical amounts, you might simply check that they are within certain bounds (e.g. between 2 and 3) or meet certain conditions (e.g. the amount of royalty was less than the product price). Instead of checking that string data is exactly matched, you might instead assert that it merely contains a certain keyword.
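The sketch below shows what such loose expectations might look like for the CSV example. The column layout (product, price, royalty) and the helper name are assumptions made for illustration; the exact royalty figure is deliberately never asserted.

```ruby
# Returns a hash of loosely specified checks against a CSV export.
# Assumes a header row followed by rows ending in price and royalty columns.
def underspecified_checks(csv_output)
  header, *rows = csv_output.lines.map(&:chomp)
  price, royalty = rows.first.split(",").last(2).map { |value| Float(value) }

  {
    one_data_row: rows.length == 1,                    # structure, not content
    royalty_positive: royalty.positive?,               # a condition, not a figure
    royalty_below_price: royalty < price,              # a bound, not a figure
    header_mentions_royalty: header.include?("royalty") # a keyword, not the full string
  }
end
```

Whether the royalty in the row is 45.15 or 63.21, all four checks still hold, so a change to the royalty rate no longer topples the integration test.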

Since we’re on the topic of string matches, I’d like to bring up another example of misplaced precision: Tests that interact with your program by latching onto a character-for-character, case-sensitive match of a button’s text. Were this button’s text to change—even slightly—the tests depending on it would break.

A human manually testing an application would be unlikely to be tripped up by this cosmetic change, which leads me to ask: Why do humans and machines depart in their abilities to overcome ambiguity? If we answer this question, we might arrive at some ideas for designing the next generation of mega-resilient tests. Here goes: I know that, after I’ve filled out a form on a web page, the button that is the most likely candidate for submitting the data is the final one contained within the form. This leads me to think that integration tests could be programmed to click the last button in a form, thereby avoiding the need to interface with any specific text/HTML ID that might change. By being underspecific with test expectations, the test may end up more resilient.

Let’s look at another example in our search for heuristics: If it were my job as a manual tester to locate the page containing recent orders in an unknown new CMS, I’d find it by clicking around the various admin pages until I happen upon a link saying something like “orders” (or until I arrive at a page titled “orders”, or a page with H1 text that read “orders”, etc.). Perhaps the next generation of testing tools will work in a similar manner, exploring the UI using common sense heuristics, thereby obviating the need for us programmers to be so darned specific. At the very least, I’d like to see tests that attempt to repair themselves after breaking.

Suggestion 6: DRY up repeated test steps

This suggestion is so obvious that I cringe at including it, but I’ll leave it in anyway, just for the sake of completeness: Testers ought to group frequently used testing steps (“buying a product” or “logging in as admin”) into reusable, shared methods. I firmly believe that test code should be no less DRY than production code. By keeping your test code DRY, the potential damage caused by interface changes is minimised and contained.
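For completeness, here is a minimal sketch of one such shared step. The SharedSteps module and RecordingBrowser class are hypothetical; the latter exists only so the example runs without a real browser, and it mimics the three Capybara session methods used (visit, fill_in, click_button).

```ruby
# Hypothetical step helpers shared by many tests. `browser` is anything that
# responds to visit/fill_in/click_button.
module SharedSteps
  def log_in_as(browser, username, password)
    browser.visit "/login"
    browser.fill_in "Username/Email", with: username
    browser.fill_in "Password", with: password
    browser.click_button "Log in"
  end
end

# Minimal recording stand-in so the sketch runs without a real browser.
class RecordingBrowser
  attr_reader :steps

  def initialize
    @steps = []
  end

  def visit(path)
    @steps << [:visit, path]
  end

  def fill_in(field, with:)
    @steps << [:fill_in, field, with]
  end

  def click_button(label)
    @steps << [:click_button, label]
  end
end
```

If the login form’s labels ever change, there is now exactly one method to update rather than one per test.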

Axiom 6: Conventional software tests aren’t suitable for catching all kinds of bugs

Techies are highly attuned to “pure software” failures—fatal errors that take down servers, exceptions causing localised confusion, and other sorts of incorrect software behaviour. But too narrow a focus on this class of errors can cause us to miss other, equally damaging problems with our system. Here’s my take on the problems that are not yet possible to automate away.

1. Text, typos, contractual terms, and continued accuracy of communications:

Within this category I’m putting things like the instructions on your on-site forms, the sales copy on your landing pages, the legal provisions within your terms and conditions, the text within your transactional emails, and the promises you make in your advertisements. All of these texts can harbour typos or—far more problematically—inaccurate or outdated information.

Example time: Let’s say you sell bookkeeping software. Because tax law changes year on year, your customers care greatly about having the most recently updated product. Knowing this, you create Google AdWords campaigns that advertise your software using phrases like “updated in 2015”. You create these ads once, see that they are profitable, then let them run indefinitely. The next year rolls in and you update your software and also tweak your website so that it advertises your latest 2016 update. But you forget about updating your online adverts. (Your ad blocker precludes you from seeing them.) Now your advertisements will be little more than an expensive means to drive customers to your competitors, whose advertisements do promise compatibility with 2016 updates to tax law. Since this textual bug strikes without explosively announcing its presence through exceptions, downtime, or complaints from enraged customers, it can go unnoticed for quite some time, insidiously siphoning your website’s revenue.

Automated software tests aren’t realistically going to catch these kinds of errors. Sure, there are semi-automated solutions for catching particular types of text errors (e.g. spellcheckers help you catch simple typos). But we are nowhere near having a general solution for rooting out bugs from text. The best preventative today is still a second pair of eyeballs that proofreads your copy and ensures continued future accuracy through a prospective timetable of scheduled reviews.

2. Security issues

In the mental rush of building our software’s unique functionality, we programmers sometimes forget how important it is to build accompanying protections against malicious use and abuse. Remember: Security holes can outrank even the most catastrophic of classical bugs in their destructiveness. To name but a few of the most common issues:

  • Are passwords, credit card details, patient medical histories, and other sensitive details filtered from logs?

  • Is this same sensitive information, when stored in databases, encrypted so that a malevolent team member (or a freelancer turned rogue) is prevented from leaking it or selling it on the dark net?

  • Are the form fields where customers interact with the website screened for susceptibility to SQL injection attacks? SQL attacks can cause your web application to reveal secret information, allow dangerous edits, or even leave its database vulnerable to being deleted.

  • Are the web pages that render data previously entered by users (as in the forms above) screened for vulnerability to Cross-Site Scripting (XSS) attacks? If not, a hacker can inject malicious JavaScript into the pages of your website that other users/admins will visit. This potentially allows the hackers to steal these users’ (or administrators’) cookies, enabling the hacker to impersonate their victim vis-à-vis your website.

  • Are you keeping sensitive information, which should be hidden from users, in your cookies? Remember cookies can be easily read by the users whose computers they reside on.

  • Is your website open to Cross-Site Request Forgery (CSRF) attacks, in which an attacker on another website exposes one of your users (who is still logged in to your website) to some code that executes requests back to your website? Catastrophic results await your users following malicious requests such as ones transferring money from the victim’s bank account to the attacker’s, or ones modifying the victim’s email credentials to those of the attacker, thereby giving the hacker access to the victim’s identity following a password reset.

  • Does your software have mechanisms to prevent a hacker from brute forcing the passwords for your user accounts?

  • Are any of the libraries you rely on (especially those to do with user authentication) compromised and therefore in need of a security fix?

Increasingly there are automated tools for identifying security holes, such as Tarantula and Brakeman in the Rails world. If available, these libraries are great places to start and you might consider including them within your automated test suites. But for the most part, you’re going to have to audit your application (and each additional feature) manually.
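Some items on the checklist above do lend themselves to ordinary tests. Taking the first one (keeping sensitive details out of logs), here is a minimal plain-Ruby sketch; the key names and the filter_for_log helper are hypothetical. Rails ships its own mechanism for this via the config.filter_parameters setting, so in a Rails app you would test that rather than roll your own.

```ruby
# Assumed list of keys that must never reach a log file.
SENSITIVE_KEYS = %w[password card_number medical_history].freeze

# Returns a copy of `params` that is safe to write to a log.
# Nested hashes are filtered recursively.
def filter_for_log(params, sensitive_keys = SENSITIVE_KEYS)
  params.to_h do |key, value|
    if value.is_a?(Hash)
      [key, filter_for_log(value, sensitive_keys)]
    elsif sensitive_keys.include?(key.to_s)
      [key, "[FILTERED]"]
    else
      [key, value]
    end
  end
end
```

A test then only needs to push a realistic parameter hash through the filter and assert that none of the secret values survive.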

3. Inadvertent data bleeding

In business, information asymmetry is power, so it’s often in your interest to conceal data from outsiders. For one, you don’t want your competitors knowing how many customers you have, lest they see through your bluffs.

What’s more, you don’t want brattish startups poaching your users. Do not underestimate the extent to which competitors will push and prod at your site, probing for valuable information or leads they can steal. Many a website owner has fallen victim to crawlers that systematically message everyone on their system with the message “Hey, have you checked out websiteXYZ? They’ve got an even better marketplace than this one. x”. This practice is, of course, illegal, but good luck trying to prove the other website’s guilt. It’s the Wild West online, and it’s up to you to secure the locks.

There are plenty of other reasons to curb data bleeds: To avoid revealing private information, to stymy hackers/web crawlers, and to enforce the paywalls guarding your content.

Now that we’ve seen the value in stopping data bleeds, let’s examine what I believe to be the most commonly overlooked source of these problems: guessable, information-revealing URLs. For example, the URL “/orders/ireland/653” tells me that there were likely 653 orders in Ireland, just as “/users/32” tells me that there are at least 32 users in your database. A few years back I bought a digital download VST (software that simulates a musical instrument). After paying, I was redirected to a page “/orders/145”. Out of curiosity, I typed in the URL “/orders/144”, and sure enough, I ended up on a page with $500 worth of music software available for download. Because the sequence within the URL was easy to guess, I was able to gain access to content that was otherwise hidden. (Of course this website was doubly flawed: It should have had some privilege-checking/authorisation code to prevent me from accessing order pages that weren’t mine.)
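Both flaws can be sketched in a few lines of plain Ruby. The Order struct, its field names, and the /orders/ path below are hypothetical; the two ideas on show are a random token in the URL (via the standard SecureRandom library) and an ownership check before rendering.

```ruby
require "securerandom"

# Hypothetical order record: the public URL uses a random token, not the row id.
Order = Struct.new(:id, :owner_id, :token, keyword_init: true) do
  def self.build(id:, owner_id:)
    new(id: id, owner_id: owner_id, token: SecureRandom.hex(16))
  end

  # Reveals neither the order count nor a guessable neighbour.
  def path
    "/orders/#{token}"
  end

  # The authorisation check: only the buyer may view the order.
  def viewable_by?(user_id)
    owner_id == user_id
  end
end
```

The token keeps the URL from advertising how many orders exist, but it is the viewable_by? check that actually guards the downloads; I would want both covered by tests.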

Another cause of data bleeding is when you render information valuable to unscrupulous competitors in easily scraped HTML text. For example, a naive platform that connects two sides of a marketplace might display the contact details of the suppliers in HTML instead of mediating access through a contact form. These HTML-based email addresses, phone numbers, and full names can then fall easily into competitors’ hands.

If you don’t want to mediate all these initial communications through your contact forms, you might consider rendering the information in a more-difficult-to-scrape image or through some JavaScript function that does its best to hide its content from non-human readers.

Yet another cause of data bleeding is when your HTML and JavaScript source code leaks revealing tells. For example, web application frameworks sometimes generate HTML IDs based on database information (e.g. “#user_324”). A savvy competitor need only inspect the HTML to figure out the size of the web app’s user base. For a leak that’s more financially damaging, I’d like to recount my experience with a website for buying audio loops (the auditory equivalent of stock photos). This website lets visitors listen to audio loops before they choose to buy them. Upon inspecting the HTML on one of these pages, I discovered that the JavaScript for their sound player referenced an MP3 file. By CURLing this, I was able to download the loop without paying. This company made no sophisticated efforts to hide its data, and a malicious hacker could have downloaded the entire library with a powerful enough scraper.

4. Visuals

Today, with so many browsers and internet-enabled devices equipped with a panoply of screen sizes, the number of ways your website’s design can fail is mind-boggling. As such, one of the most basic rudiments of releasing any site today is to test it in multiple browsers, devices, and screen sizes before showing it to the public.

I didn’t always heed this advice: Some time back I developed a site in Chrome and got it so everything looked crisp. I launched, proudly broadcasting my creation to everyone I knew, and then celebrated by going out to a restaurant with friends. Everything was great until my phone started ringing. “Hey, that website you announced on Facey-B doesn’t work on Safari…” Feeling the embarrassment swell, I abandoned my meal, ran home, and worked into the wee hours to repair the design and salvage as much of my slaughtered reputation as possible.

Thorough visual testing is demanding, so it’s worth having a rule of thumb for when it’s necessary to expend the effort. I’d argue that this is the case when there’s a risk of total design failure, i.e. a visual bug so bad that a set of visitors won’t be able to use the site at all. By contrast, mild visual issues do not warrant this degree of testing.

As for when there is a chance of total visual failure, this is mostly after initial release or after any redesign that modifies the layout (i.e. that modifies the outermost frames within which the other content sits).

And oh, did I mention that this same point applies to automated emails too? As you are probably aware, emails display differently depending on how and where they are viewed. Therefore we ought to make sure our emails look good across the following permutations:

  • webmail vs. in an app vs. desktop client

  • mobile vs. tablet vs. laptop

  • image-display enabled vs. image-display disabled
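If you want to see how quickly these permutations multiply, a few lines of Python will print the full checklist. This is only a sketch; the category labels are copied from the list above and the function name is my own invention.

```python
from itertools import product

CLIENTS = ["webmail", "app", "desktop client"]
DEVICES = ["mobile", "tablet", "laptop"]
IMAGES = ["images enabled", "images disabled"]


def email_test_matrix():
    """Return every client/device/image-setting combination to eyeball."""
    return list(product(CLIENTS, DEVICES, IMAGES))


# Print a manual-testing checklist, one line per combination.
for client, device, images in email_test_matrix():
    print(f"[ ] {client} / {device} / {images}")
```

That is 3 × 3 × 2 = 18 combinations per email template, which is a good argument for only running the full matrix after a layout change.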

5. Tracking

It seems like everyone on the internet uses Google Analytics, at least in part, to handle their tracking needs. While the software is, for the most part, stellar, it suffers from the lack of options to repair/remove corrupted historical data.

I’ll share another story from my naive days: My early implementation of Google Analytics ecommerce tracking was botched such that my website would count revenue twice whenever the customer refreshed the “/thank_you_for_ordering” page. This caused my Google Analytics ecommerce data to show inflated revenues. And because of this tainted data, future analysis of larger time periods containing the dates when the bug was active would also show incorrect figures; the damage to my data was irreversible.

These inaccuracies are annoying when you’re aware that your data was contaminated, but they are downright dangerous when you—or another member of your team—forget the data was damaged and proceed to make business decisions based on fantasy figures. For example, your company might calculate how much to bid on advertising based on ecommerce sales data in Analytics. If your sales figures were inflated, this would lead you to bid more than the actual sales permit. Because of the tight margins in advertising, this inflated bid might then tip the advertising campaign from profit- to loss-making.

How can one avoid bugs in tracking? The most pragmatic, easy-to-implement piece of advice I’ve ever heard is to schedule a sanity check soon after a deploy affecting tracking code. For example, immediately after deploy, you would open up Analytics and compare the results calculated by your web application (e.g. the revenue generated) with the results recorded in Analytics.
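The comparison itself can be as crude as the sketch below. It assumes you have already obtained the two revenue figures by whatever means suits you (a database query on one side, a report export on the other); the function name and the 1% tolerance are arbitrary choices of mine.

```python
def revenue_discrepancy(app_revenue, analytics_revenue, tolerance=0.01):
    """Return the relative gap if it exceeds the tolerance, else None.

    The gap is measured relative to the web application's own figure,
    which is treated as the source of truth.
    """
    if app_revenue == 0:
        return None if analytics_revenue == 0 else float("inf")
    gap = abs(analytics_revenue - app_revenue) / app_revenue
    return gap if gap > tolerance else None


# A double-counting bug shows up as a gap of roughly 100%.
print(revenue_discrepancy(1000.0, 2000.0))
```

A small tolerance is worth allowing for, since ad blockers and timing differences mean the two figures will rarely match to the cent.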

More proactively, you could configure the tracking platform to email you a daily report containing the figures most important to you. These you could cross-check elsewhere. Additionally you could ask the tracking platform to send you alerts when certain figures fall way outside their usual ranges (e.g. if no sales are reported in the past X hours on a website that usually has a sale every minute). Anomalies like these most likely indicate that something has gone wrong with tracking as opposed to the market. As such, these alerts help you spot issues that were so unpredictable that it never even occurred to you to keep an eye on Analytics right after deploying. In other words, these alerts cater to the unknown unknowns.
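If your tracking platform doesn’t offer such alerts, the “no sales in the past X hours” rule is simple enough to sketch yourself. The version below assumes you can get hold of a list of recent sale timestamps; the function name and the one-hour default are hypothetical.

```python
from datetime import datetime, timedelta


def sales_gone_quiet(sale_times, now, max_gap=timedelta(hours=1)):
    """True if the most recent sale is older than max_gap, or there are none."""
    if not sale_times:
        return True
    return now - max(sale_times) > max_gap


now = datetime(2024, 1, 1, 12, 0)
recent = [now - timedelta(minutes=5)]
stale = [now - timedelta(hours=3)]
print(sales_gone_quiet(recent, now), sales_gone_quiet(stale, now))
```

Run something like this on a schedule and have it email you whenever it returns True.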

If the last few suggestions seem more like damage control than prevention, it’s because they are; this shortcoming is justified because the aforementioned techniques are relatively easy to implement. That said, teams who are more dependent on accurate data are welcome to pursue more involved precautions, such as pulling all their Google Analytics data into their own system and saving a local copy. This would grant them the hitherto missing power to turn back time and repair damaged data.

6. Chaotic algorithms

Here I mean chaos in the sense of chaos theory, i.e. that which is unpredictable despite being deterministic. A common programmatic example would be search algorithms that rank results by taking into account tens of factors (e.g. text content, searcher location, temporal recency). Due to the imposing number of factors influencing ranking—and the immense variety of values each of these factors may assume—it becomes near impossible for automated tests to validate the continued correctness of an evolving search engine algorithm. Put another way, small changes to your feature set (e.g. the addition of a new ranking factor or a reweighing of an old one) may cause large changes in expected rankings, thereby eroding the value of existing automated tests.

What can one do? Those with luxurious budgets—like Google—hire teams of quality assurance staff who manually evaluate various search results for relevancy. This effort is then supplemented with machine-learning algorithms that do all sorts of things, like checking whether a user stopped searching after seeing a certain result—an indicator that the result shown was satisfactory.

But we’re penny-scrimping web application owners, not Googles. The best I could manage within my budget was to schedule manual tuning sessions in which a developer would spend a day with my on-site search engine, evaluate the suitability of the rankings, and then tweak to taste.

Axiom 7: Divergence between test and production environments creates blindspots for your tests

Seeking convenience in test creation, configuration, performance, and cost, programmers may only approximate their production environment in their tests. Maybe the clunky-to-install Postgres database is replaced with an in-memory database simulation. Perhaps parallel processes which support the main software (e.g. Memcached for caching, Redis for counters, and Solr for full-text search) are abandoned in testing because the team doesn’t want the hassle of initiating five processes every time it wants to run a test. Or maybe the variety of paid SaaS services in production is stubbed out in the testing environment so as to avoid paying twice as much in subscription fees. Or perhaps the bulk of the testing happens on the developers’ MacBooks, whereas the real software runs on Linux servers.

The examples I’ve given thus far are obvious, in the sense that you know full well that your tests deviate from the production environment. But there is a more insidious source of divergence, one that you are less likely to keep at the forefront of your mind: differences in data/state. Your production database might hold many gigabytes of user-contributed data—complete with all sorts of malformations like unparseable characters—whereas your test database probably only holds a few kilobytes of artificially well-formed data. This difference means your tests may misrepresent how well your website functions with realistic loads of honest, gritty, real-world data.

So, given all this, what techniques can you adopt to help you improve your testing?

1. (Obviously) mirror the production environment in your tests

The unsurprising, although not necessarily easy, solution to the “obvious” problems above is to replicate the production environment to the fullest extent possible in your tests. Install Postgres; figure out how to quickly initiate all the subsidiary processes whenever tests begin; take the financial hit and purchase an extra testing-account subscription for each integrated SaaS service; test on a Linux box running in the cloud rather than on your MacBook.

2. Exclusively use tools/libraries that work cross platform

The standard libraries of programming languages like Ruby or Python mediate access to operating system functionality (file creation, process management, etc.) so that programs availing of this functionality will do exactly the same thing regardless of operating system. By relying on these conveniences instead of writing platform-specific code, you insulate yourself against annoying disparities.
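As a small illustration in Python, the sketch below builds a file path and writes to a temporary directory using only the standard library, with no hard-coded path separators or OS-specific temp locations. The function name is made up for the example.

```python
import tempfile
from pathlib import Path


def write_report(directory, name, text):
    """Write text to directory/name and return the resulting path."""
    path = Path(directory) / name  # pathlib picks the right separator per OS
    path.write_text(text, encoding="utf-8")
    return path


# tempfile chooses a suitable scratch location on whichever OS this runs on.
with tempfile.TemporaryDirectory() as tmp:
    report = write_report(tmp, "report.txt", "all good")
    print(report.read_text(encoding="utf-8"))
```

The platform-specific alternative would be something like concatenating “/tmp/” with a filename, which works on your MacBook and on Linux but not everywhere.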

Leaving aside programming languages and focusing instead on programs, it’s worth remembering that you’ll sometimes find familiar programs rebuilt for cross-platform compatibility. For example, the text stream editor sed that ships with OS X has a different command-line interface from its namesake on Linux. But there is another program for OS X called gsed (GNU sed) that behaves the same as sed on Linux, making it a much better candidate for portable code.

3. Police your production server for data integrity

You’re about to deploy a new feature to the Product model of your legacy ecommerce web app. This feature necessarily makes assumptions about your data. In particular, each Product instance is assumed to have a parent Owner object. Your test fixtures encode this assumption, so it’s no surprise that all your tests pass with luscious green. But once you deploy you’re in for a shock: 0.01% of the Product objects do not have associated Owner objects, and these orphans trigger a trickle of exceptions.

It may not have been unreasonable to have assumed that every Product instance had an associated Owner object. You might have checked the code and seen that there were heaps of validations precluding the existence of ownerless products. However, if you thought about this more carefully, you’d realise these code-level checks cannot (at least without augmentation) guarantee anything about the data. What would happen, for example, if the data was saved before the validations were even added to the Product model? Or what would happen if someone directly modified the data in the database instead of via the Product model?

The solution to this class of problems is to take an active approach to policing data integrity. This could be achieved by means of daily scripts that validate your data and by the addition of copious amounts of database constraints (as opposed to code-level constraints). Now for the sake of argument, let’s say you didn’t get around to implementing the aforementioned practices but still had a deploy coming up. Here, the next best approach would be to do a narrow sanity check on the subset of the production data pertaining to the code about to be deployed. In our example above, you’d simply craft a database query to check whether all products do indeed have owners.
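Here is roughly what that narrow sanity check might look like. The sketch uses Python’s built-in sqlite3 module purely so that it is self-contained; against production you would run the same kind of query on your real database. The table and column names (products, owners, owner_id) are assumptions based on the example above.

```python
import sqlite3

# Finds products whose owner_id is NULL or points at a non-existent owner.
ORPHAN_QUERY = """
    SELECT products.id
    FROM products
    LEFT JOIN owners ON owners.id = products.owner_id
    WHERE owners.id IS NULL
    ORDER BY products.id
"""


def orphaned_product_ids(conn):
    """Return the IDs of all products without a matching owner row."""
    return [row[0] for row in conn.execute(ORPHAN_QUERY)]


# Tiny stand-in dataset: one healthy product and two kinds of orphan.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE owners (id INTEGER PRIMARY KEY);
    CREATE TABLE products (id INTEGER PRIMARY KEY, owner_id INTEGER);
    INSERT INTO owners (id) VALUES (1);
    INSERT INTO products (id, owner_id) VALUES (10, 1), (11, NULL), (12, 99);
""")
print(orphaned_product_ids(conn))  # [11, 12]
```

If the list comes back empty, deploy with a clear conscience. If not, you have found your 0.01% before your users did. The longer-term fix is a NOT NULL column with a foreign key constraint, which makes the orphans impossible rather than merely detectable.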

Axiom 8: Test creation time is lopsided toward a project’s beginning

This point is mostly for inspiration during those trying times when you feel like you are drowning in a never-ending sea of test creation and maintenance. Let me explain: The first few weeks of any new programming project contain a disproportionate amount of boilerplate work, such as installing bog-standard dependencies, configuring impossibly dull settings, provisioning stubborn servers, and amassing credentials from an endless stream of third-party service providers.

Once all this tedium is past you, you get to begin the actual work of building features. Usually it’s the case that a business prioritises the programming of features that are central to its ability to generate value and public interest. Given that new businesses are usually looking for some sort of edge over incumbent players on the market, it’s likely that their first few features will be rather unique. This uniqueness necessitates a good chunk of thinking and figuring out how to test.

Let’s look now at the trajectory of test additions. At project birth, a testing harness is laid in place, complete with separate test-only databases, data fixtures, convenience methods, and so on. The first test cases for the project are probably going to be the most difficult to write because (1) you’re working with your unique, special sauce features; and (2) you need to figure out a convenient way to test whatever core functionality the software is offering. Say, for example, that your software outputs Excel files. It’ll be disproportionately time consuming to write the first test that reads these files and determines whether they match your expectations. But subsequent tests which merely reuse this mechanism will be much, much quicker to write.

In conclusion: The economics of testing demands herculean effort at the beginning, but this heaviness soon subsides. Hang in there.

Axiom 9: Tests double in value with each additional programmer on a team

Imagine you’ve spent five years maintaining your web app. By then you’d be thoroughly familiar with all its quirks and idiosyncrasies. It’d be patently obvious to you that, say, the royalties system is still entangled with the discounts system, and that any changes to one will necessitate changes to the other. God knows, that one has come back to bite you enough times…

This tacit knowledge (about coupling, about program components that resist change, about gotchas, and about overarching architecture) is lost on programmers new to your project. Sure, documentation and code comments go some of the way to communicate these insights, but, let’s face it, five years’ worth of intuition is not easily summarised.

This dissipation of valuable experience can be (somewhat) mitigated through automated tests. Essentially you train your software to complain by flashing a red light whenever a darling principle is breached or whenever an entanglement is ignored. The ghost of the old programmer thereby guides the new.

A word about why tests are a superior way to encapsulate this kind of knowledge: Unlike documentation, tests have teeth. If a future programmer ignores the explicit advice written into the documentation, nothing happens. If the same programmer ignores the implicit advice encoded into a test, that programmer will have some loud failures to answer to.
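To illustrate with the royalties/discounts entanglement from earlier, here is a sketch of a test that gives that piece of tacit knowledge some teeth. The business rule it encodes (royalties are owed on the discounted price, not the list price) is a hypothetical one I’ve made up for the example, as are all the function names. Amounts are in integer cents to sidestep floating-point rounding.

```python
import unittest


def discounted_price_cents(list_price_cents, discount_percent):
    """Price the customer actually pays, in cents."""
    return list_price_cents * (100 - discount_percent) // 100


def royalty_due_cents(list_price_cents, discount_percent, royalty_percent):
    """Royalty owed on a sale, computed from the price actually paid."""
    paid = discounted_price_cents(list_price_cents, discount_percent)
    return paid * royalty_percent // 100


class RoyaltiesFollowDiscounts(unittest.TestCase):
    def test_royalty_is_based_on_discounted_price(self):
        # 100.00 at 20% off sells for 80.00; a 10% royalty on that is 8.00.
        # If this fails, someone changed discounts without updating royalties.
        self.assertEqual(royalty_due_cents(10000, 20, 10), 800)
```

Saved to a file and run with “python -m unittest”, this test will complain the moment a newcomer reworks discounts in a way that silently changes what royalty holders are paid.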

Axiom 10: Even if you don’t know it, you are already testing…

When programming in an interpreted language, you’ll often keep the interactive console open and type in quick commands and then inspect the output. This is nothing less than a once-off test. Now, doesn’t the fact that you went out of your way to probe this functionality in the console indicate that its veracity is something worth monitoring? Why not, then, copy the same lines over into a dedicated test and ensure that this little sanity check is run every time, not just this one time? TL;DR: Recycle your console tests.
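Python happens to make this recycling almost literal through its standard doctest module, which treats console transcripts pasted into a docstring as tests. The sketch below assumes a hypothetical slugify function that you had been poking at in the console.

```python
def slugify(title):
    """Turn an article title into a URL slug.

    The two examples below are pasted straight from a console session:

    >>> slugify("Hello World")
    'hello-world'
    >>> slugify("  Testing   Web Apps ")
    'testing-web-apps'
    """
    # Lowercase, split on any run of whitespace, and join with hyphens.
    return "-".join(title.lower().split())
```

Running “python -m doctest” against the file containing this function replays the console session and reports any output that no longer matches.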

Aside: This style of programming (and therefore the possibility of recycling) assumes your functionality is bundled up in console-testable objects instead of being only accessible through a more difficult-to-script web GUI. Said differently, for this recycling to work, your code ought to already be designed for testability (as I expand upon in another article).

The next way in which your code is tested without you explicitly writing tests is by your users and staff. After you’ve got some traction, many sectors of your application will be put to regular use, e.g. users reading articles, making credit card payments, etc. Assuming that you’ve fitted your infrastructure with sufficiently sophisticated exception notification machinery, you will be kept apprised of the more explosive technical issues afflicting your software. In addition, different problems will be brought to your attention through staff observations and customer complaints. Whereas exception notification software is able to catch the likes of missing constants, unavailable functions, mismatched data types, or unacceptable inputs, your users and staff will notice issues like typos, bad CSS causing unclickable buttons, or logic errors that cause syntactically valid but nevertheless wrong output (e.g. the software uses the wrong tax rate in a certain region). To the extent that you/your staff interact with your code through the same interface as your customers, you’ll be better equipped to spot these issues. TL;DR: Prefer “overlay” administrative areas to ones relegated to separate areas.

The last tests you are likely already carrying out without being aware of are those based on results produced by third-party analytics software. Assuming you have Google Analytics set up, it only takes a few clicks to list the ten slowest pages on your website, thereby pinpointing unoptimised load times that damage conversions. Indeed, upon my first reading of such a report, I discovered—to my embarrassment—that a major landing page had a load time of more than twenty seconds! Another Google product—Search Console—lists all pages that 404, giving you a heads-up about all your broken links. TL;DR: Take a gander at analytics tools for insights into your website’s health.
