Many websites serve the exact same (or very similar) content at several different URLs. This is known as “duplicate content” in the SEO world, and it’s something you want to avoid.
What’s the problem with duplicate content? To start with, a website containing the exact same text repeated over and over on many different pages comes across as spammy, and search engines could penalise it for this reason.
Moreover, duplicate content causes all sorts of technical problems for search engines.
Firstly, search engine crawlers, even at the best of times, usually don’t index every page on a website; instead, they make informed decisions about what content is worth keeping around and what content should be chucked aside. When the crawler encounters duplicate content, it (sometimes) takes this as a signal to ignore all but one of the duplicates. But how does the crawler pick which version to keep? That decision isn’t easy for a machine, and the website owner risks the algorithm indexing some odd-looking, confusing version of the web page instead of the perfectly gorgeous page intended for human readers. Instead of putting their best foot forwards, the webmaster with duplicate content is instead presenting a smelly sock riddled with holes.
Secondly, if more than one version of the duplicate content does get indexed, search engines don’t know how to apportion ranking authority across the duplicates. The authority may end up split between them and diluted so far that the webmaster can no longer compete with rival businesses who have focused theirs on a few duplicate-free URLs.
Duplicate content issues crop up if your website has two or more distinct URLs that point to the same page (or wad of content). This sounds simple, but the subtleties of this idea can get lost on casual web users. Specifically, you should worry about duplicate content if any of the following apply to you:
- Your website exists in cloned form across multiple domains and subdomains. The most common case of this happening is when your website responds to and serves up the same content for both “www.mywebsite.com” and “mywebsite.com”. This essentially creates duplicates for every page on the entire domain—the only difference being that one version has “www” in front and the other doesn’t.
- Your website has multiple URLs for accessing the same content. For instance, “/t-shirts” and “/category/t-shirts” might point to the same page (i.e., to the one that displays the t-shirts you offer for sale). This problem often occurs on the home page, which, on old-fashioned websites, can be accessed variously with “/index.php”, “/home”, and the root URL, “/”.
- You append parameters to URLs, and some permutations of these parameters lead to the same or very similar content as other permutations. For instance, sort parameters, such as “/t-shirts?sort=price_asc”, lead to the exact same content, only rearranged. Sometimes filter parameters lead to duplicates too, such as when “/t-shirts?review_score_greater_than=3” (which filters a page to only show items with review scores over 3) acts on a data set in which every item has a review score of at least 4 anyway, meaning that the filtered page is exactly the same as the unfiltered original. Contrary to what you might expect, search engines don’t automatically strip these parameters out.
- Your website responds with an HTTP status 200 to both “/products” and the equivalent with a trailing slash, “/products/”. Surprisingly, Google treats both as distinct URLs, explicitly stating as much in their official webmaster literature.1 They mention that this practice is “often OK” but not “perfectly optimal”. A careful webmaster should ensure that their website only responds to one of these slash possibilities.
- Your website shows the same content for both uppercase and lowercase versions of a URL (e.g., “/taxons/t-shirts” and “/taxons/T-SHIRTS” both lead to the same content).
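To make the list above concrete, here is a minimal Python sketch that folds the duplicate-prone variants of a URL into a single form. Everything specific in it is an assumption for illustration: the function name normalise, the choice of the “www” host as the preferred one, the set of home page aliases, and the idea that “sort” is the only cosmetic parameter. It also assumes it is given absolute URLs and that your paths are not case-sensitive.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical settings: adjust these to your own site.
COSMETIC_PARAMS = {"sort"}              # parameters that only rearrange content
HOME_ALIASES = {"/index.php", "/home"}  # paths that duplicate the root URL


def normalise(url: str) -> str:
    """Collapse the common duplicate variants of an absolute URL into one form."""
    parts = urlsplit(url)

    host = parts.netloc.lower()
    if not host.startswith("www."):
        host = "www." + host            # assume the "www" variant is preferred

    path = parts.path.lower()           # fold uppercase and lowercase variants
    if path in HOME_ALIASES:
        path = "/"
    if len(path) > 1:
        path = path.rstrip("/")         # drop the trailing slash
    if not path:
        path = "/"

    # Keep only the parameters that actually change the content.
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in COSMETIC_PARAMS]
    return urlunsplit((parts.scheme, host, path, urlencode(kept), ""))


print(normalise("http://mywebsite.com/T-Shirts/?sort=price_asc"))
print(normalise("http://www.mywebsite.com/index.php"))
```

If two URLs normalise to the same string but your server answers both with a 200, you probably have one of the duplicate content problems described above.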
Just to complicate issues even further, there are cases when you specifically want the duplicate (or near-duplicate) pages to appear within Google Search results. Most often, this is with translated/regionalised pages that you’d like to put forward as local organic entry points to your website, to be shown or hidden in the results depending on where the searcher is based and in what language they are searching. In this specialised case, please ignore the ensuing tips for avoiding duplication and instead look to the hreflang directive, as elaborated upon in the chapter on SEO internationalisation.
Now, back to banishing undesired duplicate content.
Redirect Instead of Responding
This should be your first line of defence against duplicate content, owing to its ease of use and its low impact on SEO link juice when compared to the nuclear option, the robots.txt directive (which we’ll see later).
Let’s say you want your customers to be able to visit both “www.mywebsite.com” and “mywebsite.com”. Instead of displaying content for (and responding 200 to) both possibilities, you should tell your server’s router to respond with an HTTP 301 response code (permanent redirect) on one of the two possibilities, redirecting all those requests to your now officially sanctioned choice. The web page to which the redirect points will receive at least 90% of the redirected link’s ranking power.
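The host-level redirect described above can be sketched as a small decision function in Python. This is an illustration under stated assumptions rather than a drop-in for any particular server: the names redirect_for, CANONICAL_HOST and ALIAS_HOSTS are hypothetical, and the sketch assumes the host arrives without a port number.

```python
from typing import Optional, Tuple

# Hypothetical configuration: the one host you have chosen to serve content on,
# and the alias hosts that should be redirected to it.
CANONICAL_HOST = "www.mywebsite.com"
ALIAS_HOSTS = {"mywebsite.com"}


def redirect_for(host: str, path: str, scheme: str = "http") -> Optional[Tuple[int, str]]:
    """Return (301, target URL) for an alias host, or None to serve the page normally."""
    if host.lower() in ALIAS_HOSTS:
        # Keep the path intact so that deep links land on the equivalent page.
        return 301, f"{scheme}://{CANONICAL_HOST}{path}"
    return None


print(redirect_for("mywebsite.com", "/t-shirts"))
print(redirect_for("www.mywebsite.com", "/t-shirts"))
```

In practice this rule usually lives in the web server or framework configuration, but the logic is the same: one host answers with content, and every other host answers with a 301 pointing at it.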
While we’re on the topic of redirects, let’s highlight another of their purposes in SEO: preserving link juice on retired or edited URLs. As you know, any given page on your website can potentially accumulate reputation, links, and Google ranking in organic searches. But whenever you edit its URL name (or drop the page altogether), you reset the counters to zero, throwing away every last drop of the SEO juice earned by that URL. In these circumstances, the 301 response code comes to the rescue. Its effect is a memo to Googlebot saying, “Hey Googlebot, the old page you’re looking for has relocated. Please update your index and transfer all my SEO ranking accordingly. Kthxbai.” This point is important. Indeed, I would go so far as to say that editing a URL name without placing a 301 permanent redirect on the old one is about as close to SEO suicide as I can imagine. Yet it happens all the time.
It is tricky to ensure that redirects are always created after editing old URLs. Not only do you need to watch out for programmers who edit the URL structures, but also admin staff who edit links in the CMS and regular users who change their own content, as they so often tend to do. To meet these challenges at a systematic scale, consider implementing non-changeable permalinks attached to each piece of database content, thereby freezing the original URL and immunizing it against change. If this isn’t possible (say because it is necessary for your customers to change URLs), consider a database-backed system that automatically remembers your old URLs and responds to requests for them by 301-ing them forward to their latest incarnation.
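A system that remembers old URLs might look something like the following Python sketch. It is only one possible design: the class name RedirectStore and its methods are hypothetical, and a plain dictionary stands in for the database table you would use in production.

```python
from typing import Dict, Optional


class RedirectStore:
    """Remembers every renamed path; a dict stands in for a database table."""

    def __init__(self) -> None:
        self._moved: Dict[str, str] = {}

    def rename(self, old_path: str, new_path: str) -> None:
        """Record that the content at old_path now lives at new_path."""
        # new_path is live again, so it must not redirect anywhere itself.
        self._moved.pop(new_path, None)
        self._moved[old_path] = new_path

    def resolve(self, path: str) -> Optional[str]:
        """Return the latest path to 301 to, or None if path never moved."""
        seen = {path}
        current = path
        while current in self._moved:
            current = self._moved[current]
            if current in seen:          # defensive guard against a redirect loop
                return None
            seen.add(current)
        return current if current != path else None


store = RedirectStore()
store.rename("/land-law", "/land-law-notes")
store.rename("/land-law-notes", "/land-law-study-notes")
print(store.resolve("/land-law"))
```

Following the chain to its end means that a URL renamed several times still redirects in a single hop to its latest incarnation, rather than bouncing the crawler through every intermediate name.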
Canonicalize Duplicate URLs
With canonicalization, unlike with redirects, your server continues to respond (with HTTP status code 200) to the duplicate URLs. The difference is that the duplicate pages now carry a snippet of HTML whose purpose is to explain to Google which page is the “canonical” one (i.e., the definitive one that should show up in their search results and accrue all the search juice).
Here’s an example of canonicalization in use, as you might see on a page “http://www.mywebsite.com/index.php” which duplicates the root URL.
<link href="http://www.mywebsite.com" rel="canonical">
Canonicalization is useful when you want to keep distinct versions of your content for other people to bookmark or link to (e.g., you have separate printer-optimised versions of your pages).
Just as with redirects, canonicalization preserves the link juice of the duplicate pages.
The href attribute of the canonical URL tag accepts both fully qualified absolute URLs (which include “http://”) and relative paths (which don’t). I recommend sticking to fully qualified absolute URLs, since poorly formed paths can potentially lead to insidious, damaging errors (see Google’s warning on this topic2). For example, if you write “mywebsite.com/t-shirts”, it will attribute your SEO juice to “http://www.mywebsite.com/mywebsite.com/t-shirts”, a non-existent page. This happens because, without the leading “http://”, the value is read as a relative path and resolved against the address of the current page.
By the way, there’s a potential gotcha when using absolute URLs: You need to specify the proper protocol (“https://…” vs “http://…”).
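You can reproduce the relative path pitfall with Python’s standard urljoin function, which applies the usual rules for resolving a reference against a base URL (I am assuming here that crawlers resolve canonical hrefs in the same way). The helper is_absolute_url is a hypothetical name for a check you might run over your templates before deploying.

```python
from urllib.parse import urljoin, urlsplit

page = "http://www.mywebsite.com/index.php"

# Without a scheme, the href is treated as a path relative to the current page.
wrong = urljoin(page, "mywebsite.com/t-shirts")

# A leading "/" (or a full URL) resolves to the page you meant.
right = urljoin(page, "/t-shirts")


def is_absolute_url(href: str, expected_scheme: str = "https") -> bool:
    """True if href names both the expected protocol and a host."""
    parts = urlsplit(href)
    return parts.scheme == expected_scheme and bool(parts.netloc)


print(wrong)
print(right)
print(is_absolute_url("mywebsite.com/t-shirts"))
```

Passing the expected protocol explicitly also covers the second gotcha: a canonical URL written with “http://” on a site served over “https://” fails the check.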
Write a Robots.txt or Meta Robots Tag
The robots.txt file, which should be retrievable at the URL “/robots.txt” of your website, is a set of directives asking web robots, in particular search engine crawlers, not to crawl specified pages on your website. It can thus be used to stop crawlers from indexing duplicate URLs.
Despite this ability, I would nevertheless recommend against employing the robots.txt for these purposes, and I instead advise you to reach for redirects or canonical URLs (or the meta robots tag…).
Why so? Duplicate URLs sometimes accrue decent SEO juice (e.g., the mobile-friendly version of your website [“m.mywebsite.com”]). Ideally, you want to channel this SEO juice into your main page (e.g., “www.mywebsite.com”), so as not to lose any authority. The problem with listing URLs in the robots.txt file is that you lose ALL their SEO juice. As Adam Audette puts it, it is a “Pagerank dead end”, “a sledgehammer”.3 This waste doesn’t happen with other duplicate-content mitigation mechanisms such as redirects and canonicalization, which are both much more effective at preserving SEO juice.
All that said, there are good use-cases for the robots directives. Principal amongst these is removing low quality non-duplicate content from Google’s index. At first it might seem counterintuitive to want to reduce your website’s crawlable surface area, but for many webmasters this approach pays dividends. Having a robots.txt restriction stops crappy entry points to your website from appearing in your search engine results. For example, at my company, Oxbridge Notes, we have an (admittedly crappy) URL design whereby every product page links to a nested page for sending questions to the seller of that particular product. That gives us URLs like “/land-law-notes/buyer_inquiries”, “/contract-law-notes/buyer_inquiries”, and so on, thousands of times. I didn’t want these content-dry generic question-form pages littering up my search results and competing with my optimised landing pages (the parent product pages, such as “/land-law-notes”), so, to this end, I updated my robots.txt file with a directive not to crawl these “buyer_inquiries” pages.
It’s true that I probably shouldn’t have designed the website with this crappy structure, but, at this point, there’s no point in fixing something that isn’t too badly broken, especially when a quick entry to the robots.txt file dissolves away the issue.
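A rule for pages like these might look like the robots.txt text embedded in the Python sketch below. The exact rule is my assumption for illustration, and so is the matcher, which only imitates the “*” and “$” wildcards that Google documents for robots.txt. It deliberately ignores user-agent groups and Allow rules, so treat it as a way to sanity-check a pattern, not as a full parser.

```python
import re

# Hypothetical robots.txt content blocking every nested buyer_inquiries page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /*/buyer_inquiries
"""


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the path start."""
    anchored = pattern.endswith("$")     # a trailing "$" means "path ends here"
    if anchored:
        pattern = pattern[:-1]
    # "*" matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """True if any Disallow rule matches path (simplified: no groups, no Allow)."""
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            if pattern_to_regex(value.strip()).match(path):
                return True
    return False


print(is_disallowed("/land-law-notes/buyer_inquiries"))
print(is_disallowed("/land-law-notes"))
```

Checking a handful of real paths this way before publishing the file helps you avoid accidentally blocking the parent product pages along with the question forms.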
Another good use-case of robots.txt is privacy. In the country where I reside, every website owner is required to place their contact details on their website on a page known as an impressum. As an individual running a web business with remote workers, I don’t have a dedicated office. In this case, the law requires me to place my home address on the impressum page. This makes me uncomfortable about both my privacy and my safety. Not too long ago, a nasty mugger attacked me and threatened that if I reported him to the police (which I subsequently did), he would wait for me outside my apartment (which, mercifully, he subsequently did not). This threat had teeth though, because I knew that all it would take for such a malicious character to find out where I lived was to google “jack kinsella address”. (He knew my name from the ID card he stole from me.) The robots.txt could have, perhaps, reduced this risk, say if I instructed search engines not to index the page containing my address.
Robots.txt can also preserve privacy in less dramatic circumstances. For example, some webmasters leverage it to hide content and features not yet officially launched.
If you really want to restrict crawler access to content with the robots directives, I would advise you not to rely on the classic robots.txt file, but instead to rely upon the newer—and in my opinion superior—meta robots tag.
The big problem with the robots.txt file is that it doesn’t quite do what it suggests it does; it only instructs search engines not to visit the URL. Surprisingly, search engines will nevertheless index the existence of that URL and display ugly, stripped-down, contentless ghost entries as search results. You probably don’t want that.
These shallow results present a poor image and poor user experience to potential customers on Google, and, moreover, they might leak information to hackers or competitors—information that you’d otherwise wish to keep private, such as URL entry points or revealing parameters nested within URLs.
In order to circumvent these limitations, a webmaster can insert meta robots tags within the HTML head of relevant pages, eschewing the need for entries in the robots.txt file. These tags accept multiple arguments, which allow for more flexibility in directing Googlebot how to treat each entry.
<meta name="robots" content="noindex, follow">
The above example shows a meta robots tag instructing Googlebot not to index a page, but to “follow” (i.e., attribute link juice to) any links on that blocked page, meaning that external links to this “roboted” page won’t be completely wasted as they would be if we had used a robots.txt entry. (Thus, this “follow” directive essentially undoes the major link juice disadvantage of regular robots.txt entries.)
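If you want to audit which of your pages carry which directives, a short script using Python’s built-in HTML parser will do. This is a rough sketch: the names RobotsMetaFinder and robots_directives are hypothetical, and it reads the directives from the tag’s content attribute, which is where the meta robots tag keeps them.

```python
from html.parser import HTMLParser
from typing import Set


class RobotsMetaFinder(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Directives are comma-separated and case-insensitive.
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def robots_directives(html: str) -> Set[str]:
    """Return the set of robots directives declared in the given HTML."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.directives


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(sorted(robots_directives(page)))
```

Running a check like this over your rendered pages is a cheap way to catch a “noindex” that has crept onto a landing page you actually want in the search results.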
For a full description of the various arguments accepted by meta robots tags, consult Google’s guide on the topic.4 (Interesting example: The instruction “noarchive” directs Googlebot not to include a page within Google’s cache, a nice boon for privacy.)
Last point: The meta robots tag is a property of HTML. What if you want to restrict the indexing of non-HTML content such as PDF files? The HTTP X-Robots-Tag header fulfils this role and accepts the same arguments as the meta robots tag, affording you the exact same feature set for content in alternative formats.5
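As a sketch of how that might look in application code, the hypothetical Python helper below builds the response headers for a file and adds the X-Robots-Tag header only when the file should stay out of the index. The function name response_headers and its parameters are my own assumptions; how you actually attach headers depends on your server or framework.

```python
from typing import Dict


def response_headers(content_type: str, indexable: bool) -> Dict[str, str]:
    """Build HTTP response headers, asking crawlers not to index if required."""
    headers = {"Content-Type": content_type}
    if not indexable:
        # Same argument vocabulary as the meta robots tag.
        headers["X-Robots-Tag"] = "noindex"
    return headers


print(response_headers("application/pdf", indexable=False))
```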