One of the most perplexing issues Webmasters run into year after year is why search engines like Bing and Google continue to display duplicate content in their search results despite those engines’ frequent admonitions to Webmasters to reduce duplicate content. Will search engines ever eliminate duplicate content completely? No. They have no reason to do so and much reason NOT to get rid of all duplicate content. But there are problems with the way marketers look at (and for) duplicate content.

To a search engine some duplicate content is more useful than other duplicate content, but the context of the search engine’s evaluation determines the value. In other words, duplicate content may be useful in one context and not so useful in another. Also, a search engine can evaluate duplicate content more effectively when it is embedded in a usable structure.

You will have to let me slide on explaining “usable structure” for now. I’m afraid that is a huge concept that could take several articles to cover. However, a quick way to understand some facets of the “usable structure” concept would be to study Information Scent and Information Architecture articles like those Shari Thurow occasionally publishes. Here is a short promotional video where Shari explains Information Scent (I derive no financial benefit from referring you to Shari like this).

Information scent is useful to search engines in many ways. I don’t know, of course, which search engines are developing metrics around information scent, but I am sure they use it as part of their semantic and spam analysis algorithms. When you think about the natural way an informative Website guides its visitors toward all kinds of information, versus the way an affiliate Website that is only seeking conversions may guide its visitors, information scent can provide clues to which site provides the better visitor experience. Two sites offering the same information may create entirely different visitor experiences, but it doesn’t always come down to who monetizes the information the most.

Duplicate Content Is Good When It Informs and Helps the Visitor

I am not talking about disclaimers and navigational cues. In fact, disclaimers and navigational cues are abused by many aggressive Web marketers and become little more than page clutter (that link opens a new browser window to an article on the SEO Theorist blog).

No, what I am talking about is informative duplicate content that reassures visitors in some way. Examples of informative duplicate content include quotations and citations, screen captures, embedded videos, addresses, telephone numbers, lists of things or people or places, recipes, technical “how to” explanations, jokes, poems, historical anecdotes, etc.

This kind of duplicate content constitutes “communal wisdom” and it defines what we could call a “meta lexicon” of popular memes. It is from this communal wisdom that a search engine draws inferences of what is popular, possibly correct, widely accepted, desired, needed, or discredited (among other contexts). Communal wisdom can be wrong for any number of reasons, but at the end of the day it is not the search engine’s fault if it points people to bad information when that is all there is, or when the bad information outweighs the good.

Sure, the search engine bears responsibility for what it includes in its results, but remember that it is also just trying to help people find what they are looking for. We have not yet developed technologies that allow us to create Fairy Godmother Search Engines that know what is best for us.

The popularity of bad information within communal wisdom does not absolve a search engine of responsibility for making bad choices; but it does complicate the process of making useful choices. In search engine optimization (from both the engineering side and the marketing side) being useful is more important than being correct, complete, or accurate. I hope this changes some day. I realize that one of the goals of Google’s Knowledge Graph is to deliver information to searchers that is useful, correct, and accurate (and within some limits also complete), but it still has a long way to go to get there.

If search really had to deliver correct information, most marketing blogs would never appear in SERPs for popular keywords. I’m not suggesting that Google intentionally encourages people to follow bad advice; rather, correctness is a long-term goal, not something that can easily be achieved. There are whole theoretical sub-sciences built around the idea of correctness. It’s not easy to be correct.

We use duplicate content to show people that we “know” what we are talking about. Of course, this is a two-edged sword. Any idiot can cite an authority, just as any authority can cite an authority. Duplicate content does not differentiate between idiots and authorities. But if you’re an idiot who links to the right authorities you’re at least involved in the conversation, and the search engine is trying to follow the conversation.

Too Much Duplicate Content Is A Problem

If you publish 100,000 pages on your Website, then including a 3-paragraph disclaimer on every page is probably not useful. People will tune it out. On the other hand, if there is a chance that people will hold you morally or legally responsible for whatever is published on those pages, you probably need some sort of highly visible disclaimer (1-2 sentences) that links to a single, more informative page that explains your disclaiming policy in detail.

So, yes, we need to replicate some information across 100,000 pages, but not much. Imagine that you need to find your site’s legal policy in Google. You type in a few words like “Honest Joe’s Website legal policy”. Up come 1,000 listings of forum discussions. Oh yes, the policy is embedded in the footer on every page. You’re satisfied. But the casual visitor will be confused by those results. What you want, for that rare individual who actually looks at your TOS page, is for all your pages to indicate (in a highly visible, user-friendly way) that the legal policy is located HERE, on a page that is unambiguously dedicated to that purpose. Everyone wins.

While that may seem like a small win, suppose you face a legal challenge that makes the news. Suddenly everyone is linking to your legal policy page as they discuss the situation. That works better than if they all link to some random page in the forum.

Out on the Web, you also have to deal with too much duplicate content. For example, suppose you want to find who originally published a picture of your favorite celebrity. Good luck figuring that out because any popular celebrity image will quickly be incorporated into fansite galleries, commercial image galleries, and replicated on hundreds or thousands of blogs. A lot of people will go to considerable lengths to remove or obscure any copyright information; some sites embed their own logo, URL, or copyright notice in the images (this is illegal but it is a common practice).

How is a search engine to know where the original image was published? It may know, but it may also anticipate your desire to find creative uses of that image. So you’ll see a lot of search results for “[CELEBRITY NAME] photo 2014 [SOME] awards”. You only need one result (the correct one) but the search engine doesn’t know which one that is (especially given how ambiguous your query is; it reveals nothing of your intent).

Searchers are going to type in ambiguous queries and marketers are going to target those queries, so we continue to see a LOT of unnecessary duplicate content, simply because the Searchable Web Ecosystem itself fails at the point where the least effort produces a barely adequate result (that is the Wikipedia Principle at work).

Not Enough Duplicate Content Is a Problem

Suppose you are enforcing your intellectual property rights and you have to search the Web for people who are using your content. Search engines will show you only so many Websites. If 10,000 sites are using your content, you’re in trouble. At best you can capture only 1,000 duplicate results at a time. You have a lot of work ahead of you. In fact, the search engine fails you as soon as it indexes the 1,001st instance of any duplicate content.

To make your search experience easier, it will eliminate as many duplicates from your results as it can, probably showing you far fewer than 1,000 listings.
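To make the idea concrete, here is a minimal sketch (in Python, with made-up snippets and an arbitrary similarity threshold of my own choosing) of the kind of near-duplicate collapsing a results page might perform. Real search engines use far more sophisticated, undisclosed methods; this only illustrates why you would see far fewer listings than there are copies of your content.

```python
import string

def shingles(text, size=4):
    """Break text into overlapping word sequences ("shingles"), ignoring punctuation."""
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    return {tuple(words[i:i + size]) for i in range(max(1, len(words) - size + 1))}

def jaccard(a, b):
    """Overlap between two shingle sets: 0.0 (nothing shared) to 1.0 (identical)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def collapse_duplicates(snippets, threshold=0.8):
    """Keep a snippet only if it is not too similar to one already kept."""
    kept = []
    for snippet in snippets:
        current = shingles(snippet)
        if all(jaccard(current, shingles(k)) < threshold for k in kept):
            kept.append(snippet)
    return kept

results = [
    "Honest Joe's legal policy applies to every page of this forum.",
    "Honest Joe's legal policy applies to every page of this forum!",  # near duplicate
    "A discussion of Honest Joe's warranty claims and return process.",
]
print(collapse_duplicates(results))  # the near duplicate is filtered out
```

If 10,000 copies of your article all collapse into a handful of “distinct” listings under a filter like this, the search engine has done the casual searcher a favor and done you, the rights holder, no favor at all.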

In another example, you may be charged with researching how widespread a meme has become. To do that you turn to your favorite search engine and type in snippets from the meme. You find hundreds, even thousands, of Websites and dutifully note them in a log file. But how do you know whether you found them all or only a fraction of them?

Do search engines really know how many sites out there duplicate a meme? Not to 100%, but they know more than they share. As an example, we can point to the frustration that many marketers have with Google’s link reports in Webmaster Tools. Google rarely shows you all the links it knows about. The extent of a search engine’s knowledge of the Web exceeds what is practical to share with its users.

The Wrong Duplicate Content Is a Problem

If you need confirmation of a meme’s source, the source itself may be buried amid all the second- and third-generation sites that share the meme. In this context you need reliable citations, but the search engine shows you the wrong citations because of the popularity of the secondary citations (which may be due to links, to freshness, or to technical issues with the first-generation citations).

Many times through the years, as I have written articles for SEO Theory, I have referred to past knowledge in the SEO industry that was once widely known (often an article written by Danny Sullivan). Finding such a past article has proven futile on many occasions. The content may not have been written as I remember it, or I may be searching the wrong Website, or I may only be looking for a quote from Danny. There are other people whose comments I occasionally cannot find even though they were once widely cited: Matt Cutts once said to someone at a conference, “I see you have 50 Websites” (or something like that). Can you find the source?

I mentioned this incident in February 2011, but it occurred years before that. This forum discussion is NOT the primary source. Nor is Matt’s blog, where he discusses the “50 Websites” conversation from his point of view. The primary source is/was a conference report published on a blog that no longer exists (as it was then).

Why is it important to find the original source of a meme? Because memes change as they are passed on from Website to Website. People take what is important to them, interpret it, and share the information in their own way. It’s very much like playing the children’s game of “telephone” (before telephones it was called “the whispering game”), where you line the kids up side by side, whisper a secret message into the ear of one child, and they pass it on down the line, ear by ear. What comes out of the other end is never the original message.

Bloggers (and professional journalists) all too often go with the most popular meme when citing information. I have sometimes spent hours, even days, looking for original sources before giving up and either citing a secondary source or (as in my article from 2011) citing no source at all.

In All Cases, What We Want is Relative Uniqueness

Relative Uniqueness just means that whatever you are searching for is:

- Unique enough to be findable
- Duplicated enough to be verifiable

Sure, if you jump on an idea soon enough you will be the first person to cite a new source, and there is nothing wrong with that. But you want to be able to find your citation in the future as more citations join it on the Web.

Also, if you are the source of information and you want to know what the impact of your disclosure may be, it is better to find relatively unique citations than to find completely duplicate citations. Sure, it’s great to know that 1,000 bloggers are quoting (and linking to!) your article; but what is the impact that you are having? You need to find out what people actually think and do in response to your information. Those relatively unique responses will be far fewer in number than the quick, easy citations.

Definition: Relative Uniqueness

A search result includes relatively unique listings if and only if each relatively unique listing differentiates itself from all other listings while providing the same verifiable information.

Alternatively, a Web document is relatively unique if and only if it provides substantially qualifying context for commonly published information. In some contexts we call this “added value”, but Relative Uniqueness is not simply or merely added value.

Added value may create something other than Relative Uniqueness.

How Do You Measure Relative Uniqueness?

The acid test I have used for years is to look at how many listings are returned for a query. If there are no more than a few dozen and they are not all duplicates of each other, I assume I am probably looking at relatively unique listings.
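If you wanted to automate the acid test, here is a minimal sketch (in Python; the listing limit and similarity threshold are arbitrary stand-ins of my own, not anything a search engine publishes) that checks whether a set of result titles or snippets is small enough and not made up entirely of near-duplicates.

```python
from difflib import SequenceMatcher

def passes_acid_test(listings, max_listings=50, dup_threshold=0.9):
    """Rough version of the acid test: few enough listings ("a few dozen")
    AND more than one genuinely different listing among them.
    Both thresholds are arbitrary illustrations, not published metrics."""
    if not listings or len(listings) > max_listings:
        return False
    distinct = [listings[0]]
    for text in listings[1:]:
        # keep a listing only if it is not a near-duplicate of one already kept
        if all(SequenceMatcher(None, text.lower(), seen.lower()).ratio() < dup_threshold
               for seen in distinct):
            distinct.append(text)
    return len(distinct) > 1

result_titles = [
    "Relative uniqueness in search results explained",
    "Relative uniqueness in search results explained",   # scraped copy
    "How relative uniqueness affects duplicate content filtering",
]
print(passes_acid_test(result_titles))  # True: few listings, and not all duplicates
```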

It’s not easy to provide examples of queries that include relatively unique results, first because the metric is arbitrary and second because the search results change as more Websites publish similar information.

Relative uniqueness may be achievable by adding your own insightful commentary around a quote, but only until other people provide the same or similar commentary. Believe it or not, the probability that a lot of other people will share your views increases as more people express their own opinions on any given topic. That is just the way it is.

My acid test quickly breaks down as search results stretch toward the 1,000-listing mark. You often find many duplicate listings in a fully populated search result, but for reasons that are not always clear and obvious the search engine includes duplicate content. It would be nice if we had the option of telling Bing and Google “exclude all exact duplicates” rather than waiting for them to decide to filter the results for us.

For a search engine, the degree of Relative Uniqueness may be determined by “fluff” content; in most cases this is the SEO way of adding value: throw more words on the page so that it becomes relatively unique, even though it is not really more useful.

For the user, however, Relative Uniqueness is determined more by the difference in usefulness. In 1987 then-Ph.D. student Nirit Kadmon proposed a theory of uniqueness for definites in linguistic theory. I have only read bits and pieces of the dissertation, but Kadmon published the theory in 1990 and received a lot of citations for it (as well as credit for other work after that). In this abstract Kadmon writes:

In quantified sentences, a definite may be unique relative to another element. For example, the use of the definite pronoun in ‘Every chess set comes with a spare pawn. It is taped to the top of the box’ implies that there is a unique spare pawn per chess set. It is argued that uniqueness depends on the configuration of quantifiers and their scope; roughly, a definite A is unique relative to an element B if it has narrow scope relative to B. A general theory of uniqueness is proposed, and applied to a variety of examples.

How well does Kadmon’s work (cited in many linguistic papers) inform information retrieval theory? That’s hard to say, since linguists have been developing what we can call Uniqueness Theory for over 100 years. I don’t know enough about Information Retrieval Theory to know how much it derives from Linguistic Theory, but IR engineers definitely wrangle with Relative Uniqueness. For example, this 2008 Google patent uses a measure of Relative Uniqueness as part of a process for validating search results (the method uses density analysis on document terms to measure the degree of uniqueness).
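For illustration only, here is a rough sketch (in Python, with toy data) of the general flavor of term-density analysis: score a document by how densely it uses terms that are rare in the surrounding collection. This is NOT a reconstruction of the patent’s actual method; it is just one naive way to express “how much of this page is unusual relative to everything else.”

```python
import math
from collections import Counter

def term_rarity_score(document, corpus):
    """Average rarity (inverse document frequency) of a document's terms,
    weighted by how densely each term appears in that document.
    A naive illustration only; not any search engine's actual metric."""
    doc_terms = document.lower().split()
    if not doc_terms:
        return 0.0
    counts = Counter(doc_terms)
    n_docs = len(corpus)
    score = 0.0
    for term, count in counts.items():
        docs_with_term = sum(1 for d in corpus if term in d.lower().split())
        idf = math.log((n_docs + 1) / (docs_with_term + 1))  # rarer terms score higher
        density = count / len(doc_terms)
        score += density * idf
    return score

corpus = [
    "cheap widgets for sale buy widgets now",
    "buy cheap widgets online widgets sale",
    "a field guide to hand-forged copper widgets of the 1890s",
]
for page in corpus:
    print(round(term_rarity_score(page, corpus), 3), page)
# The two near-duplicate sales pages score lower than the page built on unusual terms.
```

Run against the toy corpus, the two boilerplate sales pages score lower than the page built around unusual terms, which is the basic intuition behind treating term densities as one signal of uniqueness.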

Relative Uniqueness is a direct product of the effort you put into composing your content. Whether you are writing a sentence or an entire Website, you need to create Relative Uniqueness that is useful, memorable, and credible. All of these attributes can be forged and manipulated, but the composite value created by these and other attributes of Website content (as well as by citations pointing to that content) contributes to the added value we create, which helps search engines distinguish between what we are saying and doing and what everyone else says and does.

You cannot create content that is so distinct it is completely unique (on the Web) because other people will copy what you do, attempt to create something very similar, and incorporate what you do into their own content. Your uniqueness lasts a short time (if it is useful). The composite added value you create is much more difficult to reproduce and therefore is less likely to be sublimated into “duplicate content” that is neither useful nor informative.

On the Web you cannot achieve complete uniqueness except through self-imposed oblivion; you can (and should) manage relative uniqueness to maximize your visibility in search.