The dead still walk. Their lifeforce depleted, they are denied the grave's final rest, doomed to forever roam the dimmest depths of the Google index, shrouded in silence and darkness, cursing their maker with duplicate content and Google Panda problems.

Warning: This is a fairly technical article for SEO wonks. For anyone else it probably won't make much sense. You will need some grasp of duplicate content issues, dynamic websites, Google spider behaviour and the Panda update to understand why any of this matters. I normally try to write for a wider audience - if you are interested in my other thoughts about web marketing then please feel free to browse my other articles.

Say Hello to my Little Friend

Magento uses a 'p' URL parameter to paginate the product categories. If you run a widget shop with a category for blue widgets, then page 1 will be found at bluewidgets.html, page 2 at bluewidgets.html?p=2, page 4 at bluewidgets.html?p=4 and so on.

One curious thing about how Magento handles this p parameter is that it's happy with absolutely any positive integer value. There might only be 7 pages of products in a category, but typing in p=8 will still serve a page, as will p=60 and p=1337. Each of these pages displays the same content as the page with the last valid p value.
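In effect, Magento clamps the requested page number to the last real page. Here's a toy model of that behaviour in Python - a sketch of the logic, not Magento's actual (PHP) pager code:

```python
def effective_page(requested_p: int, last_valid_page: int) -> int:
    """Toy model of Magento's handling of the 'p' parameter:
    any value beyond the last real page serves that last page's content."""
    return min(requested_p, last_valid_page)

# A category with 7 pages of products:
effective_page(4, 7)     # a valid page: page 4 is served
effective_page(8, 7)     # too high: page 7's content is served
effective_page(1337, 7)  # still page 7's content
```

The key point is that every out-of-range request resolves to the same page, while the URL itself stays unique.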

In some ways this is very desirable behaviour. The two main reasons you'd expect Magento to receive a p value that's too high are a typo, or a drop in the number of pages in that category. This is true whether the visitor clicks a hyperlink, types the URL into the address bar, or pastes it in. In any of these cases, it's a much better user experience to show the product category the visitor obviously wants than to show some kind of ugly error message.

So what's the problem? It's that these pages tell search engines that they're unique and they want to be indexed. These zombie pages are all near-duplicates, but they do have unique title tags, and unique canonical URL tags. They also have the same “INDEX, FOLLOW” meta robots instruction as you find on category pages with a valid p parameter value.
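For illustration, the head of a zombie page like bluewidgets.html?p=8 contains something along these lines (the exact markup varies by theme and Magento version - the point is the self-referencing canonical tag and the index directive):

```html
<title>Blue Widgets - Page 8</title>
<meta name="robots" content="INDEX,FOLLOW" />
<link rel="canonical" href="http://www.example.com/bluewidgets.html?p=8" />
```

Everything here tells the search engine: this is a real, unique page - please index it.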

Birth of a Zombie

With Magento serving infinite versions of the same duplicate category pages, does this mean that Google is just going to keep crawling and indexing them forever? Well, no. For most practical purposes, these zombie pages don't really exist on the website. They aren't in the sitemap. They aren't linked to from anywhere else. Search engine spiders don't index these URLs because they're never prompted to visit them.

They do make it into the index when the number of products in a category is reduced. The search engine spiders keep coming back to the URLs for the category pages that are no longer in use, and because Magento is still serving a page at each of those URLs, those pages stay in the index.

Let's go back to our widget shop example. Let's say blue widgets were hot sellers in 2012 but red widgets are what's in now, and the widget shop has changed its product offering accordingly. It's gone from seven pages of blue widgets to two. When the spider comes back to crawl bluewidgets.html?p=7, it's now being shown a near-duplicate copy of bluewidgets.html?p=2, so that's what gets crawled and indexed.

While Google will keep indexing these pages, it is also very good at spotting that they are duplicate junk. So though these pages do get indexed, they almost never show in search results. This is a good thing for Google to do to avoid directing searchers to crap. But it also means that site owners can pay very close attention to how their site appears on Google and still go a very long time without realising they have a zombie problem. A site:yourdomainhere.com search performed on a Magento website of any size won't reveal them even if you painstakingly pick through every result on every page one by one, because a query like this usually only lets you peruse the first 400 results or so.

How did I discover these zombie pages? Entirely by accident: I was searching for a string of text on my domain to investigate a completely different issue.

How Big a Problem is This?

Zombie pages might mean you have one or two duplicate pages indexed or they might mean you have hundreds. It really depends how drastically your category pages have changed in number.

These zombie pages are not the only way that Magento categories cause duplicate content issues. If URLs containing mode, dir, SID, or other parameters are being indexed then you have a duplicate content problem far larger than these zombie pages. Others have covered this well already.

Beware Skullduggery

Malicious pricks could exploit this weakness to launch zombie attacks on competing Magento sites. It wouldn't even be hard to do.

All a zombie attack requires is building do-follow backlinks to zombie URLs to coax them into the index. A little bit of posting on a busy, regularly crawled, poorly moderated forum could easily get 50 or 100 pages of duplicate content indexed.

The diabolical thing about an attack like this is how likely it is to go unnoticed. Zombies are silent. No alarm bell rings to alert you that they've risen. Negative SEO is a huge topic in web marketing right now, but all the discussion is about building bad links to create a Google Penguin penalty. I see nobody really talking about exploiting the peculiarities of dynamic websites to create Google Panda issues. So even if the site owner is alert to negative SEO, spots these links in the CSV file downloaded from Webmaster Tools, and decides they look suspicious, their likely response to suspicious-looking backlinks is a removal request and disavowal process. This does nothing to clean zombie pages out of the index.

Zombie Patrol

How do you know if you have zombies?

Zombies live in Google's supplemental index, so that's where we look. Your two best zombie-hunting tools are a site:yourdomainhere.com/productcategory.html search, and the “repeat the search with the omitted results included” link at the end of the search results. That link shows you some of the pages in the supplemental index - the duplicates and junk pages most searchers don't want to see most of the time. Because we're zombie hunting, duplicate content is exactly what we want to see.

The site:yourdomainhere.com/productcategory.html search shows only results whose URLs begin with that path, which checks a single category for zombie pages. You can tell which results are zombies by reading the URL: zombie pages have a p parameter higher than the last live product page currently on your site. To comprehensively check your whole site you will need to check every category this way.
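If you'd rather not eyeball every category by hand, the probing can be scripted. This is a hedged sketch, not a tested tool: it requests p values past the last real page and flags any URL that serves the same content as the final valid page. The fetch function is supplied by you (e.g. a wrapper around requests.get), so the logic itself stays self-contained:

```python
import hashlib

def find_zombies(category_url, last_valid_page, fetch, probe_up_to=3):
    """Probe p values past the last real page of a category and return
    the URLs whose content matches the last valid page (likely zombies).

    fetch(url) -> str is caller-supplied, e.g. lambda u: requests.get(u).text
    """
    def page_url(p):
        # Page 1 is the bare category URL; later pages carry ?p=N.
        return category_url if p == 1 else f"{category_url}?p={p}"

    def digest(p):
        # Hash the fetched body so we compare content, not URLs.
        return hashlib.sha256(fetch(page_url(p)).encode()).hexdigest()

    last = digest(last_valid_page)
    return [
        page_url(p)
        for p in range(last_valid_page + 1, last_valid_page + 1 + probe_up_to)
        if digest(p) == last
    ]
```

One caveat: as noted above, zombie pages carry unique title and canonical tags, so in practice you'd want fetch to return just the product-listing portion of the page (or strip the head) before hashing - comparing the raw HTML byte-for-byte would miss them.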

Kill Zombies With Your Bare Hands

How did I get my client's zombie pages out of the index? I wrote a disallow instruction in robots.txt for each individual zombie URL, and then used Webmaster Tools to remove the pages from the index. Once the zombie pages had been deleted, I went back and took the disallow instructions out of robots.txt.
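Concretely, the stop-gap robots.txt looked along these lines - URLs illustrative, one Disallow line per zombie URL:

```text
User-agent: *
# One line per zombie URL; removed again once the pages left the index
Disallow: /bluewidgets.html?p=3
Disallow: /bluewidgets.html?p=4
Disallow: /bluewidgets.html?p=5
```

Googlebot treats these patterns as prefix matches that include the query string, which is why targeting individual p values like this works.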

Is this the best solution? Hardly. It doesn't fix the problem at root cause. It's a dirty hack. It's held together with tissue paper and chewing gum and half a muttered prayer. It's tedious, time-consuming work that can only be done properly by someone with an eye for what they're looking at. Worse still, it does nothing to stop more zombies rising. No alarm sounds for new zombies, so you'll still have to patrol regularly. You might keep the disallow instructions in your robots.txt, but that could stop real pages being indexed as you add new products.

Kill Zombies With Weapons

Far better would be to address the Magento behaviour that raises these zombies in the first place. The real problem is that the zombie pages carry an “INDEX,FOLLOW” instruction telling robots to index the page, and a canonical URL tag announcing that this is indeed a unique page. The best solution would be to correct the canonicalisation to tell the robot that what it's reading is actually the last real page in that category. The next best would be to change the robots instruction to “NOINDEX,FOLLOW”. The only practical difference between these two solutions that I can see is that URL canonicalisation would keep the juice from incoming links to zombie URLs. This is probably not a real factor for most e-commerce sites with a natural link profile.
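In head-markup terms, either fix changes what a zombie page like bluewidgets.html?p=8 (in a category that now ends at p=2) emits. These are illustrative snippets, not stock Magento output:

```html
<!-- Option 1 (best): canonicalise the zombie to the last real page -->
<link rel="canonical" href="http://www.example.com/bluewidgets.html?p=2" />

<!-- Option 2: leave the canonical alone but block indexing -->
<meta name="robots" content="NOINDEX,FOLLOW" />
```

With option 1, Google consolidates the zombie URL (and any links pointing at it) into the last real page; with option 2, the zombie simply drops out of the index.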

Happy zombie hunting!