It seems that everyone has a theory about why they and their clients, friends, and neighbors did (not) recover lost rankings with the Google Penguin 3.0 update of October 17, 2014. Remember that date because that is when the first claims of significant changes in traffic began to appear on discussion Websites. Most people did not take notice until Saturday, October 18, however, so you can count on people always citing Saturday as the day that Penguin 3.0 was released into the wild.

I have been canvassing the case studies, as I always do, and as usual I have found barely a handful of interesting stories to read. Most of the Penguin 3.0 articles you’ll find this week are opportunistic nonsense published by people looking for more clients, more subscribers, more followers, etc. A few people are sharing some interesting near-details but they are reluctant to name domains and provide other clear evidence showing why they think their sites were hit by the previous Penguin (released more than a year ago) and why they think they recovered. As you can see from the chart to the side you can only trust about 1-in-20 SEO case studies. I’ll explain where I collected the data for that chart below, but first let me poke some holes in the myths and nonsense that we’re about to be flooded with this week (and this article will prove useful for years to come, because only 5% of all SEO Case Studies are Reliable).

First, Credit for all the Hard Work Done

Yes, a lot of people invested a lot of time and effort in figuring out why they or their clients’ sites were affected by Penguin. For most people this came down to scouring backlink profiles and figuring out which sites to disavow or contact for removal. How well they did this job directly determined how their sites fared under the latest Penguin algorithm.

Google announced in advance that there should be a lot of happy people with this update, and based on a lot of the Tweets and blog posts I have read over the past couple of days I say they were correct: a LOT of people seem to be happy with the results of this update.

FOR THE RECORD: None of Reflective Dynamics’ SEO clients suffered any Panda or Penguin downgrades in the past two years. We don’t believe in setting up clients for the next wave of grief in Internet Marketing with overly enthusiastic SEO practices based on the advice you will find on many SEO blogs and forums. #nuffsaid

As you can see from this second chart, most of the links that have been disavowed or removed were really not hurting these Websites. See below for how we collected this data and you’ll understand this chart better. But in the meantime, let us look at some of the things that a typical SEO link removal service should be telling its clients.

What We Can Confirm about the Penguin Algorithm

Google announced the original Penguin Algorithm on April 24, 2012. This algorithm looked at two signals that Google publicly confirmed: on-page keyword stuffing and “unusual linking patterns” (outgoing links). They hinted that the algorithm was looking at other signals they did not want to disclose. Naturally, 99.98% of all SEOs, content marketers, and other Web spammers zoomed right in on the suspicious linking and virtually ignored the fact that Google led off with an example of on-page keyword stuffing. Of course, a fair number of SEOs know to look for this anyway, so there was probably less harm done than it may appear.

In May 2013 Matt Cutts disclosed that the first generation of the Penguin algorithm “would essentially only look at the home page of a site.” Although this caused a slight stir among SEOs (not to mention some very dubious conclusions from even well-known people in the industry), it was kind of a moot point because Google soon released Penguin 2.0.

All we really know about Penguin 2.0 is that it “dug deeper” into a site. People will argue from now until Doomsday over what that meant. I’ll go out on a limb and say that it meant (at the very least) that the second Penguin algorithm was taking more data into consideration than the original Penguin algorithm.

And now we have come to Penguin 3.0, which Google says is a complete rewrite of the original algorithm and took an entire year to build, test, and validate. For this version, we learned from SMX East that there was a cutoff point for Disavows that would affect the next Penguin release: approximately September 15. In other words, we know that Penguin 3.0 is taking Disavow data into consideration but none submitted after (about) September 15. You will have to wait for future Penguin updates for any Disavow data submitted to Google from about September 15 onward to have an impact.

Much Speculation has Centered on the Disavows

Almost everyone involved with Internet Marketing who has an opinion has firmly declared that Google is reviewing all those Disavows manually to find the bad link networks. Then again, John Mueller says that unless there is a manual penalty, no humans look at the Disavow data. So all those Penguin-related Disavow files have vanished into a computational black hole. Whether the algorithms can use this data to find blog networks is questionable. There are several reasons why that would be very, very hard to do:

A lot of innocent Websites were disavowed

Many blackhats chose NOT to disavow their link networks

The Web spam team has been manually deindexing link networks for years

Whatever data the algorithms might be able to extract from the Disavow files would be very dirty and unreliable. In my opinion Google probably just treats the Disavows as if the links they cover use “rel=’nofollow'” and they leave it at that.
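That hypothesis can be sketched in a few lines of code. To be clear, this is purely illustrative: Google has never published how it processes Disavow files, and every name and value below is hypothetical. The sketch just models the claim that a disavowed link is scored as if it carried "rel='nofollow'", i.e. it passes no value but is not otherwise punished.

```python
def score_inbound_links(links, disavowed_domains):
    """Sum inbound link value, treating disavowed links like nofollow links.

    `links` is a list of (source_domain, value, is_nofollow) tuples.
    All domains and link values here are hypothetical.
    """
    total = 0.0
    for source_domain, value, is_nofollow in links:
        if is_nofollow or source_domain in disavowed_domains:
            continue  # the link simply passes no value
        total += value
    return total

links = [
    ("goodblog.example", 1.0, False),
    ("linkfarm.example", 0.8, False),  # disavowed by the site owner
    ("forum.example", 0.5, True),      # already nofollow
]
print(score_inbound_links(links, {"linkfarm.example"}))  # 1.0
```

Note that under this model a disavow of a genuinely good link (the "innocent Websites" problem above) silently throws its value away, which is exactly the wasted internal PageRank discussed next.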

Yes, that could change if they figure out how to extract useful data from all those Disavow files, but I think we’ll be manufacturing glaciers to reverse global warming before that day comes. Elon Musk will have replicated himself five times over by that day.

What we can take away from this is that a lot of internal PageRank has been wasted by over-zealous or hyper-paranoid marketers who took down links and disavowed them too fast, too soon. That lost PageRank has undoubtedly hurt a lot of Websites.

What The Above Means for This Penguin 3.0 Update

Some of the people who have lost position in the SERPs may have lost it because they disavowed or removed too many links. I don’t know how many people fall into that category, but hopefully very few people who did NOT need to clean up their backlink profiles made the attempt.

Some of the people who have recovered from their previous Penguin downgrades are coming back weaker than they should be. That is because they removed or disavowed too many wrong links. They’ll never know, of course, which links they should have removed because Google is afraid to share that kind of information (besides which, if they did then Danny Sullivan would win the argument over which approach to handling bad links is better and Google would have to admit that they could more easily just ignore the links they don’t trust).

Some people who are tightly integrated with spammy linking neighborhoods may benefit two ways from this update: they disavowed enough bad links that their own downgrades were lifted AND some of the sites linking to them also came out of the Penguin Penalty Box, meaning their links should once again help whichever sites they point to.

But all these recovery stories we’re about to slog through for the next 1-4 weeks will be filled with crazy arguments and bad conclusions based on incomplete or completely wrong interpretations of what Penguin is and does. Technically, no one outside of Google knows what the algorithm does (and I doubt that very many Googlers know much, either).

So What about Those Charts Above?

All the data in the charts above is bogus. I made it up. You see, Timmy, research proves you will believe any bullshit if it looks scientific enough. In fact, the people most likely to believe nonsense accompanied by charts are people who put the most faith in science.

What is true is that I have read hundreds of SEO case studies since 2007, and far more than 1,000 since I started reading them in 1998. The vast majority are pretty badly written. I can’t put an exact percentage on it because it never occurred to me, all those years ago, to start counting the good, the bad, and the ugly. Most of those old case studies are no longer online now anyway (I know this because I have had to clean up many dead outbound links on SEO Theory over the years).

Sure, you get people who hire “experts” to churn out charts and crap, but no computational process is any more reliable than the data that it ingests. The people with the most reliable case studies tend to be the people who don’t have to collect third-party data. These are usually the simple bloggers who just share what happened to them, what they did, and what happened after they stopped doing stuff. Their conclusions are often wrong but their anecdotes are less cluttered by computed garbage.

The larger companies who canvass hundreds or thousands of customer accounts can also produce good case studies, although that is not always the case. That awful click-through rate study by Advanced Web Ranking is a perfect example of a horrible waste of time and effort chasing phantoms. See my comment on Marketing Land (NOTE: they have removed old comments since I wrote this article) for a lengthy list of problems with that study (and the list is nowhere near complete — there are many other problems with the study). Just because you have access to a lot of data does not mean you use it well.

It is NOT easy to reverse-engineer any complex process. But attempting to reverse-engineer a constantly changing process is just plain stupid. People should know better by now than to trust these egregious marketing articles that are filled with charts and specious claims; unfortunately, the majority of Internet Marketers just don’t seem to learn (which is why we continue to see so many search algorithm changes).

Don’t believe an article is correct just because it has charts and screen captures. Question everything. There is no doubting that a lot of Websites have recovered from Google’s Penguin algorithm but we will probably never know exactly how or why they did. Every hypothesis you come across in the next week or two will be countered by some other conjecture. Everyone will have their own data and stories to tell. Most of the stories will be poorly told, much of the data will be unusable, and it’s almost guaranteed that anyone who produces a massive study of Penguin recoveries will — no matter how good the overall report — miss something.

A Word on SEO Algorithm Weathervanes

More than one person noticed this weekend that all the favored SEO indices failed to spot a massive algorithmic change (although it’s debatable how “massive” Penguin 3.0 is, given that it’s targeting Web spam and site recoveries from a relatively small portion of the Web). The only tool I have seen that came close to spotting “something” was SERPWoo (and boy have THEY been telling everyone they saw “something”). So, kudos to SERPWoo, but you have to be able to do this every time, and so far no SEO indices have succeeded in achieving that kind of reliability.

There are many reasons why it’s hard to spot changing trends in real-time when you’re studying search results. Here are a few of the biggest flaws in many popular SEO indices’ strategies:

They crawl Google from random IP addresses

They crawl Google from a single IP address

They crawl the search results for more than 1 hour

They use high search-volume query data

They use “Average Position” data from Webmaster Tools

They use broad data windows (often 1 month)

Google doesn’t show the same search results to everyone. They change the results subtly based on many factors. If I were going to track search results today and I had unlimited resources I would use about 200-300 fixed locations around the globe. I would make sure these locations searched as “desktop”, “laptop”, “tablet”, “smart phone”, and “unsmart phone” devices (but keep in mind there are between 1600 and 2000 different types of environments from which people search, and that number will grow drastically over the next few years).

Every new machine+operating system combination can potentially lead to different search results depending on advertising and how much information the search engines are able to glean from the searcher’s platform and environment.

So you have to capture a lot of data CONCURRENTLY from a large selection of environments. Then you can attempt to normalize the data, but given that no one has even begun to address this need there are no algorithms available to help with that normalization. Without normalization you’re stuck using aggregated data that is just a mess to work with statistically. And the statisticians will debate all day and night over the value of normalization. The odds are against you before you have finished designing your new SEO tool on the napkin.

SEO index developers like the high-volume keywords. This is the worst possible data set to use when you want to diagnose an algorithm because all the competitive tactics that marketers use to promote their sites high in the search results for those keywords will skew the data. That, of course, is why all “ranking factors” lists and studies are completely useless nonsense. The people who put these studies together do not know what they are dealing with; therefore they do not know what they are doing.

But they publish lots of pretty charts, don’t they? And because they publish charts you will be fooled into believing that they actually have some sort of insights into how Google works. They don’t.

If you’re going to study ranking algorithms you need to do so in real-time and you need to capture a lot of data and crunch it very quickly because according to the search engines they are making changes to their ranking systems every week. If you wait 1-2 days your data is obsolete and so are your conclusions. We can safely say that as of today all ranking studies are 100% invalid, because none of them were built and published within the last hour.

Is it possible to reverse-engineer the static algorithms like Panda and Penguin, though? Maybe. If you can collect enough data about them you may be able to figure out what triggers they are using. But that takes a lot of effort, and a lot of data. I can count on one hand the number of companies I think could have a shot at doing this reliably, and most of them are companies you would not put on that list.

As we slog through all the Penguin Recovery articles sure to be published in the next few weeks let us remember that, yes, a lot of hard work went into these recoveries and they probably would NOT have happened without that hard work. The SEO agencies and consultants who dug into all that data and helped detoxify the backlink profiles deserve to be congratulated for getting the job done.

There IS collateral damage but as I always point out, we can fix just about anything in SEO, given enough time. The collateral damage may be extensive but there is no way to measure it. So as soon as you have finished sipping champagne with the clients remember that there will be more mistakes to avoid and repair, especially if you or they continue to follow the really bad advice that got you all into this mess to begin with.

There are no excuses for NOT challenging all the SEO case studies that are published every month. You should have learned by now NOT to be so gullible.