There are too many recent articles about the Google Penguin update for me to have read more than a small percentage of them. It never fails to amaze me how dozens of people will publish opportunistic articles within hours or days of a major algorithmic event. These articles promise to give you the cure or the explanation or both behind the algorithm. None of these people are qualified to do that. One blog post in particular received a lot of attention. You know the one.

This is the one that annoyed me the most: Google Announced 50+ Search Updates, Which Are Penguin Related?. That was written by Barry Schwartz on Search Engine Roundtable and there is absolutely nothing wrong with the article. Barry is just doing the obvious, and he avoids all the opportunistic hype. In fact, he invites speculation from his readers.

But here’s the problem I have with your article, Barry: You beat me to it.

Frankly, I just haven’t had much time to think about Penguin lately. I’ve actually been more concerned about the OTHER major April update from Google — the one that occurred around April 15/16. I would have expected a lot of complaints about that one but I guess it was just targeted at me. I have one site that lost a lot of Google traffic around April 15. I’m really not sure why.

Google Published 52 April Algorithmic Changes

So Matt Cutts listed 52 changes Google made to its algorithms in April. Barry picked six as his candidates for the most likely Penguin “events”. I don’t like second-guessing search engines with lists of things because there is no way to determine who is right or wrong, except to get someone at the search engine to confess the details.

Still, if you held a gun to my head and forced me to pick something, I would go with maybe at most two changes for Penguin:

Anchors bug fix. [launch codename “Organochloride”, project codename “Anchors”] This change fixed a bug related to our handling of anchors.

Keyword stuffing classifier improvement. [project codename “Spam”] We have classifiers designed to detect when a website is keyword stuffing. This change made the keyword stuffing classifier better.

These are the only two changes that closely match the examples Matt gave in the April 24 announcement that accompanied the Penguin update. You probably don’t need more than these two types of changes to target Webspam.

But how do you know? Google doesn’t say when any specific change rolled out. So various agencies and individuals fired up their correlation machines, crunched numbers and rankings, and spat out analyses of what Penguin may have affected. Really, they provided no enlightenment into what happened.

Search Algorithmic Updates Are Not Isolated Events

The chief reason why your attempts to decipher search engine algorithm secrets fail is that you don’t even know where to begin. A few people dreamed up lists of “search ranking factors” that have little connection to reality, yet surveys and whole libraries of SEO analysis have been built on those ranking factor lists, and people treat these collections of guesses as authoritative.

Ongoing attempts to identify these ranking factors are being conducted in the dynamic environment of the searchable Web ecosystem: people are changing their queries, search engines are changing their algorithms, and publishers are changing the content-and-link mix. So amid all these changes you would have to be pretty fast at grabbing a snapshot of the entire ecosystem before you could even begin to build a credible analysis. In other words, you MUST capture a single state of the ecosystem in order to have any hope of accurately deciphering what is going on.

Instead what people are capturing is a smeared state, a blending of portions of multiple consecutive states of the ecosystem. The smeared state essentially invalidates all hypotheses. To the best of my knowledge, we don’t yet have a science that even attempts to define the nature of a smeared state, much less propose any methods to analyze it. You might as well be trying to break down the components of Time — and so all your statistical tools and models are worthless.

You can certainly correlate things in sub-sections of the searchable Web ecosystem. For example, you can isolate particular Website behaviors and correlate them with changes in search referral traffic. That kind of correlation does nothing to prove cause-and-effect but it sure points a big fat finger at something over which you (the publisher) have some control.
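To show what I mean by that kind of narrow correlation, here is a minimal sketch in Python. The weekly numbers and the “exact-match anchor links acquired” behavior are invented for illustration; in practice you would pull both series from your own analytics and link-tracking exports.

```python
# A minimal sketch of correlating one publisher-controlled behavior with Google
# referral traffic. All numbers below are made up for illustration.
from math import sqrt

# Hypothetical weekly data for one site.
anchor_links_acquired = [2, 5, 9, 14, 22, 31, 40, 38, 12, 3]    # links built per week
google_referrals      = [4100, 4300, 4700, 5200, 5900, 6400,
                         6800, 3100, 2600, 2500]                 # organic visits per week

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(f"link building vs. referrals: r = {pearson(anchor_links_acquired, google_referrals):+.2f}")
```

Whatever coefficient comes out of a sketch like that, it only points at something you control; it says nothing about why the traffic moved when it did.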

Matt Cutts Gave You A Major Clue About Penguin

Despite Google’s reluctance to share much detailed information concerning the Penguin “over-optimization” update, Matt shared two very specific examples of Web documents that are being targeted. One example was a classic keyword-stuffed page (which, if Google is only just now able to identify that kind of stuff, once again offers proof that Google is NOT using any sort of latent semantic technology); the other example was a spun article in which irrelevant link anchors were scattered throughout the text.
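To make the keyword-stuffed example a little more concrete, here is a toy keyword-density check. This is my own naive heuristic with arbitrary cutoffs, not a description of Google’s classifier:

```python
# A toy keyword-density check; the threshold and occurrence floor are arbitrary
# choices for this sketch, not values Google has ever published.
import re
from collections import Counter

def stuffing_report(text, threshold=0.10, min_count=5):
    """Return the share of all words taken up by any term that dominates the page."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    return {w: round(c / total, 3)
            for w, c in counts.items()
            if c >= min_count and c / total > threshold}

stuffed_page = ("cheap widgets best cheap widgets buy cheap widgets online "
                "cheap widgets sale cheap widgets free shipping cheap widgets ") * 3
print(stuffing_report(stuffed_page))   # a couple of terms dominate the text
```

On a page written for readers a report like that usually comes back empty; on the kind of page Matt showed, a handful of terms swallow a big share of the text.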

And Matt also warned that the Penguin algorithm can detect much more subtle stuff. Although he offered no specific information, he did link to this part of Google’s Webmaster Guidelines, which can be summarized as:

“Make pages primarily for users, not for search engines”

Ask yourself “would I do this if search engines didn’t exist?”

“…avoid links to web spammers or ‘bad neighborhoods’ on the web”

Don’t use rank-checking software (probably not related to Penguin)

Avoid hidden text and hidden links (probably not related to Penguin)

Don’t use cloaking or sneaky redirects

Don’t send automated queries to Google (see below)

Don’t load pages with irrelevant keywords

Don’t create multiple pages, subdomains, or domains with substantially duplicate content

Don’t do malicious stuff (probably not related to Penguin)

Avoid doorway pages (probably not related to Penguin)

I think the first point really narrows the field on what Penguin is looking for: stuff that was intended for search engines, not users. That would include any text intended to make pages look relevant to queries, any text intended to make pages look like “real content”, any links intended to boost rankings for other documents/sites, and in general anything that the average passionate, uninitiated blogger would NOT be likely to do on a Web where search engines don’t exist or don’t matter.

These are, in my opinion, pretty clear-cut guidelines. You got nailed by Penguin because you were doing stuff for search engines. You may feel like you should not have been nailed by Penguin, but Google never said it was trying to follow your line.

The fact that Google came back and asked people to submit sites they felt were wrongly penalized suggests to me that maybe Penguin works like Panda, in that it’s a learning algorithm and perhaps Google may expand the learning set to help some sites recover (in a few weeks or a few months). The first signs of recovery from Panda came about 5 months after the algorithm rolled out. So look for improvement in another 1-4 months.

What Did Google Do About Automated Queries?

There were two updates that drew my attention and curiosity:

No freshness boost for low-quality content. [launch codename “NoRot”, project codename “Freshness”] We have modified a classifier we use to promote fresh content to exclude fresh content identified as particularly low-quality.

Fewer autocomplete predictions leading to low-quality results. [launch codename “Queens5”, project codename “Autocomplete”] We’ve rolled out a change designed to show fewer autocomplete predictions leading to low-quality results.

There are people who have boasted/claimed that they could influence Google’s autocomplete predictions by running automated queries (or paying a lot of low-cost people to manually type in the queries). The Freshness algorithm might be influenced by a recent spike in queries (but please don’t quote me as an authority on how Google determines or uses Freshness).

Query Spam has been around for a long time. The earliest example of query spam that was pointed out to me was the way some agencies manipulated the Direct Hit algorithm: they ran queries through sophisticated networks and clicked on the desired results, thus influencing Direct Hit to elevate those listings in the search results. That same technology could have been adapted by people engaging in click fraud on PPC listings. So Query Spam has a long history, and it has probably been more of a problem for search engines than they want to divulge.

Getting Back to the April 15 Google Update

I see a number of “low-quality content” references in Matt’s list of 52 updates. I don’t believe my Website should have been affected by such updates, but then Google has never really given us clear direction on what it considers to be “low quality content”. When Google says “low quality content” I think about the Panda algorithm. Panda took out two of my Websites last year. One of them (Xenite.Org) had simply polluted itself with thousands of broken Perl scripts, ruining the pages and the user experience (and the fact that I was depending on Perl to help populate so many pages was a big red flag). I got rid of the Perl and those pages, republished the most popular hand-written articles in a new template, and Google kindly restored my traffic very quickly.

The other Website didn’t recover until I took all the articles off of it. They were all original, unique, hand-written articles. I moved them to another domain which seems to be doing okay now. The original domain is also doing okay (except it no longer has a lot of content so it doesn’t get much traffic). As far as I can tell there is nothing wrong with the articles. I didn’t write them so they tended to be shorter than what I normally write, but not in all cases.

As for the site that took a hit from April 15 — well, those were very different articles. Most of them were much, much longer — and they included pictures, and linked out to other Websites more liberally than the Pandalyzed site, etc. I don’t know what the April 15 algorithm change was targeting but I don’t think it was really the depth and structure of articles.

The fact that Google added a new tier for content in its index — apparently with a different crawl rate from the other two tiers — suggests to me that Google is trying to hedge its bets on the “low quality” stuff. I’m not sure that is a good sign; instead, it may simply indicate that Google is moving on to deal with other priorities and it will leave the questionable stuff at a low tier for future consideration.

In other words, as always I don’t have enough information to really see what happened. I can only see the effect of what happened. And given that so few people seem to have felt the pain of April 15 I can only guess that my site was doing something naughty by Google’s standards — not necessarily in terms of Web spam but rather in terms of failing to differentiate itself from other sites.

What Really Matters to the Google Algorithm These Days?

I can’t give you any lists of ranking factors but I can point to a small selection of things that distinguish much of what Google is doing.

Locally-relevant information is being promoted to the top of SERPs (it may or may not be accurate or more useful than other information)

Content built to influence search results through keyword stuffing or links is being demoted or delisted

Content associated with strong signals of quality and authority is being given preference in SERPs

On-page content is being analyzed more deeply and thoroughly and measured in new ways to find better results

Of course, plenty of counter-examples have been discussed in various Web forums. People have shared some outrageously bad results, though to its credit Google has quickly made adjustments on some of the more egregious errors. But what that tells me is that as Google tightens the screws and pushes sites away from the homogeneous middle toward the extreme ends of the spectrum, we have to expect some false positives to emerge on occasion.

So that leaves me wondering if my April 15 issue is a false-positive or if there is something deeper going on. I certainly made some changes to the site in question; perhaps they were too little, too late. I have to wait out the storm and see what happens. Unfortunately, unlike the Penguin update, Google hasn’t handed anyone any specific clues about what to look for.

If you only lost traffic in Penguin you’re lucky. At least you know where to look for your problem. You may not see the solution but if all else fails, wipe your site clean and start over. This time around don’t do anything “for SEO”. That’s always been a bad play.

Search engine optimization is NOT and never has been about “doing things for SEO”. That’s like saying you drive a car “for driving”. Sure, you can love getting in your car and taking it for a spin around the block but you’re not going to make much money doing that and you sure won’t impress many people. So “doing stuff for SEO” is pretty much asking for a search algorithm downgrade.

You can point to all the correlation studies you have come to love and trust but at the end of the day if you don’t change the way you publish your Website you’ll need to take Google out of your marketing plan.

And that just might be the best thing you can do for your search engine optimization.