Intro

Let me begin by taking a moment to thank John Doherty and the rest of the Distilled Team for giving me a chance to share some of this research with you today. I’ve had the pleasure of getting to know the members of Distilled through their annual SearchLove conference, which was an amazing experience. Lynsey and Lauren definitely know how to plan an event, and their hospitality was amazing. I experienced this firsthand as a very last-minute attendee of SearchLove (I emailed Lynsey at 3am the day of the event and she got back to me in plenty of time to arrive for the first speakers; thanks again Lynsey!).

On that note, if you didn’t attend, not only did you miss a special appearance by EmCee Hammer, you also missed Mike King’s awesome presentation “Targeting Humans,” which contained a brief section on my recent paper “GoogleBot is Chrome: The Jig is Up… or why a search giant built a browser.”

You missed plenty of other things too… The speakers were great, and all the presentations were packed with usable information and handy resources. The after-hours meet and greets were no less interesting, and that’s really where Distilled’s hospitality shone through. If you didn’t attend, you really lost out. Be smart, learn from your mistake, and plan to be there next time around. Also, don’t forget to check out the Live Blogs and Wrap Ups that are sure to be floating around on Distilled and other places like Outspoken Media or SEOmoz.

In case you haven’t read it, “GoogleBot is Chrome” outlines a theory that Google’s Search Crawler GoogleBot is actually built off of the Chrome Web Browser, and may even have been the primary reason for the development of Chrome. If this is true, it leads us to believe that GoogleBot is a lot more intelligent and capable than Google is letting on, and we’ll have to totally abandon the dated notion that GoogleBot is looking at a website in a manner similar to Lynx or other text-only browsers.

Instead we may have to face a reality where GoogleBot’s capabilities rival that of any other web browser, being able to crawl and index DOM Transformations from AJAX or JavaScript Libraries like jQuery.

You can catch the full version of the original paper at Mike King’s site if you want to dive into the initial evidence. This post is a follow-up to the original, as a wealth of new data has come to light since the launch at SearchLove.

Stay tuned, and we’ll cover some recent revelations from Google, and some potentially game changing proof of concepts that show us Google isn’t being totally straight with us on their indexing capabilities.

Recent Revelations

https://twitter.com/#!/mattcutts/status/131425949597179904

A few hours after the paper debuted at SearchLove, Matt Cutts tweeted a confirmation regarding a recent “Digital Inspiration” article that observed GoogleBot is capable of crawling the AJAX content found in the Facebook Comments Plugin that’s so popular across the web. Based on the patent evidence recently uncovered, this was a strong indicator that GoogleBot is closer to Chrome than Lynx, especially since this functionality was only casually confirmed rather than publicly announced. This capability probably wasn’t something new to the search stack, just something the SEO Industry recently noticed.

Google also confirmed that they are applying analysis on content “above the fold,” something that is hinted at in their Visual Gap Analysis patents.

The functionality already seems to have rolled out via some Instant Previews (which are absolutely generated via a Browser - http://sites.google.com/site/webmasterhelpforum/en/faq-instant-previews#02).

The Instant Preview shows a jagged cut representing the boundary of the non-scrolling viewable space, commonly referred to as “The Fold.” The content extracted by GoogleBot and highlighted in the Instant Preview comes from below The Fold, which Google seems to have been able to accurately identify, and it appears to have been highlighted because Google considered it relevant for ranking purposes. While more testing on this is definitely needed, it still suggests that Google’s capabilities are more in line with the patent evidence than their public announcements would have us believe.

For an example of how precise the content extraction below the fold can get, we need only turn to the query “GoogleBot is Chrome,” where the highly accurate excerpt “What if Chrome was in fact a repackaging of their search crawler; affectionately known as GoogleBot, for the consumer environment?” can be found in the Instant Preview for the article over on IPullRank.com.

That sentence was actually written as a summary of the introduction, and to see my intent so accurately extracted is pretty impressive. Google’s ability to accurately identify, extract, and visually present content is a sign of an increasingly complex spider moving well beyond the average “Lynx-like” crawler we tend to think of in the SEO World.

With that in mind, and with some prompting from John Doherty (thanks for the heads up!), I took some time to do a little digging and see how advanced GoogleBot really is when it comes to JavaScript.

What we learned is a Game Changer…

Thanks to all the feedback I received, I was really dead set on trying to uncover some “Smoking Gun” evidence that GoogleBot was behaving more like a browser than a semi-smart text-based crawler. Patent evidence is great, but is often without context… protecting intellectual property via patenting often comes well before public implementation.

A good portion of the things I found were to be expected, a few were pretty odd, and one example was so interesting it may change the way we think about GoogleBot and the Link Graph.

The Expected

A quick Google Search using the “Filetype” advanced operator shows Google is definitely finding and indexing JavaScript files, and even CSS files. Repositories like Google Code and Github make it a little harder to find literal examples, but if you set your search settings to return 100 results at a time and scroll, you’ll find more than a few on the first page. Look for the URLs without text descriptions or titles, as those tend to be actual CSS or JS files.

Even in the expected, there are some interesting standouts; like locally hosted Google Analytics JS files, jQuery, and Drupal Theme Style Sheets occurring with an interesting frequency.
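For the curious, the queries involved look like the following (the filetype operator is real; the search terms are simply the ones that surface the standouts mentioned above):

```
filetype:js analytics     (locally hosted Google Analytics files)
filetype:js jquery        (indexed copies of the jQuery library)
filetype:css drupal       (Drupal theme style sheets)
```

Remember to bump your search settings up to 100 results per page, and skim for the listings without titles or descriptions.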

The Odd

This next example could belong in either “The Odd” or “The Expected” depending on your own understanding of Google’s crawling and indexing technology. I’m inclined to think that this kind of JavaScript indexing is common… but if you follow the party line that GoogleBot is a text-based crawler with specialized JavaScript capabilities, then it’s definitely odd.

Google indexes the phrase “first name” on LasVegasTickets.com’s ticket results page thanks to the newsletter widget embedded at the top of the right-hand navigation. Oddly enough, this is actually a JavaScript widget from a third-party contact solution.

A quick visit to the same page using the Firefox NoScript plugin gives us a default NOSCRIPT text warning us that we need JavaScript for the widget to work. Google also indexes this warning text as part of the page content. A quick scan of the HTML source using Firebug shows that there’s no DIV with the proper name anywhere in the page source.

Re-enabling JavaScript allows the script to execute and instead of the warning message, we’re greeted with a straightforward opt-in form for a newsletter. A search of the source with Firebug shows a hit for the DIV ID “emma_member_first_name,” allowing us to infer that the script is using DOM injection techniques to add the DIV to the DOM after the page has loaded, and then probably using some CSS to ensure that the layer shows on top of the other element.
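A minimal sketch of the pattern in play (the DIV ID is the one found via Firebug; the surrounding markup and script are my reconstruction of the technique, not the vendor’s actual code):

```html
<!-- Fallback content for visitors (and crawlers) without JavaScript -->
<div id="newsletter_signup">
  <noscript>Please enable JavaScript to use the newsletter signup form.</noscript>
</div>

<script type="text/javascript">
  // After the page loads, inject the opt-in form into the DOM,
  // layering it over the NOSCRIPT fallback.
  var field = document.createElement("div");
  field.id = "emma_member_first_name";
  field.innerHTML = '<label>First Name</label><input type="text" name="first_name">';
  document.getElementById("newsletter_signup").appendChild(field);
</script>
```

A text-only crawler would only ever see the NOSCRIPT warning; a browser-based crawler executing the script would also see the injected form, and Google appears to be indexing both.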

This elegant, low-tech solution is a great example of progressive enhancement in web development, which serves to accommodate users who disable or lack JavaScript support. The Flash replacement tool SWFObject is another example of this DOM injection methodology being used to support a User/Search Friendly experience through progressive enhancement techniques... that aside, it’s a JavaScript DOM Transformation, and Google really shouldn’t be able to index it if it’s closer to a text-only browser than Chrome.

It’s definitely an odd item to train a custom script parser to execute, as it doesn’t really add any value to the index. On the other hand, it would be trivial to index this item if your crawler were a browser, already executing scripts by default.

This wasn’t even the oddest example from the bunch though… Google seems to have indexed 530 versions of TweetMeme’s ad serving script at ads.tweetmeme.com/serve.js.

This behavior is odd and unexpected largely because the differing versions don’t add anything substantial to the index. The file only outputs a little inline CSS and the JavaScript to embed the image, so there’s nothing for the crawler to really get at. Ultimately this kind of behavior speaks to the quality of the Indexing technology, and we can infer one of three things:

1. The indexer is very “greedy” from a programming standpoint; that is to say, it tries to capture as much content as it can, whether or not it can do anything with it at this point in time.
2. The indexer is very ignorant and doesn’t recognize the duplication it’s indexing.
3. The indexer is very intelligent and is able to identify differences in these URLs beyond the HTML Markup.

I personally find 2 to be pretty unlikely this late in the game, though there was a time when it was probably quite true. 1 is very likely considering Google’s stated goal of indexing the world’s information and making it useful, but it doesn’t take the full scope of Google’s capabilities into consideration. Either way, without some intelligence in place, these options both have scary implications for search quality.

The Instant Preview clearly shows a rendered ad, and we know from our previous digging that Google is deploying some form of Visual Analysis to the Index, and is able to correlate that analysis back to the Instant Preview.

Taking all of this into consideration, it’s very possible that Google is monitoring file size and page load time, and maybe deploying OCR or other visual analysis to identify differences in these files… making 3 a very viable option as well.

The Interesting

When I first started putting together this research, I ended up running into a lot of dead ends; Search Friendly design has spread far and wide, leading to many instances where content I thought was being created by JavaScript was really just a hidden DIV being brought into view through CSS Transformations. I literally sorted through hundreds of ticket sites, hotels, airlines, and news sites before I found this surprising example.

The first time I tried to grab an Instant Preview of “wcyb.sportsdirectinc.com” I couldn’t help but notice it didn’t give a preview. Curious, I opened the page with the NoScript plugin enabled, and found myself facing a warning for a JavaScript-based redirect.

This made the perfect opportunity to test the limits of GoogleBot’s JavaScript Capability, so I let the page load without NoScript and took some text from the page I ended up on after the redirect.

Not only did Google manage to Index some of the text, but this time the Instant Preview worked… and it showed the redirect’s destination page!

The indexed text and Instant Preview being displayed ended up matching the redirect destination, while the URL displayed matched the source of the redirect.

This suggests that not only is GoogleBot executing JavaScript, it seems to be following JavaScript Redirects and treating them as 302 Redirects. This isn’t the most interesting part though; it seems Bing is doing this too!
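For reference, the kind of redirect in question looks something like this (a hypothetical reconstruction of the technique; the destination URL is illustrative, not the one on the actual site):

```html
<html>
  <head>
    <script type="text/javascript">
      // No 3xx status code ever crosses the wire; only a crawler that
      // actually executes scripts ever reaches the destination page.
      window.location.replace("http://example.com/destination-page");
    </script>
  </head>
  <body>
    <noscript>This page requires JavaScript to continue.</noscript>
  </body>
</html>
```

Treating this like a 302 is a sensible interpretation: as with a temporary redirect, the source URL is the one displayed while the destination supplies the content.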

The true scope of Bing’s ability to execute and index JavaScript is unknown, though Microsoft is also the holder of patents related to Headless Browser Based Crawlers. Oddly enough, BingBot was able to keep pace with Google when it came to the JavaScript redirect, but for whatever reason did not index the contact form from LasVegasTickets.com. This could be due to search quality concerns or technological limitations, but either way it’s interesting…

Conclusions

Google seems to favor a cycle of internal innovation, followed by public announcement and user experience enhancement. When Google deploys new technology to the indexing stack, they probably want to let it burn in and gather data to compute meaningful features for extraction. The Google N-Gram Corpus, for example, helped make spam detection and natural language processing based topic modeling a reality for Google, and was only possible once significant amounts of textual data were absorbed into the index.

Once the functionality is actively affecting the Index, rather than being used primarily as a learning tool for the Index, we tend to see a related User Experience enhancement which not only exposes the new functionality to enhance the search experience, but also allows Google to gather usability data on the behavioral impact of the new data.

For example, Universal Search came about as Google began gathering lots of business data, map data, and news data.

In time both video and social were integrated into the experience… an outgrowth of the real time indexing capabilities Google had been focusing on adding.

I believe Google Instant Preview is the user experience enhancement that heralds the inclusion of Visual Analysis and Browser Based Crawlers.

Ultimately, all we can do is make educated guesses about the exact nature of Google’s (and Bing’s) true indexing capabilities as we’re on the outside looking in. The Patent Evidence and the public statements tell conflicting stories, and with the research lining up more with the Patent Evidence than the public announcements we should probably wonder what benefit Google may derive from keeping us in the dark about their indexing capabilities.

It’s highly possible that these innovations are still being tested, or are only rolled out to certain crawling servers, though it’s more likely that Google is seeking to avoid contaminating their data pool. Google has used the power of Big Data very effectively in the past, and the ability to learn from the trends of the Web gives them a competitive edge.

Search Quality is probably the primary motivator for keeping their full capabilities under wraps… the more we know, the more feasible it becomes to attempt to “game” the Algorithm, which ultimately taints their data pool. Features are most meaningful when they’re a natural outgrowth of the behaviors of real users.

Profit can never be overlooked as a factor either; Google is built on the back of Google Web Search. The AdWords Empire only matters because of the Algorithm… and people use Google Analytics and Webmaster Tools to learn to please the Algorithm. If more people could effectively game the Algorithm, it would impact search quality and ultimately affect Google’s bottom line.

And perhaps the most persuasive argument is from the perspective of innovation… the less brash young upstarts understand about Google’s capabilities, the less likely they will be to field a truly viable competitor. Google has tons of potentially world changing innovations simply sitting dormant waiting for the right market and model to help them monetize it effectively.

My personal favorite example of this is Google Translate, which is one of the most accurate machine translation tools on the planet. Google almost sacked it because it was not profitable, and had it not been for public outcry we may have lost access to this technology altogether. Tools such as this truly have the ability to impact the entire world. Imagine being able to combine this with the Speech to Text commonly found on Android Devices to create an instant translator for disaster relief operations.

This amazing capability is simply a side effect of indexing volumes of data… it’s an accidental innovation that almost got mothballed in favor of making sure you can +1 that YouTube video of the kitten sleeping in a tea cup.

That said, the jig is still up Google… and it is game time. Who’s ready to help build the future?