Do you really think someone would do that? Just go on the Internet and tell lies?

While there are ways to detect whether a request is actually from Googlebot or whether the user-agent is being spoofed, many sites will accept the user-agent at face value, which often results in behavior that is somewhere between "moderately amusing" and "occasionally useful."

User-Agent Spoofing Overview

What is a user-agent?

When a web browser, bot, or other client makes a request to a webserver for a webpage, it provides various pieces of metadata to help the server deliver content that will work best for that client. One such piece of information is the user-agent. As described by the Mozilla developer documentation:

The User-Agent request header contains a characteristic string that allows the network protocol peers to identify the application type, operating system, software vendor or software version of the requesting software user agent. Source: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent

For a regular human user, this is all handled behind the scenes by your browser. For example, a user who is running Opera 12 on Windows XP may have a user agent along these lines:

Opera/9.80 (Windows NT 5.1) Presto/2.12.388 Version/12.18

This information may then be used by the server to adjust the user experience. For example, if the user-agent indicates that a user is on a mobile device, the server may return a version of the website optimized for mobile users.
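As a minimal sketch of what this server-side logic can look like (illustrative only; real sites typically use a full user-agent parsing library rather than naive substring checks, and the function and template names here are made up):

```python
def is_mobile(user_agent: str) -> bool:
    """Crudely guess whether a user-agent string belongs to a mobile browser."""
    return any(token in user_agent for token in ("Mobile", "Android", "iPhone"))

def pick_template(user_agent: str) -> str:
    """Return the page variant the server might serve for this client."""
    return "mobile.html" if is_mobile(user_agent) else "desktop.html"

print(pick_template("Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X)"))      # mobile.html
print(pick_template("Opera/9.80 (Windows NT 5.1) Presto/2.12.388 Version/12.18"))   # desktop.html
```

The key point is that the server's only source of truth is the string the client chose to send.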

What is user-agent spoofing?

Since user-agents are reported entirely by the client (such as your browser), they can be altered to any arbitrary string at the discretion of the user. The video below demonstrates a hacker spoofing his user-agent:

Providing a user-agent that differs from the default or "correct" user-agent that is otherwise sent by your browser is known as user-agent spoofing (for example, identifying as Firefox on Linux when you are running Opera on Windows 10).
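Since the user-agent is just a request header, spoofing one programmatically is a one-liner. A sketch using Python's standard library (the URL and Firefox-style user-agent string are arbitrary examples):

```python
import urllib.request

# Identify as Firefox on Linux, regardless of what client actually sends this.
spoofed_ua = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"

req = urllib.request.Request(
    "https://example.com/",
    headers={"User-Agent": spoofed_ua},
)

# The request now carries the spoofed string instead of the default
# Python-urllib/3.x identifier.
print(req.get_header("User-agent"))
```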

What does it mean to spoof a user-agent as Googlebot?

Robots that make requests to websites generally provide a user-agent that identifies them and provides additional information about the bot (although not all bots do so, and some may provide inaccurate or misleading user-agents). Since user-agents are set at the sole discretion of the user (or bot-creator, in this case), the string can be anything the developers would like to set it to. The user-agents that are set for Googlebot by Google's engineers are generally strings similar to the following:

Googlebot/2.1 (+http://www.googlebot.com/bot.html)

In addition to the classic Googlebot, Google also has a number of other bots, as listed in Google's documentation on its crawlers.

Spoofing a user-agent as Googlebot is setting your user-agent (or the user-agent of a bot that you've created) to a user-agent that self-identifies your request as coming from Googlebot.

How to Spoof a User-Agent

Why Spoof a User-Agent as Googlebot?

As stated in the introduction of this article:

While there are ways to detect whether a request is actually from Googlebot or whether the user-agent is being spoofed, many sites will accept the user-agent at face value, which often results in behavior that is somewhere between "moderately amusing" and "occasionally useful."
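For context, the detection mentioned above is usually done with reverse-DNS verification: resolve the requesting IP to a hostname, check that the hostname belongs to Google, then forward-resolve the hostname and confirm it points back to the same IP. A sketch of that procedure (the DNS lookups obviously require network access; the domain-suffix check is the testable core):

```python
import socket

def hostname_is_google(hostname: str) -> bool:
    """Genuine Googlebot reverse-DNS hostnames end in googlebot.com or google.com."""
    return hostname.endswith((".googlebot.com", ".google.com"))

def verify_googlebot(ip: str) -> bool:
    """Reverse-DNS the IP, check the domain, then forward-confirm that the
    hostname resolves back to the same IP. Returns False on any lookup failure."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    if not hostname_is_google(hostname):
        return False
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
```

Sites that skip this check, and trust the header alone, are the ones this article is about.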

How to Spoof Your User-Agent as Googlebot

In general, the easiest way to spoof your user-agent (when browsing as a human, as opposed to a bot) is to install a user-agent spoofing browser extension. These extensions generally work fine for Googlebot, Bing's and Yahoo!'s crawlers, and most other common user-agents.

To spoof your user agent when writing a custom bot, a function or method will generally be built into any major programming language or library that is commonly used for web crawling and/or scraping.
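For example, with Python's standard library the whole "custom bot" amounts to a few lines (a sketch; the helper name is made up, and the user-agent string is the Googlebot example quoted earlier in this article):

```python
import urllib.request

# The Googlebot user-agent string shown earlier; the site only sees what we send.
GOOGLEBOT_UA = "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

def fetch_as_googlebot(url: str) -> bytes:
    """Fetch a URL while self-identifying as Googlebot."""
    req = urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Libraries commonly used for crawling (requests, Scrapy, and so on) expose the same knob, usually as a `headers` argument or a crawler-wide setting.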

Quality Hacks Achievable by Spoofing Your User-Agent as Googlebot

Quora, Forbes, and Tumblr are classic examples of three use cases for spoofing a user-agent as Googlebot:

Getting around flexible sampling restrictions, such as metering and lead-in restrictions (Quora)

Avoiding advertisements (Forbes)

Accessing login-required areas of websites without logging-in (Tumblr)

Quora

Quora.com uses a classic example of flexible sampling. When a logged-out user first clicks through to Quora (likely from a search result, since Quora's traffic relies heavily on ranking for longtail keyword queries in the form of questions), they can view the full page they clicked to. Upon clicking any link to another page, a log-in prompt pops up, requiring the user to log in to continue.

However, since Googlebot does not log into websites, this would cripple Googlebot's ability to crawl Quora and severely hinder Quora's ability to rank in the SERPs. Quora therefore makes an exception for Googlebot by checking for the Googlebot user-agent. Setting your own user-agent to Googlebot lets you slip through the same exception.
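A sketch of the server-side logic just described (hypothetical, not Quora's actual code): logged-out humans hit a login wall after their first page, but anything claiming to be Googlebot is exempt.

```python
def should_show_login_wall(user_agent: str, logged_in: bool, pages_viewed: int) -> bool:
    """Decide whether to block this request behind a log-in prompt."""
    if logged_in:
        return False
    if "Googlebot" in user_agent:      # the exemption a spoofed UA exploits
        return False
    return pages_viewed >= 1           # metering: the first click-through is free

print(should_show_login_wall("Mozilla/5.0 (Windows NT 10.0)", False, 2))                 # True
print(should_show_login_wall("Googlebot/2.1 (+http://www.googlebot.com/bot.html)", False, 2))  # False
```

The exemption is keyed entirely on the header, which is exactly why spoofing it works.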

Forbes

The enormous interstitial advertisement that Forbes shows when users first land on the site is famously insufferable. Spoofing your user-agent as Googlebot prevents the advertisement from showing, since serving it to crawlers would make Forbes far more obnoxious and time-consuming to crawl.

Of course, now you're reading content on Forbes.com, which is somehow even worse.

Tumblr

Until roughly four seconds ago, Tumblr consisted entirely of Internet pornography. Search results would often surface pages that then required you to log in: graphic and/or adult content sat behind an age-verification/NSFW filter that only logged-in users could pass.

However, since Googlebot does not log into websites, an exception is made for Googlebot. By spoofing your user-agent as Googlebot, you can access Tumblr's age-restricted material without logging in.
