In the past year, I’ve noticed an alarming trend of referral spam creeping into my Google Analytics reports. Referral spam is the practice of sending bogus referral traffic to a website or product. It may sound relatively harmless, but referral spam is quickly turning into a serious issue.



Types of Referral Spam

In the context of Google Analytics, referral spam comes in two main flavors: spammy web crawlers and ghost referral traffic.

Web crawlers are robots that visit websites, usually with the intention of indexing content. Most web crawlers identify themselves as such to web servers and are then left out of analytics reports. However, some web crawlers, like those from Semalt (boo!), don’t identify themselves as robots and show up in analytics reports as sessions with a 100% bounce rate and a 0-second duration. Google recently introduced a feature to filter out known bots and spiders, though it’s definitely not perfect (more on that later).

Ghost referral traffic, arguably the greater of the two referral spam evils, never actually visits a website. In these cases, spammers exploit the fact that Google Analytics now accepts data via HTTP requests sent directly to its servers (the Measurement Protocol), meaning someone can “spoof” a session very easily. A simple program can send fake HTTP requests aimed at different Google Analytics properties without ever loading the sites behind them. Even more annoying, this type of spam can also spoof organic search results and send false events. See the screenshot below for an example:

Note: For ghost referral traffic, modifying .htaccess won’t help at all, since these spammers never actually visit your site. For more information, view Google's Measurement Protocol documentation.
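To make the "ghost" part concrete, here is a minimal sketch that decodes what a Measurement Protocol hit looks like on the wire. The parameter names (v, tid, cid, t, dh, dp, dr) come from Google's public Universal Analytics documentation; the payload itself, including the property ID and both domains, is made up for illustration, and nothing is sent anywhere.

```python
from urllib.parse import parse_qs

# Hypothetical hit payload. Every value below is invented for illustration.
#   tid = property ID, cid = client ID, t = hit type,
#   dh  = document hostname, dp = document path, dr = document referrer
payload = (
    "v=1&tid=UA-XXXXX-Y&cid=555&t=pageview"
    "&dh=apple.com&dp=%2F&dr=http%3A%2F%2Fspam-referrer.example%2F"
)


def describe_hit(raw: str) -> dict:
    """Decode a URL-encoded hit payload into a flat parameter -> value dict."""
    return {key: values[0] for key, values in parse_qs(raw).items()}


hit = describe_hit(payload)

# Both the hostname and the referrer are plain strings chosen by the sender.
print("hostname claimed by the hit:", hit["dh"])
print("referrer claimed by the hit:", hit["dr"])
```

Because the hostname and referrer are just fields in the request, the hit can claim any values the sender likes without anyone loading a page on your site.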

Negative Implications

“A referrer is a simple HTTP header that's passed along when a browser goes from one page to another page, normally used to indicate where a user's coming from. But users can change it, and some people will set referrer at pages they want to promote and visit tons of people around the web -- people see it and say 'Oh, I should check it out'. It's not necessarily a link… there are some people who try to drive traffic by visiting a ton of websites with an automated script and setting the referrer to be the URL they want to promote... there's no 'authentication'… You can’t automatically assume that it was the owner of the URL if you see something showing up in your dashboard. Somebody is trying to do some hijinx.”

- Matt Cutts, Head of Google Webspam Team



So, why is referral spam so bad? For one, it’s screwing up my web analytics data. “Sessions” entering via referral spam skew the data, clouding engagement metrics and inflating traffic volume. Unfortunately, those unaware of spam issues may base decisions on inaccurate data, and the distortion is largest for sites with low traffic.
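A quick back-of-the-envelope calculation shows how much a modest amount of spam can move a metric on a small site. All of the session counts below are hypothetical, and the sketch assumes every spam session is recorded as a bounce, in line with the 100% bounce rate described earlier.

```python
def bounce_rate(bounces: int, sessions: int) -> float:
    """Share of sessions that bounced, as a fraction between 0 and 1."""
    return bounces / sessions


# Hypothetical monthly numbers for a low-traffic site.
real_sessions = 200
real_bounces = 80

# Hypothetical spam volume; each spam session is assumed to be a bounce.
spam_sessions = 100

clean = bounce_rate(real_bounces, real_sessions)
reported = bounce_rate(real_bounces + spam_sessions,
                       real_sessions + spam_sessions)

print(f"bounce rate without spam: {clean:.0%}")
print(f"bounce rate as reported:  {reported:.0%}")
print(f"sessions inflated by:     {spam_sessions / real_sessions:.0%}")
```

With these made-up numbers, the same 100 spam sessions added to a site with 20,000 real sessions would barely register, which is why low-traffic properties are hit hardest.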

Moreover, referral spam makes SEO more difficult for everyone. One aim of referral spam is to have links from sites that publish their access logs. Some websites publish web analytics data publicly, which can include hyperlinks back to the spammer’s designated URL. These backlinks can improve search engine results for that URL since many websites publishing referrer data are presumably trustworthy.

There are also more nefarious opportunities available to referral spammers. If a spammer wanted to send a website unwanted and unqualified traffic, they could simply set the referrer URL to the victim’s URL. As mentioned in the above quote from Matt Cutts, referral spam can’t truly be “authenticated” or traced back to a specific source. With this in mind, referral spam could be used to harm reputations, possibly framing an innocuous website as a spam referrer.

Exposure to malware is another potential threat to anyone curious enough to visit referral spam addresses. With the rise of electronic data theft, it would be simple for referrer spam networks to point to URLs containing malicious software aimed at stealing valuable information.

Finally, no one wants to be advertised to while looking at web analytics acquisition reports.

Solutions

Within Google Analytics, there are multiple options to remove referral spam:

Exclude Foreign Hostnames and Filter Spammy Crawlers

One defining attribute of many ghost referrals is an inaccurate hostname. When reviewing referral data in Google Analytics, the hostname for these sessions will be completely unrelated to your website (e.g., “apple.com”). With this knowledge, it’s relatively simple to create a filter that only includes data with an accurate hostname. For Google Analytics users with only one or a handful of domains, this solution may be the simplest (check here for a quick refresher on regular expressions in GA):



In most cases, substituting your top domain name for example.com will be sufficient. For multiple domains, check your regular expressions with Regex Pal. This filter will also address the recent uptick in direct traffic with a hostname of "(not set)".
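Before saving an include filter, it can help to try the hostname pattern against a few values offline. The sketch below uses Python's re module as a stand-in for the GA filter engine. The pattern is an assumption (the original filter screenshot is not reproduced here) and is deliberately a bit stricter than a bare example.com so that look-alike hostnames do not slip through.

```python
import re

# Assumed include pattern: matches example.com and any of its subdomains.
# Swap in your own domain; this is not the exact pattern from the post.
VALID_HOSTNAME = re.compile(r"(^|\.)example\.com$")


def keep_hit(hostname: str) -> bool:
    """Return True if an include filter with this pattern would keep the hit."""
    # search() is used to mimic partial matching: the pattern may match
    # anywhere in the field unless it is explicitly anchored.
    return bool(VALID_HOSTNAME.search(hostname))


# Hypothetical hostname values as they might appear in a report.
hostnames = [
    "example.com",
    "www.example.com",
    "apple.com",
    "(not set)",
    "notexample.com",
]

kept = [h for h in hostnames if keep_hit(h)]
print(kept)
```

The anchors matter: an unanchored example\.com would also keep a hostname such as notexample.com.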

That first filter will remove ghost referral traffic that reports a foreign hostname. However, an additional filter is also required to remove spammy web crawlers (like Semalt), since they actually visit the site and report an accurate hostname. A solution to remove the two most popular web crawler offenders can be seen below using an Exclude Campaign Source filter:

Featured Regular Expression:

.*(semalt(media)?|buttons\-for\-website)\.com.*

Note: You should always retain an unfiltered view, as data processed by GA filters cannot be reverted.
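As a sanity check, the featured expression can be run against a handful of source values before it goes into a filter. This sketch again uses Python's re module to approximate the GA behavior, and the list of sources is hypothetical.

```python
import re

# The featured expression from this post, unchanged.
CRAWLER_SOURCES = re.compile(r".*(semalt(media)?|buttons\-for\-website)\.com.*")


def is_excluded(source: str) -> bool:
    """Return True if an exclude filter with this pattern would drop the source."""
    return bool(CRAWLER_SOURCES.search(source))


# Hypothetical campaign source values.
sources = [
    "semalt.semalt.com",
    "semaltmedia.com",
    "buttons-for-website.com",
    "google",
    "news.ycombinator.com",
]

remaining = [s for s in sources if not is_excluded(s)]
print(remaining)
```

Running the same check against the unfiltered view's real source list is a cheap way to confirm that no legitimate referrer gets caught.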

Filter All Referral Spam Sources

In cases where domains in a measured view can easily change, blocking referral spam may require a more exhaustive referral filter encompassing all offending referral sites. Over the past few months, I’ve created a list of offending sites and updated the filter accordingly, as seen below. As a quick caveat, while this list targets many of the offending referral spam sources, it’s by no means an exhaustive list.

With the discovery of more spam referrals, I've updated the regular expressions below the image, and this solution will now require two Exclude Campaign Source filters.

In prior versions of this blog post, an Exclude Referral filter was recommended, but it has since been updated to reflect a more appropriate filter, an Exclude Campaign Source filter. S/o to Jordan Strauss for pointing out the issue.

Featured Regular Expressions:

.*((darodar|priceg|buttons\-for(\-your)?\-website|makemoneyonline|blackhatworth|hulfingtonpost|o\-o\-6\-o\-o|(social|(simple|free|floating)\-share)\-buttons)\.com|econom\.co|ilovevitaly(\.co(m)?)|(ilovevitaly(\.ru))|(humanorightswatch|guardlink)\.org).*

Update #1 - I've added another regular expression since the first one has reached the 255 character limit.

.*((best(websitesawards|\-seo\-(solution|offer))|get\-free(\-social)?\-traffic(\-now)?|googlsucks)\.com|(domination|torture)\.ml|((rapidgator\-)?(general)?porn(hub(\-)?forum)?|4webmasters)\.(ga|tk|org|uni)|(buy\-cheap\-online)\.info).*

Update #2 - Yet another regular expression to include.

.*((event\-tracking|semalt(media)?|(100dollars|success)\-seo|chinese\-amezon|e\-buyeasy|rankings\-analytics|rednise|video\-\-production|theguardlan|webmaster\-traffic)\.com|traffic(monetize(r)?|2money)\.(org|com)|pops\.foundation|erot\.co).*

Update #3 - Getting pretty tired of having to add new regular expressions.

.*(((free\-)?(floating|get\-your\-social)\-(share\-)?buttons|hosting\-tracker|alibestsale)\.(com|info)|(justprofit|best\-seo\-software)\.xyz|snip\.to|adf\.ly|copyrightclaims\.org|(black\-friday|cyber\-monday)\.ga).*

Update #4 - More regular expressions.

.*((monitoring(-your)?-success|uptime|free-video-tool|hdmoviecams)\.com|(monetizationking|popads)\.net|rank-checker\.online|(marketland|dominateforex)\.ml|(ownshop|topquality|easycommerce)\.cf|increasewwwtraffic\.info|(unpredictable|getlamborghini)\.ga).*

Update #5 - Additional .xyz & .co spam.

.*((eu-cookie-law-enforcement|social-traffic).*\.xyz|teedle\.co).*
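Maintaining these expressions by hand gets tedious, so one option is to keep a plain list of spam domains and generate the patterns from it. The sketch below packs domains greedily into alternations that stay under the 255-character limit mentioned above. The function names and the sample domain list are my own, and the output is simpler (and less compact) than the hand-tuned expressions in this post.

```python
import re

MAX_PATTERN_LENGTH = 255  # filter pattern limit mentioned in this post


def wrap(parts):
    """Join escaped domains into one pattern shaped like those above."""
    return ".*(" + "|".join(parts) + ").*"


def build_patterns(domains, limit=MAX_PATTERN_LENGTH):
    """Greedily pack domains into as few patterns as fit under `limit`.

    A single domain whose wrapped pattern alone exceeds `limit` is still
    emitted on its own; this sketch does not try to handle that case.
    """
    patterns, current = [], []
    for domain in sorted(set(domains)):
        escaped = re.escape(domain)
        if current and len(wrap(current + [escaped])) > limit:
            # Adding this domain would overflow: close the current pattern.
            patterns.append(wrap(current))
            current = [escaped]
        else:
            current.append(escaped)
    if current:
        patterns.append(wrap(current))
    return patterns


# A few domains taken from the expressions above, used as sample input.
spam_domains = ["darodar.com", "priceg.com", "econom.co", "ilovevitaly.com",
                "semalt.com", "teedle.co", "snip.to", "adf.ly"]

for pattern in build_patterns(spam_domains):
    print(len(pattern), pattern)
```

Each generated pattern would go into its own Exclude Campaign Source filter, the same way the updates above are split across several filters.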

Advanced Segments for Historical Data

Since filters only process data moving forward, use advanced segments to review historical data from before filters were implemented. Similar to the above solutions, decide which approach is most appropriate for your site and use regular expressions to remove sessions from referral spam, as seen below:



Featured Regular Expressions: the segment uses the same six expressions listed above under Filter All Referral Spam Sources (the original expression plus Updates #1 through #5).

Note: Advanced Segments can be applied retroactively to historical data, while Filters only process data moving forward. If you’re unfamiliar with segments and filters, a quick comparison of the two can be found here.
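The same idea works outside the GA interface if historical data has already been exported. The sketch below assumes a hypothetical export with one row per source and a session count, and uses only two of the expressions from this post to keep the example short. It drops the rows that match and recomputes the session total.

```python
import re

# Two of the expressions from this post; a real cleanup would use all of them.
SPAM_PATTERNS = [
    re.compile(r".*(semalt(media)?|buttons\-for\-website)\.com.*"),
    re.compile(r".*((eu-cookie-law-enforcement|social-traffic).*\.xyz|teedle\.co).*"),
]


def is_spam(source: str) -> bool:
    """Return True if any spam pattern matches the source."""
    return any(pattern.search(source) for pattern in SPAM_PATTERNS)


# Hypothetical exported rows: one per source, with a session count.
rows = [
    {"source": "google", "sessions": 120},
    {"source": "semalt.semalt.com", "sessions": 45},
    {"source": "eu-cookie-law-enforcement-1.xyz", "sessions": 30},
    {"source": "partner-blog.example", "sessions": 12},
]

clean_rows = [row for row in rows if not is_spam(row["source"])]
clean_total = sum(row["sessions"] for row in clean_rows)
raw_total = sum(row["sessions"] for row in rows)

print("sessions before cleanup:", raw_total)
print("sessions after cleanup: ", clean_total)
```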