I have been researching various aspects of the online advertisement industry for a while, and one of the fascinating topics I have come across, which I didn't know much about before, is ad fraud. You may have heard that this is a huge problem, as the topic hits the news often. After learning more about it, I think of it as one of the major threats to the health of the Web, so it's important for us to be more familiar with the problem.

People have done a lot of research on the topic, but most of the material uses the jargon of the ad industry, so it may be inaccessible to those who aren't familiar with it (I'm learning my way through it myself!). You would also need to study a lot to put together a broad picture of what's wrong. So I decided to summarize what I have learned so far, in simple terms and avoiding jargon, in the hope that it's helpful. Needless to say, none of this should be taken as official Mozilla policy; rather, this is a hopefully objective summary, followed at the end by some of my opinions after doing this research.

How ad fraud works

Fraudsters have always existed in all walks of life, looking for easy ways of making money. Online ad fraud provides an appealing avenue for fraudsters for two reasons. One is that once they have a working system capable of generating revenue, they can easily scale it up with almost no extra effort, which gives them the ability to generate a lot of revenue. And we're talking a lot here. To give you a sense of the scale, the infamous Methbot operation, which has been well documented, was at one point generating $3-5 million USD per day. The other reason is that there is relatively low risk associated with online ad fraud: depending on the jurisdiction, it falls into a legal gray area, and it doesn't involve physical risk, as opposed to many other types of fraudulent activities.

Ad fraud has been made possible through abusing the quality metrics the ad industry uses to assess the effectiveness of marketing campaigns. For example, historically metrics such as time spent on page, or how often people clicked on an ad (click-through rate), were used, and these were trivial to game programmatically. Even when more sophisticated metrics were employed, such as the percentage of customers achieving a specific marketing goal like buying something or signing up for a newsletter, they were implemented through mechanisms such as invisible tracking pixels (1x1 invisible GIFs sending some tracking cookies to the server), which again are trivial to game. In practice these metrics are gamed so much that high rates on them are more associated with bot traffic than with actual human customers!
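As a rough illustration of the metrics mentioned above, here is a minimal Python sketch that computes a click-through rate and a conversion rate from aggregate counts and flags a source whose click-through rate looks implausibly high. The `TrafficSource` record, the sample numbers, and the 5% threshold are all assumptions made up for this example, and defining the conversion rate as conversions per click is just one possible choice.

```python
from dataclasses import dataclass


@dataclass
class TrafficSource:
    """Aggregate counts for one source of ad traffic (hypothetical record)."""
    name: str
    impressions: int   # times the ad was displayed
    clicks: int        # times the ad was clicked
    conversions: int   # times the marketing goal was reached (e.g. a purchase)


def click_through_rate(source: TrafficSource) -> float:
    """Clicks divided by impressions; 0.0 when nothing was displayed."""
    if source.impressions == 0:
        return 0.0
    return source.clicks / source.impressions


def conversion_rate(source: TrafficSource) -> float:
    """Conversions divided by clicks; 0.0 when there were no clicks."""
    if source.clicks == 0:
        return 0.0
    return source.conversions / source.clicks


def looks_suspicious(source: TrafficSource, max_ctr: float = 0.05) -> bool:
    """Flag sources whose click-through rate exceeds an assumed ceiling.

    The 5% default is an arbitrary value chosen for illustration only.
    """
    return click_through_rate(source) > max_ctr


# Made-up numbers: ordinary traffic versus traffic that clicks far too eagerly.
organic = TrafficSource("organic", impressions=10_000, clicks=40, conversions=2)
purchased = TrafficSource("purchased", impressions=10_000, clicks=1_500, conversions=300)

for src in (organic, purchased):
    print(src.name, click_through_rate(src), conversion_rate(src), looks_suspicious(src))
```

The point of the sketch is that nothing in these counts says who produced them: a script that clicks often enough moves the numbers just as well as a person does.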

A typical ad fraud scenario today works by automating the process of generating traffic designed to game one of these metrics, and running that automation on bots across a botnet. These bots attempt to act like a human to avoid being detected as a bot (and being block-listed or punished by ad networks). They also usually aren't simple scripts. They are usually full browser environments, which are controlled either from the outside (e.g., by sending the browser mouse/keyboard events, or through embedding APIs) or even by modifying an open source browser! This allows the bot to perform actions on the page, such as adding items to a shopping cart or clicking on an ad.

It's worth explaining how these botnets are typically run. Botnets usually consist of many hijacked computers connected to the Internet around the world, typically taken over by malware. In fact, a large part of the malware distributed on the Internet exists to deliver ad fraud. Hijacking the computer gives the fraudster access to the unique IPs of real users, which helps the bot masquerade as a real human. Malware is usually installed in one of three ways: through Flash vulnerabilities, browser exploits, or social engineering. Thankfully Flash is on its path to demise. Browser exploits are a continued challenge which we can directly impact. Social engineering works by tricking the user into downloading software, e.g. games, pirated Photoshop copies, etc. It's important to note that there is no absolute path toward closing all the loopholes through which people's machines get infected by bots.

Botnets can perform things other than ad fraud, such as distributed denial of service (DDoS) attacks, online banking fraud, stealing credit card information, sending spam, etc. But let's only focus on ad fraud. Typically an end-to-end pipeline for ad fraud looks like this:

User's machine gets infected by malware and bot engine gets installed

Bots are instructed to visit high quality sites to pick up the desired tracking cookies (payout opportunity #1)

Bots are then instructed to visit a fake site set up by the botnet operator to display ads (payout opportunity #2)

The first payout opportunity for the botnet operator is selling bot traffic to website operators. When website operators are looking for ways to increase traffic to their site, a lot of them resort to purchasing traffic. Unfortunately, a lot of the purchased traffic sources available are either partly or completely bot traffic that comes from ad fraud bots. (In some cases the sites end up purchasing this bot traffic unknowingly.) The second payout opportunity for botnet operators is when their bots achieve the goal of the marketing metric they're gaming (e.g., displaying an ad, or clicking on it).

One way to think of ad fraud is as finding ad models where a user is tracked from point A to point B, and where some action at point B achieves a payout (such as displaying an ad on a website, otherwise known as an ad impression), then automating this process and scaling it up across a botnet. A botnet is typically a network of machines compromised through malware; these could be anyone's computer at home or at work. There are also botnets that run in data centers, which is the preferred method as long as the bot doesn't get detected there through simple checks such as IP address range checks.
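The "simple checks such as IP address range checks" mentioned above could look roughly like the following Python sketch, which uses the standard `ipaddress` module. The ranges listed are placeholders taken from the address blocks reserved for documentation, not real data center ranges, and the `is_data_center_ip` helper is a hypothetical name; a real check would presumably load the published ranges of hosting providers.

```python
import ipaddress

# Placeholder ranges from the blocks reserved for documentation (RFC 5737).
# These are assumptions standing in for real hosting-provider ranges.
DATA_CENTER_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_data_center_ip(address: str) -> bool:
    """Return True if the address falls inside any of the listed ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in DATA_CENTER_RANGES)


print(is_data_center_ip("192.0.2.17"))   # inside the first placeholder range
print(is_data_center_ip("203.0.113.5"))  # outside both placeholder ranges
```

This also shows why hijacked home and office machines are so valuable to a fraudster: their addresses belong to ordinary users, so a check like this has nothing to match against.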

A popular example is targeting ad retargeting campaigns, where a business buys ads from an ad network for products that customers have tried to buy on online stores. The way this works is that the bot pretends to be a customer by visiting online store websites, searching for products, and placing items into the shopping cart, then going to fake sites that have been specifically set up to serve ads from the same ad network and clicking on the retargeting campaign ads that the business has bought. There is a detailed explanation of this setup with graphics available here which I recommend checking out.

Of course there are other types of ad fraud that don't target tracking based models of advertisement. Examples include:

Ad stacking: the practice of loading several ads on top of each other so that only one of them is visible to the user but the fraudster gets paid for displaying all of them

Pixel stuffing: the practice of loading one web page as a 1x1 pixel iframe in another web page. Typically the embedder web page is a shady website which is embedding the fraudster's high quality website to drive up the ad revenue from the ads displayed there.

Domain spoofing: some online advertising involves an auction phase before displaying an ad, and during this phase the fraudster can use the domain name of a high quality site to bid for ads and then display them on a shady site

Location fraud: spoofing the real user's location to trick marketing campaigns specific to geographic locations

There are other ad fraud methods and fraudsters are continually coming up with newer ways of defrauding the online advertisers.
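Ad stacking and pixel stuffing are both visible in the geometry of the rendered page, which the following Python sketch tries to illustrate. It assumes we already have the position and size of each ad frame; the `AdSlot` record, the 90% overlap threshold, and the sample layout are hypothetical and only meant to show the idea.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AdSlot:
    """Position and size of a rendered ad frame in pixels (hypothetical record)."""
    slot_id: str
    x: int
    y: int
    width: int
    height: int


def is_pixel_stuffed(slot: AdSlot, min_side: int = 2) -> bool:
    """A frame narrower or shorter than min_side pixels cannot show a real ad."""
    return slot.width < min_side or slot.height < min_side


def overlap_fraction(a: AdSlot, b: AdSlot) -> float:
    """Fraction of slot a's area that is covered by slot b."""
    overlap_w = min(a.x + a.width, b.x + b.width) - max(a.x, b.x)
    overlap_h = min(a.y + a.height, b.y + b.height) - max(a.y, b.y)
    if overlap_w <= 0 or overlap_h <= 0:
        return 0.0
    return (overlap_w * overlap_h) / (a.width * a.height)


def find_stacked(slots: List[AdSlot], threshold: float = 0.9) -> List[Tuple[str, str]]:
    """Return pairs of slots that almost entirely cover each other."""
    pairs = []
    for i, a in enumerate(slots):
        for b in slots[i + 1:]:
            if overlap_fraction(a, b) >= threshold and overlap_fraction(b, a) >= threshold:
                pairs.append((a.slot_id, b.slot_id))
    return pairs


# Made-up layout: two ads at the same position, one normal ad, one 1x1 frame.
slots = [
    AdSlot("top", 0, 0, 300, 250),
    AdSlot("hidden", 0, 0, 300, 250),
    AdSlot("sidebar", 400, 0, 160, 600),
    AdSlot("stuffed", 0, 300, 1, 1),
]

print(find_stacked(slots))
print([s.slot_id for s in slots if is_pixel_stuffed(s)])
```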

How big of a problem is online ad fraud?

A lot of research has been done to try to estimate the total size of the online ad fraud revenue. This is interesting to know for some advertisers, since money spent on bots viewing and clicking on ads is money spent on ineffective advertisement. Typically this research is performed by measuring the size of the fraud in one specific part of the ad industry and then extrapolating from that. On that basis, the latest estimates for last year (2017) have been raised to around $16.4 billion. To give you a sense of the scale of this number, the IAB estimated the revenue of Internet advertising in the US in the first half of 2017 to be $40.1 billion. This is also a growing problem, and the more recent growth has been seen in mobile, using technologies such as Android test automation software to spawn botnets running on thousands of virtual devices inside emulators.
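To put the two numbers above side by side, here is a back-of-the-envelope calculation in Python. It rests on two assumptions that are not in the sources: that US revenue in the second half of 2017 matched the first half, and that comparing the fraud estimate to US revenue is meaningful at all (the two figures may not cover the same market). Treat the result as a sense of scale rather than a measured fraud rate.

```python
def fraud_share(fraud_estimate: float, half_year_revenue: float) -> float:
    """Fraud estimate divided by revenue annualized from one half-year figure.

    Doubling the half-year revenue is an assumption made for illustration.
    """
    annualized_revenue = half_year_revenue * 2
    return fraud_estimate / annualized_revenue


# Figures quoted in the text, in billions of USD.
FRAUD_ESTIMATE_2017 = 16.4
US_REVENUE_H1_2017 = 40.1

share = fraud_share(FRAUD_ESTIMATE_2017, US_REVENUE_H1_2017)
print(f"{share:.1%}")  # prints 20.4%
```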

Furthermore, as explained above, the characteristics of botnets mean that ad fraud impacts more than the ad industry. This problem impacts consumer device security, as it incentivizes malware authors to target normal users in order to hijack their machines, and it is also harmful to the performance of web pages (see tricks like ad stacking or pixel stuffing, which incur extra needless load on web pages).

What can we do about online ad fraud?

If you have read this far through the post, you should probably be asking yourself, what can be done about online ad fraud, if anything? And what if anything can a web browser do to help with this problem?

A few years ago, the ad industry started to wake up to the existence of this massive issue and began deploying countermeasures against the different common fraud types that exist. One common technique among almost all the deployed fraud detection mechanisms is trying to distinguish human traffic from bot traffic. There are a variety of approaches for this. The simplest ones only look at the trail left by the traffic at the network level, such as by analyzing HTTP or TCP/IP traffic logs. This has of course proven insufficient as bots have become more advanced, so fraud detection technologies have moved to running their diagnostic code in JavaScript, as part of the code responsible for serving advertisements on web pages. Such code looks at many different data sources. For example, it can inspect what the browser exposes to the programmatic environment the JavaScript code is running in, to detect whether the code is running in a real browser or in a modified browser used for a bot. It can also do more advanced analyses, such as listening for mouse movements on the page and analyzing the coordinates the mouse has passed through to see whether they follow typical bot-generated patterns (bots are very good at moving the mouse in precise straight lines, humans not so much!). Even more sophisticated approaches use various anomaly detection algorithms to try to find some bit of information in the traffic that is “unusual” and classify human vs. bot traffic based on that.
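The straight-line observation about mouse movements can be turned into a very small heuristic. The sketch below, in Python for brevity even though real detection code of this kind runs as JavaScript in the page, compares the straight-line distance between the first and last recorded points with the distance actually travelled. The function names, the 0.999 threshold, and the sample paths are assumptions for illustration; this is not how any particular fraud detection product works.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def path_straightness(points: List[Point]) -> float:
    """Endpoint distance divided by travelled distance, between 0.0 and 1.0.

    A value very close to 1.0 means the pointer moved in an almost
    perfectly straight line. Returns 0.0 when there is no movement.
    """
    if len(points) < 2:
        return 0.0
    travelled = sum(math.dist(p, q) for p, q in zip(points, points[1:]))
    if travelled == 0:
        return 0.0
    return math.dist(points[0], points[-1]) / travelled


def looks_bot_like(points: List[Point], threshold: float = 0.999) -> bool:
    """Flag paths that are straighter than an assumed human limit."""
    return path_straightness(points) >= threshold


# Made-up samples: a perfectly linear path and a slightly wobbly one.
straight = [(0, 0), (10, 10), (20, 20), (30, 30)]
wobbly = [(0, 0), (8, 13), (19, 17), (30, 30)]

print(looks_bot_like(straight), looks_bot_like(wobbly))
```

A heuristic this simple is also easy to defeat (a bot only needs to add some jitter to its path), which is part of why the detection techniques keep getting more elaborate.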

The more you read about ad fraud detection and prevention technologies, the more interesting and advanced techniques you'll find that are being deployed against these bots all the time. That may bring up the following question in your mind: with all this great anti-bot technology, why do we still see so much bot generated online ad fraud, and why is that an increasing and not a decreasing trend? The reasons are… depressingly simple:

Such technology is deployed too late: a bot developer finds a new technique that no fraud detection software catches, and in a matter of weeks to a couple of months they could have made hundreds of millions of dollars with it. Once fraud detection software catches up, they move on to the next technique.

Such technology is deployed in the wrong places. The advertising ecosystem is massive, with many countries and many companies involved in it, and not all these actors are using the same anti-fraud technology. A bot that is detected and blocked in one place of the ecosystem may work well elsewhere.

Such is the nature of all cat and mouse games like this. Once you catch the bad actors in one place, they just move on to find the next weak link in your chain and carry on there with what they were doing before. A game of whack-a-mole without an end in sight.

There are some types of online ad fraud that the browser, through different ways of bending the rules of the Web Platform, could potentially null out. For example, cheap tricks like ad stacking and pixel stuffing are at least in theory within the realm of control of web browsers. But again, we are playing a game of whack-a-mole. If browsers only move on closing those vectors, the fraudsters will move to other existing possibilities of committing online ad fraud, since the other doors would be left wide open for them.

Can the rules of the game be changed?

Without being able to detect the bots at the right places at the right times, it seems pretty hopeless to try to address this issue in the long run. But before giving up all hope and declaring defeat, let's look at the bot-generated online ad fraud issue again and this time break the problem down to its fundamental building blocks:

Advertising networks track a user's online browsing history to be able to show them, on a less expensive website, an ad that would otherwise be served on a more expensive website.

This tracking is typically done by setting a third-party cookie and saving it on their computer, or computing a fingerprint of their browser and saving it on a remote server (typically also tied to a computer).

This setup is used to represent human users by advertising networks.

Since there is nothing here that actually ties any of this to a real human, specialized software (bots) can simulate this all outside of the normal web browser, get ads served to them and make money based on that.
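To see why nothing in this setup is tied to a real human, consider a heavily simplified fingerprint: a hash over a handful of attributes the browser reports. The Python sketch below is an assumption-laden illustration (the attribute names, their values, and the choice of SHA-256 over a JSON encoding are all made up for the example, and real fingerprinting uses far more signals), but it shows the core problem: any software that reports the same attributes gets the same identifier.

```python
import hashlib
import json
from typing import Dict


def browser_fingerprint(attributes: Dict[str, str]) -> str:
    """Hash a set of reported browser attributes into a stable identifier.

    Keys are sorted so that the same attributes always give the same hash.
    """
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical attributes reported by a person's browser.
human_browser = {
    "user_agent": "ExampleBrowser/1.0",
    "screen": "1920x1080",
    "timezone": "UTC+1",
    "language": "en-US",
}

# A bot only has to report the same values to be treated as the same "user".
bot_browser = dict(human_browser)

print(browser_fingerprint(human_browser) == browser_fingerprint(bot_browser))
```

The same holds for a third-party cookie: it identifies whatever software holds the cookie value, not the person sitting in front of it.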

This is the how, but it's also important to remember why the fraudsters do this: they want an easy way to make money.

Note the combination of the why and the how: we have a situation where a group of people (fraudsters) are incentivized for financial gain to leverage a huge design flaw (usage of cookies to represent a token tied to a real human) for gaming how the ad industry is set up to serve advertisements.

But what if we lived in a world where we used a different model of online advertisement, such as a signal-based advertising model, where the value proposition of advertising comes from the advertiser communicating their commitment to the product for a long time by spending money advertising it? (This is a nice post contrasting this model of advertisement with the tracking-based model.) That would take away the incentive for the fraudsters to continue to develop new fraud bot technology, by making it financially worthless to view ads or click on them using a bot. The reason is that in such a world, ads that show up on high quality sites would be more expensive, and ads that show up on low-quality sites that only the bots visit would in fact not be something that anyone would pay for! So even if some fraudster spent the time and money required to develop a new one of these bots in such a world, they wouldn't be able to make any money from it, and they would need to go find some other industry to defraud.

How can we make the online ad industry switch to a different advertisement model? Well, as mentioned before, the ad industry is an ecosystem with many players, but this is an opportunity. Gradually, consumers have demanded more control over their personal data online. We've seen this have some impact on the legal scene, with the European Union about to enforce the GDPR, and at the Web consumer level the market demand has turned into many privacy extensions and browser features. It's only reasonable to expect more in this space as long as there is clear consumer demand for more control over the sharing of private data. This means that gradually, it will become harder and harder for these advertising networks to continue with the current practices of constructing the user's browsing history on their servers and targeting them to serve ads. And the continued existence of online ad fraud means that advertisers who actually pay for online advertisement will continue to bleed marketing budgets to fraudulent bot traffic. As we expect these trends to continue on their current trajectories, perhaps some day soon marketers will start to put 2 and 2 together and incrementally switch to advertising models that are more compatible with consumer demands around the sharing of data, which also happen to be the right models if you're more interested in targeting humans than bots with your advertisement.

Key Take-Aways

This was a long article, and I do hope you've made it this far through. Here is a TL;DR section highlighting some of the key points discussed hopefully serving as a useful summary.