Do not protect your website from scraping (part 1, technology barriers)

Resistance is futile

In the last decade, I have done a lot of projects that involve content aggregation and analysis. Often content aggregation involves obtaining data from third-party websites, i.e. scraping. However, nowadays I avoid using this term as much as possible. There is a stigma attached to the word and a lot of misconceptions. The primary misconception is that web scraping can be blocked using X, Y, Z.

tl;dr: It cannot.

Note: This article refers to the GO2CINEMA business. The business has since pivoted and is now called Applaudience (https://applaudience.com/). We aggregate admissions data from all the major cinemas in the EU and forecast future audience behaviour.

Business people perspective

Last week I met with a high-profile executive from the industry in which I am developing my business, GO2CINEMA. No doubt he is one of the smartest and most knowledgeable people in the cinema industry.

GO2CINEMA’s business model relies on aggregating information from many different sources about showtimes, seat availability and ticket pricing, and even executing purchase requests on those websites on behalf of the user.

I asked this person for help with fundraising. He offered his help and asked me to prepare an analysis of all the ways my current business could be blocked, including content scraping (from the technology and legal perspective). I prepared the requested documents and shared them with him before our meeting. His feedback was along the lines of:

You have done thorough research. However, there are ways to block you. *smirk*

No, there aren’t.

Real users and bots do not differ

Non-technical people have this romanticised picture of programming, as if it were akin to 80s computer games: you put on a VR headset and immerse yourself in the Internet. In reality, all information and interactions are ones and zeroes. There is no human touch. There is no distinction between input generated by a computer and input generated by a human.

Let me keep this simple: as long as your customers can access the content on your website, the same content can be accessed by a bot. All technology solutions to deter scraping will encumber real users as much as they will affect the bots.

If you are not convinced, let’s go through all the technical ways a website could try to block a bot, using my business as an example of a bot operator.

Technical countermeasures

Despite how silly some of these concerns might sound to a technical person, these are all real concerns expressed by investors that I had to answer. Bear with me.

Blocking HTTP User Agent

Every HTTP request contains HTTP headers, including User Agent – identification of the HTTP client. Therefore, a cinema could identify bots using the information in the HTTP headers and block them.

Solution: fake HTTP headers to mimic real users.

Example: GO2CINEMA bots use HTTP headers that imitate a real user’s browsing session (e.g. using the Google Chrome browser). HTTP headers are randomized between scraping sessions.
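To illustrate the idea of per-session header randomization, here is a minimal Python sketch. It is not the GO2CINEMA implementation; the profile pool, the header values and the `headers_for_session` function are assumptions made up for this example.

```python
import random

# Hypothetical pool of browser profiles (an assumption for illustration).
# A real pool would be larger and kept in sync with current browser releases.
BROWSER_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/120.0.0.0 Safari/537.36",
        "Accept-Language": "en-GB,en;q=0.9",
    },
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/120.0.0.0 Safari/537.36",
        "Accept-Language": "en-GB,en-US;q=0.8,en;q=0.7",
    },
]

# Headers shared by every profile in this sketch.
COMMON_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
}


def headers_for_session(rng: random.Random) -> dict[str, str]:
    """Pick one browser profile; the caller reuses it for the whole session."""
    profile = rng.choice(BROWSER_PROFILES)
    return {**COMMON_HEADERS, **profile}


# One set of headers per scraping session, not per request.
session_headers = headers_for_session(random.Random())
```

The point of choosing once per session rather than once per request is consistency: a real browser does not change its identity halfway through a visit.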

Conclusion: It is impossible to block GO2CINEMA bots using client-sent HTTP meta-data such as HTTP headers without blocking real users.

Blocking IPs

A cinema could try to identify and block IPs used by GO2CINEMA bots.

Solution: “fake” IPs (use proxies).

Example:

GO2CINEMA uses a combination of request scheduling and IP rotation techniques to avoid creating identifiable bot-behaviour patterns. Some precautions include:

Randomizing IPs used to access the content.

Allocating IPs that are geographically as near as possible to the target venue.

Persisting the allocated IP for the duration of the scraping session.

Rotating the proxy pool (IPs) every 24 hours.
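The precautions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Proxy` and `ProxyAllocator` names, the `nearest_n` parameter and the proxy addresses (taken from the documentation-only IP ranges) are all hypothetical, and the 24-hour rotation is left to an external scheduler.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Proxy:
    address: str  # hypothetical proxy endpoint
    lat: float
    lon: float


def distance_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points (haversine formula)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


class ProxyAllocator:
    def __init__(self, pool, rng: random.Random, nearest_n: int = 2):
        self.pool = list(pool)
        self.rng = rng
        self.nearest_n = nearest_n
        self.sessions = {}  # session id -> Proxy

    def allocate(self, session_id: str, venue_lat: float, venue_lon: float) -> Proxy:
        # Persist the proxy for the duration of the scraping session.
        if session_id not in self.sessions:
            ranked = sorted(
                self.pool,
                key=lambda p: distance_km(p.lat, p.lon, venue_lat, venue_lon),
            )
            # Randomize among the few proxies nearest to the venue.
            self.sessions[session_id] = self.rng.choice(ranked[: self.nearest_n])
        return self.sessions[session_id]

    def rotate_pool(self, new_pool) -> None:
        """Swap in a fresh pool, e.g. from a job that runs every 24 hours."""
        self.pool = list(new_pool)
        self.sessions.clear()


pool = [
    Proxy("192.0.2.10:8080", 51.5074, -0.1278),    # London
    Proxy("198.51.100.20:8080", 53.4808, -2.2426),  # Manchester
    Proxy("203.0.113.30:8080", 55.9533, -3.1883),   # Edinburgh
]
allocator = ProxyAllocator(pool, random.Random())
proxy = allocator.allocate("session-a", 51.5074, -0.1278)  # a venue in London
```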

It is worth noting that the current setup suffers from one flaw: the IPs (proxies) used to make the requests are registered to various data centers (as opposed to residential IPs). In theory, a cinema could obtain a list of all subnets used by data centers in the UK and block these ranges. This would successfully block the current setup. However:

This has an associated cost. Example providers include MaxMind (Anonymous IP Database service; pricing not public) and Blocked (USD 12,000/year).

It would risk blocking real users.

https://www.netflix.com/ is an example of a service provider that blocks IPs known to be used as VPNs and proxies.
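From the website’s side, blocking data-center ranges boils down to a subnet membership check. A minimal sketch using Python’s standard `ipaddress` module follows; the two ranges are documentation-only placeholders standing in for a purchased list, and `is_data_center_ip` is a hypothetical helper name.

```python
import ipaddress

# Placeholder ranges (reserved for documentation) standing in for a
# commercial list of data-center subnets. These are assumptions, not real data.
DATA_CENTER_RANGES = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/25"),
]


def is_data_center_ip(ip: str) -> bool:
    """Return True if the address falls inside any listed subnet."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in DATA_CENTER_RANGES)
```

The check itself is trivial. The cost and the risk lie in the list: if it is incomplete, bots get through, and if it is too broad, real users on VPNs or corporate networks get blocked.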

If cinemas were to start blocking data-center IPs, the solution would be to start using residential IPs. As the name implies, residential proxies make it possible to send HTTP requests using IPs registered to private individuals. https://luminati.io/ is an example of a residential proxy service provider. There are two downsides to residential IPs:

Cost (our current bandwidth would cost GBP 1,000.00/month).

Reliability. Residential proxy behavior (speed) can be unpredictable.

Purportedly, some cinemas have already attempted to block GO2CINEMA bot IPs. Based on a confidential source, we know that cinema X thinks (or at least thought) that it has successfully blocked GO2CINEMA IPs. However, this is not true. GO2CINEMA bot activity has not been interrupted. It appears that cinema X has blocked someone else who has been collecting equivalent data.

It is important to emphasize that it is (theoretically) possible to distinguish which HTTP requests are created by a bot (as opposed to a human being) by analysing browsing patterns (see “Adding invisible CAPTCHA”). However, it would be extremely difficult to identify which HTTP requests are created specifically by GO2CINEMA bots (for reasons discussed in “Blocking HTTP User Agent”).

Conclusion: It would be extremely hard to block GO2CINEMA bots using an IP blacklist because (1) it is extremely hard to identify GO2CINEMA bot activity and (2) we have access to a large number of data-center and residential IPs.

Blocking IPs would not stop GO2CINEMA from screen scraping the cinema website.

User journey pattern analysis

The most advanced bot detection systems use machine learning to identify regular user journey patterns (navigation path, average time per interaction, etc). Such systems sample millions of interactions over an extended period of time and warn about the outliers. Chances are that you’ve encountered this type of bot protection yourself…

Google temporarily restricting an IP’s access to its search engine.

Google is at the forefront of the bot detection business. It must be: most of Google’s revenue comes from the ads business, and online ads have been the target of manipulation since the first day of online advertising (think website owners artificially increasing clickthrough numbers). As a result, Google has invested heavily in bot detection to protect its ads business. However, even Google cannot always get it right. If you work in a large office, chances are that you see the “We’re sorry… but your query looks similar to automated requests from a computer virus or spyware application” message at least a couple of times a day. This happens because hundreds of people using Google from the same office building create behaviour patterns that fall into the outlier group, i.e. Google cannot accurately distinguish between bots and real users (and neither can any of the commercial service providers that offer similar bot protection based on user journey patterns).

For the record, this does not mean that Google’s ad business is vulnerable to bots. Unlike website scraping, ad campaign performance data is available only in aggregate form and only after a certain amount of information has been aggregated, i.e. if you use a bot to automate interactions with an ad, there is no way to determine whether the bot is succeeding at what it is doing (as there is no real-time, non-aggregate feedback). This makes bot detection for the specific use case of ad fraud a lot simpler.
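The outlier-flagging idea described at the start of this section can be illustrated with a toy Python sketch. Real systems use far richer features and machine-learned models; here a single feature (the average gap between requests), a simple median-based score and the `threshold` value are all assumptions chosen for brevity.

```python
import statistics


def mean_gap(timestamps: list[float]) -> float:
    """Average seconds between consecutive requests (needs 2+ timestamps)."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return statistics.mean(gaps)


def flag_outliers(sessions: dict[str, list[float]], threshold: float = 3.0) -> set[str]:
    """Flag sessions whose pace deviates strongly from the typical session."""
    gaps = {sid: mean_gap(ts) for sid, ts in sessions.items()}
    median = statistics.median(gaps.values())
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = statistics.median(abs(g - median) for g in gaps.values())
    if mad == 0:
        return set()
    return {sid for sid, g in gaps.items() if abs(g - median) / mad > threshold}


# Request timestamps in seconds since the start of each session (made-up data).
sessions = {
    "a": [0, 8, 16],
    "b": [0, 10, 20],
    "c": [0, 12, 24],
    "d": [0, 9, 18],
    "e": [0, 11, 22],
    "bot": [0, 0.5, 1.0],
}
flagged = flag_outliers(sessions)
```

Any fixed threshold like this one trades missed bots against false positives, which is the same trade-off behind the office-building example above. A bot that paces its requests like a typical user would not be flagged by this check.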

Adding CAPTCHA

A cinema could add a CAPTCHA as a way of either restricting access to certain parts of the website (e.g. viewing auditorium utilisation) or limiting certain actions (e.g. completing payment transaction).

Solution: use APIs that solve CAPTCHA.

Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA).

Adding a CAPTCHA would only inconvenience the average real user. All CAPTCHA methods (including reCAPTCHA by Google) can easily be bypassed using external services such as https://2captcha.com/. These services work by tasking a real human being with solving the challenge presented to our bot. The cost of such services is minimal (e.g. GBP 2 per 1000 tasks).

Conclusion: Adding a CAPTCHA would not stop GO2CINEMA from screen scraping the cinema website.

Adding invisible CAPTCHA

Cinemas could deploy behaviour-based bot identification and blocking mechanisms (sometimes referred to as “invisible CAPTCHA”).

Invisible CAPTCHA is a technology that uses a combination of many different variables to assess the likelihood that interactions by a specific client are automated. There is no one recipe for how it should be implemented. Different service providers use different data points to construct the user profile. This service is provided by some CDNs (e.g. Cloudflare) and traditional CAPTCHA providers, such as Google reCAPTCHA.

However, according to former Google “click fraud czar” Shuman Ghosemajumder, this capability “creates a new sort of challenge that very advanced bots can still get around, but introduces a lot less friction to the legitimate human.”

Conclusion: Adding behaviour-based bot identification would not stop GO2CINEMA from screen scraping the cinema website; it would be just another challenge to work around.

Adding WAF

WAF (Web Application Firewall) is generally defined as a security enforcement policy positioned between a web application and a client. Some well known examples include Akamai, Cloudflare and Incapsula. However, there are hundreds, and they all have known fingerprints and workarounds. They all use a combination of security policies already mentioned in this article.

Do they work? No. For the same reason that other methods can be bypassed individually, any WAF can be bypassed too.

However, for what it is worth, you are likely to be better protected by the more obscure WAFs than by the well-known ones. This is simply because many people need to bypass the popular WAFs on a daily basis and document their experiences. Just Google “bypass cloudflare WAF” and hundreds of results will appear.

Adding email verification

A cinema could require that a valid email is presented to complete certain actions, e.g. making a reservation.

Solution: use “burner” email addresses.

Example:

Cinemas already require that an email address is provided as part of the cinema ticket purchase form. GO2CINEMA uses the “go2cinema.mail” domain name for the purpose of obtaining reservation confirmations. A new email address is generated for every transaction (e.g. “john1@go2cinema.mail”). Emails sent to the generated inbox are not accessible to the GO2CINEMA customer.
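Generating a fresh address per transaction takes very little code. The sketch below is an assumption about how it could be done, not the actual GO2CINEMA code: the `booking-` prefix and the `new_transaction_email` function are invented for this example, and only the domain comes from the article.

```python
import secrets

DOMAIN = "go2cinema.mail"  # the domain named in the article


def new_transaction_email(domain: str = DOMAIN) -> str:
    """Return a unique, hard-to-guess address for a single transaction."""
    # Hypothetical local-part format: a fixed prefix plus 16 random hex digits.
    local_part = "booking-" + secrets.token_hex(8)
    return f"{local_part}@{domain}"
```

A mail server configured as a catch-all for the domain would then receive the confirmation email for whichever address was generated.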

The current GO2CINEMA approach has the following benefits:

Limiting cinemas’ ability to track individual customer activity.

Preventing cinemas from sending marketing emails to GO2CINEMA customers.

However, the downside of the current approach is that it allows cinemas to easily identify and block transactions done by GO2CINEMA.

Should cinemas start to actively block “go2cinema.mail” as a valid email domain, we could do any of the following:

Buy thousands of cheap domains in bulk.

Use one of the existing services that provide temporary email addresses (e.g. https://www.mailinator.com/).

Abuse one of the existing major email providers to create temporary email inboxes (e.g. Yahoo, Gmail).

Expose real user email addresses.

Conclusion: Adding email verification would not stop GO2CINEMA from screen scraping the cinema website.

Adding mobile verification

A cinema could require the user to provide a valid mobile phone number in order to complete a purchase transaction.

Solution: use “burner” mobile numbers.

Assuming that the validation process involves a callback action (e.g. entering a code sent to your phone number), we could use one of the virtual phone providers (e.g. https://www.twilio.com/) to issue a temporary mobile phone number.

Unlike temporary email addresses, which are cheap, a virtual phone number is relatively expensive (e.g. 1 GBP per month, per phone number).

On the other hand, adding mobile verification as a requirement would negatively impact the cinema in multiple ways:

Cost of the SMS verification messages.

Loss of clients who do not have a mobile phone number or are not willing to share it.

Decrease in the booking conversion rate (in practice, this is a common observation, especially in e-commerce).

This would be a fairly extreme and unprecedented measure by a cinema.

Conclusion: Adding mobile verification would not stop GO2CINEMA from screen scraping the cinema website.

Blocking BIN

A cinema could block the Bank Identification Number (BIN).

Solution: sue, or use a common bank to issue cards (e.g. Barclays).

Example:

GO2CINEMA uses virtual debit cards to purchase tickets from the cinema. GO2CINEMA uses https://entropay.com/ to issue every customer a new virtual debit card (Mastercard). Entropay operates as a bank, i.e. all the cards issued by Entropay start with BIN 522093. In theory, a cinema could block this BIN.
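For context, a BIN block on the merchant side would amount to a prefix check like the Python sketch below. The six-digit BIN length follows the article, the card numbers in the usage lines are obviously fake, and the function names are hypothetical.

```python
# The BIN named in the article; a merchant's blocklist could hold several.
BLOCKED_BINS = {"522093"}


def bin_of(card_number: str) -> str:
    """Return the first six digits of a card number, ignoring separators."""
    digits = "".join(ch for ch in card_number if ch.isdigit())
    if len(digits) < 6:
        raise ValueError("card number is too short to contain a BIN")
    return digits[:6]


def is_blocked(card_number: str) -> bool:
    return bin_of(card_number) in BLOCKED_BINS


# Fake card numbers, for illustration only.
blocked = is_blocked("5220 9300 0000 0000")
allowed = not is_blocked("4000 0000 0000 0000")
```

Because every card from one issuer shares the prefix, such a check would also reject every other customer who happens to pay with a card from that issuer.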

In practice, however, a cinema cannot block this BIN without breaking its contract with the payment gateway. Every payment gateway contract includes an Honor All Cards clause. In the case of MasterCard, the Honor All Cards policy is listed in the MasterCard rules document:

5.10.1 Honor All Cards

A Merchant must honor all valid Cards without discrimination when properly presented for payment. A Merchant must maintain a policy that does not discriminate among customers seeking to make purchases with a Card.

This is the same technical reason for which cinemas were unable to block MoviePass in the US:

We comply fully with the rules of MasterCard and AMC has signed agreements with both their credit card processor and with MasterCard to comply with all the rules. They would essentially have to not take MasterCard in order to block us.

– http://uk.businessinsider.com/moviepass-amc-theatres-pushback-2017-8?r=US&IR=T

Note that the Honor All Cards clause in Europe differs from the one in the US. In Europe, a merchant is allowed to block an entire category of cards, e.g. all pre-paid cards.

Conclusion: Blocking BIN would not stop GO2CINEMA from screen scraping the cinema website.

Changing website structure

A cinema could change its website structure without warning GO2CINEMA, which would break our integration.

This suggestion makes a lot of assumptions about how we scrape the content. In most cases, modern scraping techniques do not rely on the website structure. More about this in a later article.

Assuming that our scraper does depend on the website’s structure:

Website changes can only happen so often.

This is little different from API changes.

Our systems would notify us as soon as this happens.

It would affect real users as much as bots.
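The “our systems would notify us” point can be sketched as a sanity check that runs after every scrape. This is an illustrative assumption rather than a description of the actual monitoring: the record fields, `validate_showtime`, `check_scrape` and the `notify` callback are all hypothetical.

```python
# Hypothetical record shape; the real fields depend on what is being scraped.
REQUIRED_FIELDS = {"title": str, "starts_at": str, "price_pence": int}


def validate_showtime(record: dict) -> list[str]:
    """Return a list of problems found in one scraped record."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"unexpected type for field: {field}")
    return problems


def check_scrape(records: list[dict], notify) -> bool:
    """Call notify(message) if the scrape looks broken; return True if healthy."""
    if not records:
        notify("scrape returned no showtimes")
        return False
    problems = [p for record in records for p in validate_showtime(record)]
    if problems:
        # Deduplicate so one structural change produces one short alert.
        notify("; ".join(sorted(set(problems))))
        return False
    return True
```

In use, `notify` could be anything that takes a message, such as a function that posts to a team chat or sends an email.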

Conclusion: Changing website structure is not an effective long term strategy to block web scrapers.

Protecting the API with an API key

Another businessman (a cinema owner) suggested that a cinema could block GO2CINEMA by restricting access to the API using an API key.

He: A cinema can simply update its API to require an API key.

Me: Which API?

He: The API used to access showtimes.

Me: Is this information published on the website?

He: Yes.

Me: Then the browser client needs to have access to this API to view the content on the website.

The specific website that he used as an example had the API key hard-coded in the source code.
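To show why a key shipped to the browser protects nothing, here is a small Python sketch that pulls a key out of page source. The HTML snippet, the `API_KEY` variable name and the key value are invented for this example; real pages embed keys in many different ways, so the pattern would need adjusting.

```python
import re

# Made-up page source. Anything the browser receives, a bot receives too.
PAGE_SOURCE = """
<script>
  const API_KEY = "demo-key-123";
  fetch("/api/showtimes?key=" + API_KEY);
</script>
"""


def find_api_keys(source: str) -> list[str]:
    """Return every value assigned to a variable named API_KEY."""
    return re.findall(r'API_KEY\s*=\s*"([^"]+)"', source)


keys = find_api_keys(PAGE_SOURCE)
```

Even if the key were obfuscated, it would still appear in the requests the browser makes, where it can be read with the browser’s own developer tools.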

Conclusion: Restricting access to an API using an API key is not an effective strategy to restrict scraping when it is a public API.

Summary

There is currently no known way of blocking a particular bot from accessing content on the cinema website using technological barriers. All of the listed mechanisms would only work as a deterrent.

It is worth emphasizing that while none of these methods can block bot access, adding some or all of these mechanisms would come at a high cost for the cinema in terms of (1) technical development cost and (2) a deteriorated real user experience.

Blocking 98% of the scrapers

While you cannot block all the scrapers, you can discourage or block most of the primitive scraping attempts by using a combination of the above methods.

Whether this is worth your effort depends on:

What impact do the scraper bots have on your website or business?

Will it impact real users?

More often than not, the answer is that it is not worth it.

Legal barriers

A website owner cannot block your bots using technology. However, can a website owner block you using legal means?

The short answer is: no (or at least extremely unlikely, difficult and would take many years).

In the next article, I will share a summary of over a dozen legal cases that I used to assess the legal climate in Europe for a business that relies on web scraping.

If you are just starting your journey into the legal aspects of scraping, research the ticket aggregators in the airline industry (e.g. Momondo, Skyscanner, Kayak). All of them use some element of scraping to get their data, sometimes going as far as representing a real user for the purpose of buying a ticket. Most of them have been involved in legal cases related to scraping, and all of the cases that we have found were ruled in favour of the scraper. If your business is US-focused, then there is an even stronger precedent: see hiQ vs LinkedIn.

Closing notes

It seems that a lot of people reading this article assume that we’ve been through a technology war with the cinemas (see Reddit comments). We didn’t.

With the exception of a second-hand account about cinema X attempting to block our IPs (mentioned in the “Blocking IPs” section), none of the cinemas attempted to block us using any of the above methods. The purpose of this article is to share a summary of what-if scenarios that we prepared as a fallback plan as part of raising VC money.

As for the cinemas, most of them are dinosaurs, still using fax machines for day-to-day communications and Excel spreadsheets for managing showtime schedules. They either cannot afford APIs or do not see the need for them, and they are fully aware of, and happy with, other businesses scraping showtimes from their websites.

If you are planning to scrape content from third-party websites, first check with the website owner whether there is an API. This will save both you and the website owner time and money.
