Source: Claire Anderson, unsplash.com

Web scraping is now legal

Here’s what that means for Data Scientists

This story was sponsored by https://www.corbettanalytics.com.

In late 2019, the US Court of Appeals for the Ninth Circuit denied LinkedIn’s request to prevent HiQ, an analytics company, from scraping its data.

The decision was a historic moment in the data privacy and data regulation era. It showed that any data that is publicly available and not copyrighted is fair game for web crawlers.

But commercial use of scraped data is still limited

The decision does not, however, grant HiQ or other web crawlers the freedom to use data obtained by scraping for unlimited commercial purposes.

For example, a web crawler would be allowed to search YouTube for video titles, but it could not re-post the YouTube videos on its own site, since the videos are copyrighted.

In general, the copyright for data, including data for media files like video or music, is still enforceable regardless of how the data was obtained.

Some forms of web scraping are also still illegal

The decision also does not grant web crawlers the freedom to obtain data from sites that require authentication.

For example, the ruling would not permit a web crawler that logged in to Facebook and downloaded user data.

The ruling excludes sites that require authentication because users must agree to the site’s Terms of Service before logging in. Those terms typically forbid activity like automated data collection.

But since publicly available sites cannot require a user to agree to any Terms of Service before accessing the data, users are free to use web crawlers to collect data from those sites.

Sites can still use techniques to limit web crawling

Although companies are less likely to find legal recourse against web crawlers today, they’re still free to limit web crawling in other ways.

For example, sites can use techniques like “rate-throttling” to prevent crawlers from downloading too many web pages at once. Sites can also still use technology like CAPTCHA to test whether a human or a web crawler is requesting the page.

Those techniques are typically used to stop malicious bots from overloading a site and causing it to crash. But they may become more common as a way to make automated scraping less cost-effective for web crawling companies.
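To make rate throttling more concrete, here is a minimal sketch of one common approach, a token bucket, written in Python. It is not taken from LinkedIn or any other site mentioned here. The class name `TokenBucket`, the `allow` method, and the `rate` and `capacity` parameters are all hypothetical names chosen for illustration. The idea is that each client gets a small budget of requests that refills slowly over time, so a crawler that requests pages too quickly runs out of budget.

```python
import time


class TokenBucket:
    """Illustrative per-client rate limiter (hypothetical, not any site's real code).

    Each client may burst up to `capacity` requests; the budget refills
    at `rate` tokens per second.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum tokens a client can hold
        self.clock = clock        # injectable clock, which makes testing easier
        self.buckets = {}         # client id -> (tokens, time of last request)

    def allow(self, client_id):
        """Return True if the request is within budget, False if throttled."""
        now = self.clock()
        # A client seen for the first time starts with a full bucket.
        tokens, last = self.buckets.get(client_id, (self.capacity, now))
        # Refill according to the time elapsed, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[client_id] = (tokens - 1, now)
            return True
        self.buckets[client_id] = (tokens, now)
        return False


if __name__ == "__main__":
    limiter = TokenBucket(rate=1.0, capacity=2)
    # Three requests in quick succession from the same (made-up) client id.
    for i in range(3):
        print(i, limiter.allow("crawler-1"))
```

In a real deployment, a site would likely key the bucket on something like an IP address or API key and respond with an error such as HTTP 429 (Too Many Requests) when `allow` returns False. Those details are assumptions here and vary from site to site.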

LinkedIn is likely to appeal the decision further

While the US Court of Appeals denied LinkedIn’s request, there is one last possibility for the company: appealing to the US Supreme Court.

The US Supreme Court has the power to overturn the Court of Appeals, and could undo the decision to legalize scraping publicly available, non-copyrighted data.

Not all decisions that are appealed to the Supreme Court are actually reviewed, however.

But it seems fair to say that there is a good chance the Supreme Court will review the decision in this case. Data policy and related privacy concerns are relatively new areas of law, and can have major commercial implications for companies like LinkedIn.