Table of Contents

Introducing web scraping

Some use cases of web scraping

How does it work?

Robots.txt

A simple example

Working with HTML

Data processing

Next steps

Introducing web scraping

Simply put, web scraping is one of the tools developers use to gather and analyze information from the Internet. Some websites and platforms offer application programming interfaces (APIs) that we can use to access information in a structured way, but others do not. While APIs are becoming the standard way of interacting with today’s popular platforms, we don’t have this luxury with most of the websites on the internet. Rather than reading data from standard API responses, we’ll need to find the data ourselves by reading the website’s pages and feeds.

Some use cases of web scraping

The World Wide Web was born in 1989, and web scraping and crawling entered the conversation not long after, in 1993. Before scraping, search engines were compiled lists of links collected by a website’s administrator and arranged into a long list somewhere on their website. The first web scraper and crawler, the World Wide Web Wanderer, was created to follow all these indexes and links to try to determine how big the internet was.

It wasn’t long after this that developers started using crawlers and scrapers to create crawler-based search engines that didn’t require human assistance. These crawlers would simply follow the links they came across on each page and save information about the page. Since the web is a collaborative effort, the crawler could easily follow embedded links on websites to other platforms, and the process would continue indefinitely.

Nowadays, web scraping has its place in nearly every industry. In newsrooms, web scrapers are used to pull in information and trends from thousands of different internet platforms in real time. Spending a little too much on Amazon this month? Websites exist that will let you know, and, in most cases, will do so by using web scraping to access that specific information on your behalf.
Machine learning and artificial intelligence companies are scraping billions of social media posts to better learn how we communicate online.

So how does it work?

The process a developer builds for web scraping looks a lot like the process a user takes with a browser:

1. A URL is given to the program.
2. The program downloads the response from the URL.
3. The program processes the downloaded file depending on the data required.
4. The program starts over with a new URL.

The nitty gritty comes in steps 3 and 4, in which data is processed and the program determines how to continue (or whether it should at all). For Google’s crawlers, step 3 likely includes collecting all URL links on the page so that the web scraper has a list of places to begin checking next. This is recursive by design and allows Google to efficiently follow paths and discover new content.

There are many heavily used, well-built libraries for reading and working with the downloaded HTML response. In the Ruby ecosystem, Nokogiri is the standard for parsing HTML. For Python, BeautifulSoup has been the standard for 15 years. These libraries provide simple ways for us to interact with the HTML from our own programs.

These code libraries accept the page source as text, along with a parser for handling the content of the text. They return helper functions and attributes which we can use to navigate through our HTML structure in predictable ways and find the values we’re looking to extract.

Scraping projects involve a good amount of time spent analyzing a website’s HTML for classes or identifiers, which we can use to find information on the page. Using the HTML below, we can begin to imagine a strategy to extract product information from the table using the HTML elements with the classes products and product.

<table class="products">
  <tr class="product">...</tr>
  <tr class="product">...</tr>
</table>
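One possible strategy for this markup, sketched with Python's BeautifulSoup library: find the table by its products class, then read each row carrying the product class. The cell contents in the sample string are invented for illustration, since the snippet above elides them.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <td> contents are made up for this sketch.
html = """
<table class="products">
  <tr class="product"><td>Shirt</td><td>$19.99</td></tr>
  <tr class="product"><td>Jacket</td><td>$124.99</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the table by class, then every product row inside it.
table = soup.find("table", class_="products")

products = []
for row in table.find_all("tr", class_="product"):
    # get_text(strip=True) returns the cell text without surrounding whitespace.
    cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
    products.append(cells)

print(products)
```

Selecting on classes like this only works while the site keeps those class names stable, which is why analyzing the markup first matters.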

In the wild, HTML isn’t always as pretty and predictable. Part of the web scraping process is learning about your data and where it lives on the pages as you go along. Some websites go to great lengths to prevent web scraping, some aren’t built with scraping in mind, and others just have complicated user interfaces which our crawlers will need to navigate through.

Robots.txt

While not an enforced standard, it’s been common since the early days of web scraping to check for the existence and contents of a robots.txt file on each site before scraping its content. This file can be used to define inclusion and exclusion rules that web scrapers and crawlers should follow while crawling the site. The file is always located at /robots.txt so that scrapers and crawlers can always look for it in the same spot. You can check out Facebook’s robots.txt file for a robust example; GitHub’s and Twitter’s are good examples as well.

An example robots.txt file that prohibits web scraping and crawling would look like the below:

User-agent: *

Disallow: /

The User-agent: * section applies to all web scrapers and crawlers. In Facebook’s file, we see that they set User-agent more explicitly and have sections for Googlebot, Applebot, and others.

The Disallow: / line informs web scrapers and crawlers who observe the robots.txt file that they aren’t permitted to visit any pages on this site. Conversely, if this line read Allow: /, web scrapers and crawlers would be allowed to visit any page on the website.

The robots.txt file can also be a good place to learn about the website’s architecture and structure. Reading where our scraping tools are allowed to go, and not allowed to go, can inform us about sections of the website we perhaps didn’t know existed, or may not have thought to look at.

If you’re running a website or platform, it’s important to know that this file isn’t always respected by every web crawler and scraper. Larger properties like Google, Facebook, and Twitter respect these guidelines with their crawlers and information scrapers, but since robots.txt is considered a best practice rather than an enforceable standard, you may see different results from different parties. It’s also important not to disclose in this file anything you wouldn’t want to become public knowledge, like an admin panel on /admin or something like that.
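Python's standard library ships urllib.robotparser, which can evaluate rules like these before a scraper visits a page. The following is a minimal sketch rather than a complete policy check: the is_allowed helper, the example-scraper user agent, the /private/ rule, and the example.com URLs are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The blanket rule from the example above.
block_everything = "User-agent: *\nDisallow: /\n"

# A hypothetical file that only closes off one directory.
block_private = "User-agent: *\nDisallow: /private/\n"

print(is_allowed(block_everything, "example-scraper", "https://example.com/products"))
print(is_allowed(block_private, "example-scraper", "https://example.com/products"))
print(is_allowed(block_private, "example-scraper", "https://example.com/private/report"))
```

Against a live site, the same parser can load the file itself with set_url() and read() before can_fetch() is called.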


A simple example

To illustrate this, we’ll use Python plus the BeautifulSoup and Requests libraries.

import requests
from bs4 import BeautifulSoup

page = requests.get('https://google.com')
soup = BeautifulSoup(page.text, 'html.parser')

We’ll go through this line-by-line:

page = requests.get('https://google.com')

This uses the requests library to make a request to https://google.com and return the response.

soup = BeautifulSoup(page.text, 'html.parser')

The requests library stores the body of our response in an attribute called text, which we use to give BeautifulSoup our HTML content. We also tell BeautifulSoup to use Python 3’s built-in HTML parser, html.parser.

Now that BeautifulSoup has parsed our HTML text into an object that we can interact with, we can begin to see how information may be extracted.

paragraphs = soup.find_all('p')

Using find_all, we can tell BeautifulSoup to return only the HTML paragraphs (<p>) from the document. If we were looking for a div with a specific ID (#content) in the HTML, we could do that in a few different ways:

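# CSS selector; returns a list of matching elements
elements = soup.select('#content')
# or, by tag name and id; also returns a list
elements = soup.find_all('div', id='content')
# or, by id alone; returns the first match, or None if there is none
element = soup.find(id='content')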
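Because these calls hand back different kinds of values, the code that follows them has to treat the results differently. Here is a small sketch on a made-up document; the sidebar ID is hypothetical and deliberately absent from the markup so that the missing case is visible.

```python
from bs4 import BeautifulSoup

# Made-up document containing a single div with the ID we are after.
html = '<div id="content"><p>Hello</p></div>'
soup = BeautifulSoup(html, "html.parser")

selected = soup.select("#content")              # list of matches
found_all = soup.find_all("div", id="content")  # list of matches
found = soup.find(id="content")                 # first match, or None
first = soup.select_one("#content")             # first CSS match, or None
missing = soup.find(id="sidebar")               # no such element here

# Lists need indexing (or a loop) before an element can be used.
print(selected[0].get_text())

# Single-element lookups should be checked for None before use.
if missing is None:
    print("no sidebar on this page")
```

When only one element is expected, find or select_one avoids the indexing step, at the cost of needing the None check.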

In the Google scenario from above, we can imagine that they have a function that does something similar to grab all the links off of the page for further processing:

links = soup.find_all('a', href=True)

The above snippet will return all of the <a> elements from the HTML which are acting as links to other pages or websites. Most large-scale web scraping implementations will use a function like this to capture local links on the page and outbound links off the page, and then determine some priority for the links’ further processing.

Working with HTML

The most difficult aspect of web scraping is analyzing and learning the underlying HTML of the sites you’ll be scraping. If an HTML element has a consistent ID or set of classes, then we should be able to work with it fairly easily: we can just select it using our HTML parsing library (Nokogiri, BeautifulSoup, etc.). If the element on the page doesn’t have consistent classes or identifiers, we’ll need to access it using a different selector.

Imagine our HTML page contains the following table which we’d like to extract product information from:

NAME     CATEGORY   PRICE
Shirt    Athletic   $19.99
Jacket   Outdoor    $124.99

BeautifulSoup allows us to parse tables and other complex elements fairly simply. Let’s look at how we’d read the table’s rows in Python:

# Find all the HTML tables on the page
tables = soup.find_all('table')

# Loop through all of the tables
for table in tables:
    # Access the table's body, falling back to the table itself
    # when the markup has no <tbody> element
    table_body = table.find('tbody') or table
    # Grab the rows from the table body
    rows = table_body.find_all('tr')

    # Loop through the rows
    for row in rows:
        # Extract each HTML column from the row
        columns = row.find_all('td')

        # Loop through the columns
        for column in columns:
            # Print the column value
            print(column.text)
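Printing each cell is enough to inspect a table, but later processing is usually easier when each row is kept together with its column names. The sketch below shows one way to do that for the product table described above. The markup is an assumption (a header row of <th> cells followed by <td> rows), as are the lowercase keys.

```python
from bs4 import BeautifulSoup

# Assumed markup for the product table; the real page may differ.
html = """
<table>
  <thead>
    <tr><th>NAME</th><th>CATEGORY</th><th>PRICE</th></tr>
  </thead>
  <tbody>
    <tr><td>Shirt</td><td>Athletic</td><td>$19.99</td></tr>
    <tr><td>Jacket</td><td>Outdoor</td><td>$124.99</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Column names come from the header cells.
headers = [th.get_text(strip=True).lower() for th in table.find_all("th")]

records = []
for row in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if not cells:
        # The header row contains only <th> cells, so skip it.
        continue
    # Pair each cell with its column name.
    records.append(dict(zip(headers, cells)))

print(records)
```

Each record is a plain dictionary, so the rows can be filtered, sorted, or written out without referring back to the HTML.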