In this post we are going to talk about what web scraping is, when you should choose another option instead, and some practical applications of the technique.

What is web scraping?

Web scraping is a technique for extracting information from websites using software that automates the process. These programs usually make HTTP requests directly from code or simulate human behavior by embedding a browser inside the application. A famous example is Googlebot, Google's web crawler, which gathers information from across the web in order to classify and rank sites for its search engine.
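To make the "HTTP requests directly from code" part concrete, here is a minimal sketch using only Python's standard library. The URL and User-Agent string are placeholders, not values from any real site:

```python
from urllib.request import Request

def make_request(url):
    # Identifying your scraper with a User-Agent header is good practice;
    # the string here is just an example.
    return Request(url, headers={"User-Agent": "my-scraper/0.1"})

req = make_request("https://example.com")
# urllib.request.urlopen(req).read() would then download the page's HTML;
# we stop short of the network call to keep this sketch self-contained.
```

From there, the downloaded HTML is just text that the rest of the program parses.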

Let's scrape some data!

In the following example, we are going to extract data from a well-known online marketplace, data that we could later use to feed a machine learning algorithm.

Before writing code

Before we start coding our scraper, we need to understand the page we are going to be working on: the flow, and the elements our code will be interacting with.

Let's say that our machine learning algorithm is going to predict a product's category from its title alone. To train it, we are going to need lots of product titles with their corresponding categories.
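In other words, the dataset we are after is just (title, category) pairs. A tiny sketch of the rows we want to end up with (the example values are made up):

```python
def to_training_rows(titles, category):
    # Every title found under a given search maps to the category the site
    # reported for it; this is the shape our ML algorithm will consume.
    return [{"title": t, "category": category} for t in titles]

rows = to_training_rows(["Wireless Mouse", "USB-C Cable"], "Computer Accessories")
print(rows[0])  # {'title': 'Wireless Mouse', 'category': 'Computer Accessories'}
```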

From here we can infer that our scraper will need to type a term into the search bar (red circle) and then press the search button (green circle).

This takes us to a view where we can see the categories related to our search (red circle) and the product title for each of the results (green line). Our scraper will need to extract that information.
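Extracting the titles can be done with Python's built-in HTML parser. This sketch assumes, hypothetically, that each title sits in an element with `class="product-title"`; the real class names must be read off the page's HTML (and the parser kept simple here does not handle nested tags inside a title):

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collects the text inside any tag carrying class="product-title" (assumed markup)."""
    def __init__(self):
        super().__init__()
        self.titles = []
        self._tag = None  # name of the matching tag we are currently inside

    def handle_starttag(self, tag, attrs):
        if ("class", "product-title") in attrs:
            self._tag = tag

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        if self._tag and data.strip():
            self.titles.append(data.strip())

extractor = TitleExtractor()
extractor.feed('<h2 class="product-title">Wireless Mouse</h2>'
               '<h2 class="product-title">USB-C Cable</h2>')
print(extractor.titles)  # ['Wireless Mouse', 'USB-C Cable']
```

In practice many scrapers reach for a library like BeautifulSoup for this step, but the idea is the same: locate the elements by their markup and pull out their text.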

Only one page of product titles wouldn't be enough data to train a robust model, so our scraper needs to find the pagination buttons (red line), visit the results page by page, and extract all the titles until the next button (green line) is no longer visible (which means we have reached the last page).
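That page-by-page walk can be kept independent of how a single page is fetched and parsed. In this sketch, `fetch_page` and `extract` are placeholders for the request and parsing code, faked here with a two-page "site" just to show the stopping condition:

```python
def scrape_all_titles(fetch_page, extract):
    """Visit pages 1, 2, 3... until a page reports no 'next' button.

    fetch_page(n) -> HTML for page n; extract(html) -> (titles, has_next).
    Both are supplied by the caller, so this loop stays testable offline.
    """
    titles, page, has_next = [], 1, True
    while has_next:
        page_titles, has_next = extract(fetch_page(page))
        titles.extend(page_titles)
        page += 1
    return titles

# Fake two-page site: the 'next' button disappears on page 2.
pages = {1: "page-1", 2: "page-2"}
fake_extract = lambda html: (["title-from-" + html], html != "page-2")
print(scrape_all_titles(pages.get, fake_extract))
# ['title-from-page-1', 'title-from-page-2']
```

With the real site, `extract` would return the titles scraped from the results and whether the next button was found in the page's markup.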

Prerequisites