What is web scraping?

Web scraping is extracting data from a website. Why would someone want to scrape the web? Here are four examples:

Scraping social media sites to find trending data

Scraping email addresses from websites that publish public emails

Scraping data from another website to use on your own site

Scraping online stores for sales data, product pictures, etc.

Warnings.

Web scraping is against most websites’ terms of service. Your IP address may be banned from a website if you scrape too frequently or maliciously.
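If you end up scraping more than one page, spacing your requests out is a simple courtesy. Here’s a minimal sketch of that idea. It isn’t part of the project we build below: `sleep` and `fetchSequentially` are hypothetical helper names I made up for illustration, and `fetchPage` stands in for whatever function actually makes the request.

```javascript
// Hypothetical helper: resolves after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: calls `fetchPage` for each URL, one at a time,
// pausing `delayMs` between consecutive requests.
async function fetchSequentially(urls, fetchPage, delayMs = 1000) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await sleep(delayMs); // no pause before the first request
    results.push(await fetchPage(urls[i]));
  }
  return results;
}

// Usage with a stand-in fetcher (no network access involved).
const fakeFetch = async (url) => `fetched ${url}`;
fetchSequentially(['page-1', 'page-2'], fakeFetch, 50).then((pages) => {
  console.log(pages);
});
```

A one-second default is only a guess at a polite pace; pick whatever delay suits the site you’re working with.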

What will we need?

For this project we’ll be using Node.js. If you’re not familiar with Node, check out my 3 Best Node.JS Courses.

We’ll also be using two open-sourced npm modules to make today’s task a little easier:

request-promise: Request is a simple HTTP client that allows us to make quick and easy HTTP calls; request-promise is the same client with promise support.

cheerio: jQuery for Node.js. Cheerio makes it easy to select, edit, and view DOM elements.

Project Setup.

Create a new project folder. Within that folder, create an index.js file. We’ll need to install and require our dependencies. Open up your command line, then install and save request, request-promise, and cheerio:

npm install --save request request-promise cheerio

Then require them in our index.js file:

const rp = require('request-promise');

const cheerio = require('cheerio');

Setting up the Request

request-promise accepts an object as input, and returns a promise. The options object needs to do two things:

Pass in the URL we want to scrape.

Tell Cheerio to load the returned HTML so that we can use it.

Here’s what that looks like:



const options = {
  uri: `https://www.yourURLhere.com`,
  transform: function (body) {
    return cheerio.load(body);
  }
};

The uri key is simply the website we want to scrape.

The transform key tells request-promise to take the returned body and load it into Cheerio before returning it to us.
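If the transform step feels abstract, here’s a small sketch you can run without touching the network. `fakeRequest` is a hypothetical stand-in I wrote for illustration, not request-promise’s real implementation: it resolves with a canned body and, if a `transform` function is supplied, passes the body through it first. The uppercase transform is only there to make the effect visible; in our project the transform is `cheerio.load`.

```javascript
// Hypothetical stand-in for request-promise: no HTTP call is made.
function fakeRequest(options) {
  const body = `<h1>Hello from ${options.uri}</h1>`; // canned response body
  return new Promise((resolve) => {
    // Apply the transform, if any, before handing the result to the caller.
    resolve(options.transform ? options.transform(body) : body);
  });
}

const demoOptions = {
  uri: 'https://www.yourURLhere.com',
  transform: function (body) {
    return body.toUpperCase(); // the project uses cheerio.load(body) here
  }
};

fakeRequest(demoOptions).then((result) => {
  console.log(result); // the transformed body, not the raw one
});
```

The point is that whatever the transform returns is what arrives in `.then()`, which is why our real code receives a ready-to-use Cheerio object instead of a raw HTML string.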

Awesome. We’ve successfully set up our HTTP request options! Here’s what your code should look like so far:

const rp = require('request-promise');

const cheerio = require('cheerio');

const options = {
  uri: `https://www.yourURLhere.com`,
  transform: function (body) {
    return cheerio.load(body);
  }
};

Make the Request

Now that the options are taken care of, we can actually make our request. The boilerplate in the documentation for that looks like this:

rp(OPTIONS)

.then(function (data) {

// REQUEST SUCCEEDED: DO SOMETHING

})

.catch(function (err) {

// REQUEST FAILED: ERROR OF SOME KIND

});

We pass in our options object to request-promise, then wait to see if our request succeeds or fails. Either way, we handle the result: the returned data on success, or the error on failure.

Knowing what the documentation says to do, let’s create our own version:

rp(options)

.then(($) => {

console.log($);

})

.catch((err) => {

console.log(err);

});

The code is pretty similar. The big difference is I’ve used arrow functions. I’ve also logged out the returned data from our HTTP request. We’re going to test to make sure everything is working so far.

Replace the placeholder uri with the website you want to scrape. Then, open up your console and type:

node index.js

// LOGS THE FOLLOWING:
{ [Function: initialize]
  fn:
    initialize {
      constructor: [Circular],
      _originalRoot:
        { type: 'root',
          name: 'root',
          namespace: 'http://www.w3.org/1999/xhtml',
          attribs: {},
...

If you don’t see an error, then everything is working so far — and you just made your first scrape!


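Here is the full code of our boilerplate:

const rp = require('request-promise');
const cheerio = require('cheerio');

const options = {
  uri: `https://www.yourURLhere.com`,
  transform: function (body) {
    return cheerio.load(body);
  }
};

rp(options)
  .then(($) => {
    console.log($);
  })
  .catch((err) => {
    console.log(err);
  });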

Using the Data

What good is our web scraper if it doesn’t actually return any useful data? This is where the fun begins.

There are numerous things you can do with Cheerio to extract the data that you want. First and foremost, Cheerio’s selector implementation is nearly identical to jQuery’s. So if you know jQuery, this will be a breeze. If not, don’t worry, I’ll show you.

Selectors

The selector method allows you to traverse and select elements in the document. You can get data and set data using a selector. Imagine we have the following HTML in the website we want to scrape:

<ul id="cities">

<li class="large">New York</li>

<li id="medium">Portland</li>

<li class="small">Salem</li>

</ul>

We can select IDs using #, classes using ., and elements by their tag names, e.g. div.

$('.large').text()
// New York

$('#medium').text()
// Portland

$('li[class=small]').html()
// Salem

Looping

Just like jQuery, we can also iterate through multiple elements with the each() function. Using the same HTML code as above, we can return the inner text of each li with the following code:

const cities = [];

$('li').each(function (i, elem) {
  cities[i] = $(this).text();
});

console.log(cities.join(' '));
// New York Portland Salem

Finding

Imagine we have two lists on our web site:

<ul id="cities">

<li class="large">New York</li>

<li id="c-medium">Portland</li>

<li class="small">Salem</li>

</ul>

<ul id="towns">

<li class="large">Bend</li>

<li id="t-medium">Hood River</li>

<li class="small">Madras</li>

</ul>

We can select each list using its ID, then find the small city/town within each list:

$('#cities').find('.small').text()
// Salem

$('#towns').find('.small').text()
// Madras

find() searches all descendants of the selected element, not just its immediate children. In this example the matching elements happen to be immediate children, but they could be nested at any depth.

Children

Children is similar to find. The difference is that children only searches for immediate children of the selected element.

$('#cities').children('#c-medium').text();

// Portland

Text & HTML

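Up until this point, almost all of my examples have used the .text() function. Hopefully you’ve been able to figure out that this function gets the text of the selected element. You can also use .html() to return the inner HTML of the selected element. For an element that contains only text, the two return the same string; to get the element’s own tag as well, pass the selector to $.html(). Using the towns list from above:

$('#towns .large').text()
// Bend

$('#towns .large').html()
// Bend

$.html('#towns .large')
// <li class="large">Bend</li>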

Additional Methods

There are more methods than I can count, and all of them are covered in the Cheerio documentation.

Chrome Developer Tools

Don’t forget, the Chrome Developer Tools are your friend. In Google Chrome, you can easily find element, class, and ID names using: CTRL + SHIFT + C

Finding class names with chrome dev tools

As you can see in the image above, I’m able to hover over an element on the page, and the element name and class name of the selected element are shown in real time!

Limitations

As Jaye Speaks points out:

MOST websites modify the DOM using JavaScript. Unfortunately Cheerio doesn’t resolve parsing a modified DOM. Dynamically generated content from procedures leveraging AJAX, client-side logic, and other async procedures are not available to Cheerio.

Remember this is an introduction to basic scraping. In order to get started you’ll need to find a static website with minimal DOM manipulation.

Go forth and scrape!

Thanks for reading. You should have the tools necessary now to go forth and scrape static websites!
