A lot of old, outdated articles about web scraping tools are floating around the web, and they are just not what you’d expect in 2018.

This field has seen a lot of updates in the last few years, so I’m writing this article so that people stop using old code examples and old libraries from the internet.

So, in this article I want to talk about the tools you should use in 2018 for web scraping with NodeJs.

Let’s get started!

Libraries

Here are all the libraries that have handled the heavy lifting in my scraping work. I’ve been glad to use them, and I still use them today.

1. Puppeteer

Puppeteer is one of the best scraping tools out there, even though scraping isn’t actually what it was built for; it just happens to be a great solution.

Puppeteer is a NodeJs library that lets you automate the Chrome / Chromium browser with a great API.

Find out more about Puppeteer in my previous article, NodeJs Scraping with Puppeteer

I’m going to go right into the pros and cons of Puppeteer.

When to use it

Puppeteer is best used when you want to build a scraper fast and have something working as soon as possible. It just works.

Another case is when you need to wait for the page to actually render, as with websites built with Angular, React, and other frameworks that generate their HTML content dynamically.

Pros

Very good, thorough documentation

Easy and fast to get started with and to use

A rich API with a lot of functionality

Full support for async / await

Cons

Literally the only thing yet to be officially implemented is an easy way to download files.

Other than that, I still haven’t had any problems with it or found anything else to complain about.

Puppeteer Documentation

2. Request-Promise

Request-Promise is basically a variation of the well-known request library from npm, with almost all of its functions promisified, which means you can chain actions and wait for them to finish.

I’ve also written about how you can make a scraper using Request Promise and Cheerio, so you can check it out.

When to use it

Compared to Puppeteer, this covers the exact opposite use case.

Request-Promise / Request should be used when your content is not dynamically rendered and everything you want to scrape is already present in the actual response from the server.

Instagram, for example, renders only after every resource has loaded, generating its HTML on the client. So if you scrape an Instagram page this way, you will only get the initial content ( pre-rendering ), which can be either good or bad, depending on the use case.

Pros

A faster solution than an automated browser

Good documentation and community around it with a lot of examples



Cons

Can get tricky with more complex requests

You need to be very careful with sessions, cookies, authentication and many other things, depending on which site you are scraping

Using Request / Request-Promise is a more advanced solution for websites that have an authentication system and / or check for specific headers in the request

Request-Promise on GitHub

3. Cheerio

Cheerio is a library you simply must learn: it simplifies a lot of things when building scrapers, especially when it comes to parsing HTML content.

It acts like jQuery, so if you already know some jQuery, this will seem familiar to you.

I’ve also written a scraper example with Cheerio on my previous blog post.

When to use it

Almost any time you are dealing with raw HTML content, use Cheerio.

Cheerio on GitHub

4. NightmareJs

NightmareJs is another high-level browser automation library that runs Electron as a browser.

It has been around longer than Puppeteer, but since Puppeteer entered the field, it has quickly grown past Nightmare and taken over.

Nightmare can still be a solution worth considering in some cases in 2018, which is why I am mentioning it.

When to use it

I actually recommend using NightmareJs over Puppeteer in some cases, namely when you don’t need everything that Puppeteer has to offer.

Nightmare is basically a condensed, simplified version of Puppeteer: it has far fewer functions, but they are straight to the point.

Pros

Has been around for many years ( almost 5 already )

Has plugins which give you more flexibility ( including support for downloads of files )



Cons

No longer actively updated

No support for multiple tabs at once

Can be buggy when dealing with complicated / advanced scrapers.

Conclusion

Even after reading all this, please keep in mind that there are a lot more tools and resources you can work with. It all depends on what you’re trying to do and how.

With these 4 tools I cover 95% of what I need to do in terms of scraping, but that, of course, is specific to me. Your case may differ.

I’m really open to suggestions if you have any; I’d love to give other tools a try as well.

So, let me know in the comments 🔥

Want to learn more?

Also, if you want to learn more and go much more in-depth, including downloading files, I have a great course with many more hours of content on web scraping with NodeJs.