A Guide to Automating & Scraping the Web with JavaScript (Chrome + Puppeteer + Node JS)

Learn to Automate and Scrape the web with Headless Chrome

What Will We Learn?

In this tutorial you’ll learn how to automate and scrape the web with JavaScript. To do this, we’ll use Puppeteer. Puppeteer is a Node library that provides a high-level API for controlling headless Chrome. Headless Chrome is a way of running the Chrome browser without its usual graphical interface.

If none of that makes any sense, all you really need to know is that we’ll be writing JavaScript code that will automate Google Chrome.

Before Starting

Before starting, you’ll need to have Node 8+ installed on your computer. You can install it here. Make sure to choose the “Current” version, as it is 8+.
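
Not sure which version you already have? You can check from your terminal:

node --version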

If you’ve never worked with Node before and want to learn, check out: Learn Node JS — The 3 Best Online Node JS Courses.

Once you have Node installed, create a new project folder and install Puppeteer. Puppeteer comes with a recent version of Chromium that is guaranteed to work with the API:

npm install --save puppeteer

Example #1 — Taking a Screenshot

Once you have Puppeteer installed, we’re going to walk through a simple example first. This example is straight from the Puppeteer documentation (with minor changes). The code we’ll walk through will take a screenshot of any website you tell it to.

To start out, create a file named test.js and copy in the below code:
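
const puppeteer = require('puppeteer');

async function getPic() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://google.com');
  await page.screenshot({path: 'google.png'});

  await browser.close();
}

getPic();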

Let’s walk through this example line by line.

Line 1: We require our Puppeteer dependency that we installed earlier.

Lines 3–10: This is our main function, getPic(). This function will hold all of our automation code.

Line 12: We invoke our getPic() function (running the function).

Something important to note is that our getPic() function is an async function and makes use of the new ES2017 async/await features. Because this function is asynchronous, when it is called it returns a Promise. When the async function finally returns a value, the Promise will resolve (or reject if there is an error).

Since we’re using an async function, we can use the await expression which will pause the function execution and wait for the Promise to resolve before moving on. It’s okay if none of this makes sense right now. It will become clearer as we continue with the tutorial.
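
If it helps, here’s a tiny standalone sketch (separate from our tutorial code) showing those mechanics in isolation: an async function returns a Promise, and await pauses execution until a Promise resolves.

// delay() returns a Promise that resolves after ms milliseconds
const delay = (ms) => new Promise(resolve => setTimeout(resolve, ms));

async function demo() {
  await delay(500); // execution pauses here until the Promise resolves
  return 'done';    // resolves the Promise that demo() returned
}

demo().then(value => console.log(value)); // logs 'done'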

Now that we’ve outlined our main function, let’s dive into its inner workings:

Line 4:

const browser = await puppeteer.launch();

This is where we actually launch Puppeteer. We’re essentially launching an instance of Chrome and setting it equal to our newly created browser variable. Because we’ve also used the await keyword, the function will pause here until our Promise resolves (until we either successfully create our instance of Chrome, or error out).

Line 5:

const page = await browser.newPage();

Here we create a new page in our automated browser. We wait for the new page to open and save it to our page variable.

Line 6:

await page.goto('https://google.com');

Using the page that we created in the last line of code, we can now tell our page to navigate to a URL. In this example, we’re navigating to Google. Our code will pause until the page has loaded.

Line 7:

await page.screenshot({path: 'google.png'});

Now we’re telling Puppeteer to take a screenshot of the current page. The screenshot() method takes an object as a parameter, which is where we can customize the save location of our .png screenshot. Again, we’ve used the await keyword, so our code pauses while the action occurs.
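
As a side note, the options object accepts more than just path. For instance, setting fullPage to true captures the entire scrollable page rather than just the visible viewport:

await page.screenshot({path: 'google.png', fullPage: true});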

Line 9:

await browser.close();

Finally, we have reached the end of the getPic() function and we close down our browser.

Running the Example

You can run the sample code above with Node:

node test.js

And here’s the resulting screenshot:

Awesome! For added fun (and easier debugging) we can run our code in a non-headless manner.

What exactly does this mean? Try it out for yourself and see. Change line 4 of your code from this:

const browser = await puppeteer.launch();

to this:

const browser = await puppeteer.launch({headless: false});

And then run again with Node:

node test.js

Pretty cool, huh? When we run with {headless: false}, you can actually watch Google Chrome as it works through your code.
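
If you’d like to watch in slow motion, puppeteer.launch() also accepts a slowMo option, which delays each Puppeteer operation by the given number of milliseconds:

const browser = await puppeteer.launch({headless: false, slowMo: 250});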

We’re going to do one last thing with this code before moving on. Remember how our screenshot was a little off-center? Well, that’s because our page was a little small. We can change the size of our page by adding in this line of code:

await page.setViewport({width: 1000, height: 500});

Which results in this much nicer looking screenshot:

Here’s what our final code for this example looks like:
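
const puppeteer = require('puppeteer');

async function getPic() {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('https://google.com');
  await page.setViewport({width: 1000, height: 500});
  await page.screenshot({path: 'google.png'});

  await browser.close();
}

getPic();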

Example #2 — Let’s Scrape Some Data

Now that you know the basics of how Headless Chrome and Puppeteer work, let’s look at a more complex example where we actually get to scrape some data.

First, take a look at the API documentation for Puppeteer here. As you’ll see, there are a TON of different methods we can use not only to click around on a website, but also to fill out forms, type things, and read data.

In this tutorial we’re going to scrape Books to Scrape, which is a fake bookstore specifically set up to help people practice scraping.

In the same directory create a file named scrape.js and insert the following boilerplate code:
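
const puppeteer = require('puppeteer');

let scrape = async () => {
  // Actual scraping code will go here...

  // Return a value
};

scrape().then((value) => {
  console.log(value); // Success!
});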

Ideally the above code makes sense to you after going through the first example. If not, that’s ok!

All we’re doing above is requiring the previously installed puppeteer dependency. Then we have our scrape() function where we will input our scraping code. This function will return a value. Finally, we invoke our scrape function and handle the returned value (log it to the console).

We can test the above code by adding in a line of code to the scrape function. Try this out:

let scrape = async () => {
  return 'test';
};

Now run node scrape.js in the console. You should get test returned! Perfect, our returned value is being logged to the console. Now we can get started filling out our scrape function.

Step 1: Setup

First thing we need to do is create an instance of our browser, open up a new page, and navigate to a URL. Here’s how we do that:

let scrape = async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();

  await page.goto('http://books.toscrape.com/');
  await page.waitFor(1000);

  // Scrape

  browser.close();
  return result;
};

Awesome! Let’s break it down line by line:

First, we create our browser and set headless mode to false. This allows us to watch exactly what is going on:

const browser = await puppeteer.launch({headless: false});

Then, we create a new page in our browser:

const page = await browser.newPage();

Next, we go to the books.toscrape.com URL:

await page.goto('http://books.toscrape.com/');

Optionally, I’ve added in a delay of 1000 milliseconds. While normally not necessary, this helps ensure everything on the page has time to load:

await page.waitFor(1000);
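
As an aside, when you know which element you’re waiting for, waiting on a selector tends to be more reliable than a fixed delay. Here’s a sketch, assuming the book listings are marked up with a .product_pod class (verify against the live page):

await page.waitForSelector('.product_pod');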

Finally, after everything is done, we’ll close the browser and return our result.

browser.close();

return result;

Setup is complete. Now, let’s scrape!

Step 2: Scraping

As you’ve probably ascertained by now, Books to Scrape has a big library of real books and fake data on those books. What we’re going to do is select the first book on the page and return the title and price of that book. Here’s the homepage of Books to Scrape. I’m interested in clicking on the first book (outlined in red below).
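
To preview where we’re headed, here’s a rough sketch of the kind of extraction we’ll be building with page.evaluate(). The selectors are assumptions based on the Books to Scrape markup, not the final tutorial code:

const result = await page.evaluate(() => {
  // Grab the first book card on the page (assumed .product_pod class)
  const book = document.querySelector('.product_pod');
  const title = book.querySelector('h3 a').getAttribute('title'); // full book title
  const price = book.querySelector('.price_color').innerText;    // displayed price text
  return {title, price};
});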