EDIT: Headless Chrome is shipping in Chrome 59 so the need to use the full Canary path will eventually go away. You can check your Chrome version in the menu under Help > About Google Chrome.

This walkthrough shows you how to get headless Chrome up and running on OSX and explains in detail how to use the code examples provided by the Chrome team.

What problem does Headless Chrome solve?

Headless mode in Chrome is a new way to interact with websites without having a window up on the screen. This might seem like a trivial improvement, but it is actually a huge step forward for scraping data from the web. Currently there are a number of stable but informal solutions for scraping, such as PhantomJS or NightmareJS (which is built on Electron). Neither of these tools is going away (edit: the sole PhantomJS maintainer has resigned) and they’re still great solutions for scraping. If you have existing systems that work with these tools, you can keep using them.

With that said, some users have run into trouble working with PhantomJS and Nightmare. Both have caveats when running on a shell-only system (one without an actual screen or window manager). For example, Nightmare (like any Electron app) needs a virtual display manager installed in order to run. Additionally, since Nightmare is Electron-based, it has a different security model than Chrome and may fail to catch certain security issues during testing that would occur in production.

Which versions of Chrome support headless browsing?

Headless Chrome has been released in Chrome 59. As of April 13, 2017, Chrome Canary is the only channel that contains Chrome 59. This means that right now you need to install Chrome Canary if you want to use headless browsing. This will change in the future, and eventually the Chrome team will bring Chrome 59 into the main Chrome build.

To install Chrome Canary, you can download it or install it with Homebrew:

brew install Caskroom/versions/google-chrome-canary

How do I find headless Chrome so that I can start it?

Many of the examples of using headless Chrome just show using a simple chrome command. This is great for Linux but does not work on OSX since that command does not get installed to your path (yet).

So to find Chrome’s path, let’s fire up our terminal to find where Chrome Canary was installed on our system.

sudo find / -type d -name "*Chrome Canary.app"

You’ll probably get some permissions errors but you’ll also get a path that looks something like this:

/Applications/Google Chrome Canary.app

Since we’ve found the path to Chrome Canary, we can use this to start Chrome in headless mode.
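If you would rather do this lookup from Node than from the shell, the following is a minimal sketch. It assumes the binary sits inside the .app bundle under Contents/MacOS, and it checks the Canary location found above plus the usual location of stable Chrome (an assumption, adjust for your machine). The `findChrome` helper and the `CANDIDATES` list are hypothetical names used only for illustration.

```javascript
const fs = require("fs");

// Candidate locations for the Chrome binary inside the macOS .app bundle.
// The Canary entry matches the path found above; the stable entry is an
// assumed default and may differ on your system.
const CANDIDATES = [
  "/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary",
  "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
];

// Return the first candidate for which `exists` is true, or null if none match.
// `exists` is injectable so the lookup can be exercised without touching the disk.
function findChrome(candidates = CANDIDATES, exists = fs.existsSync) {
  return candidates.find(p => exists(p)) || null;
}

console.log(findChrome() || "Chrome not found in the default locations");
```

This only checks a couple of well-known paths, so it is much faster than searching the whole disk, but it will miss installs in unusual locations.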

How do I start headless Chrome?

Once we have the path to Canary we need to run a single command to start Chrome as a headless server.

/Applications/Google\ Chrome\ Canary.app/Contents/MacOS/Google\ Chrome\ Canary --headless --remote-debugging-port=9222 --disable-gpu https://chromium.org

Notice in particular that we escaped the spaces in the file name and reached deep into the Mac .app bundle for the actual Chrome binary itself. We then passed it the flags needed to start the headless browser and direct it to an initial URL of https://chromium.org. The browser is now waiting for us to connect on port 9222 to give it further instructions. Keep this terminal tab open and the server running, then open another tab where we’ll connect to the browser and give it some instructions.
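The same launch can be sketched from Node with `child_process.spawn`. The flags below are the ones used in the command above; `headlessArgs` and `launchHeadless` are hypothetical helper names, not part of Chrome or any library. Because `spawn` takes the binary path as a plain string, the spaces need no escaping.

```javascript
const { spawn } = require("child_process");

// Build the flag list used in the shell command above.
function headlessArgs(port, url) {
  return [
    "--headless",
    `--remote-debugging-port=${port}`,
    "--disable-gpu",
    url
  ];
}

// Hypothetical launcher: starts Chrome as a child process and returns it.
// The defaults mirror the port and URL used in this walkthrough.
function launchHeadless(binaryPath, port = 9222, url = "https://chromium.org") {
  return spawn(binaryPath, headlessArgs(port, url), { stdio: "inherit" });
}

// Print the flags so they can be compared with the shell version.
console.log(headlessArgs(9222, "https://chromium.org").join(" "));

// Example call (not run here):
// launchHeadless("/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary");
```

Launching from Node keeps the browser and the scraping script in one process tree, which can make cleanup easier, though running the shell command in a separate tab works just as well for this walkthrough.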

How do I scrape data with headless Chrome?

I am going to use Node.js to connect to our running Chrome Canary instance. You’ll need Node installed for this part of the walkthrough.

Let’s generate a generic Node project with a single dependency, the Chrome Remote Interface package, which will help us communicate with Chrome. We’ll also create a blank index.js file:

mkdir my-headless-chrome && cd my-headless-chrome
npm init --yes
npm install --save chrome-remote-interface
touch index.js

Now we’re going to put some code into our index.js. This is the boilerplate example provided by the Chrome team. It instructs the browser to navigate to github.com and captures all of the network requests made on the page by listening to the Network domain on the client.

const CDP = require("chrome-remote-interface");
CDP(client => {
  // extract domains
  const { Network, Page } = client;
  // setup handlers
  Network.requestWillBeSent(params => {
    console.log(params.request.url);
  });
  Page.loadEventFired(() => {
    client.close();
  });
  // enable events then start!
  Promise.all([Network.enable(), Page.enable()])
    .then(() => {
      return Page.navigate({ url: "https://github.com" });
    })
    .catch(err => {
      console.error(err);
      client.close();
    });
}).on("error", err => {
  // cannot connect to the remote endpoint
  console.error(err);
});

Finally, start our Node application.

node index.js

And we’ll see all of the network requests made by Chrome, all without even having an actual browser window!

https://github.com/
https://assets-cdn.github.com/assets/frameworks-12d63ce1986bd7fdb5a3f4d944c920cfb75982c70bc7f75672f75dc7b0a5d7c3.css
https://assets-cdn.github.com/assets/github-2826bd4c6eb7572d3a3e9774d7efe010d8de09ea7e2a559fa4019baeacf43f83.css
https://assets-cdn.github.com/assets/site-f4fa6ace91e5f0fabb47e8405e5ecf6a9815949cd3958338f6578e626cd443d7.css
https://assets-cdn.github.com/images/modules/site/home-illo-conversation.svg
https://assets-cdn.github.com/images/modules/site/home-illo-chaos.svg
https://assets-cdn.github.com/images/modules/site/home-illo-business.svg
https://assets-cdn.github.com/images/modules/site/integrators/slackhq.png
https://assets-cdn.github.com/images/modules/site/integrators/zenhubio.png
https://assets-cdn.github.com/images/modules/site/integrators/travis-ci.png
https://assets-cdn.github.com/images/modules/site/integrators/atom.png
https://assets-cdn.github.com/images/modules/site/integrators/circleci.png
https://assets-cdn.github.com/images/modules/site/integrators/codeship.png
https://assets-cdn.github.com/images/modules/site/integrators/codeclimate.png
https://assets-cdn.github.com/images/modules/site/integrators/gitterhq.png
https://assets-cdn.github.com/images/modules/site/integrators/waffleio.png
https://assets-cdn.github.com/images/modules/site/integrators/heroku.png
https://assets-cdn.github.com/images/modules/site/logos/airbnb-logo.png
https://assets-cdn.github.com/images/modules/site/logos/sap-logo.png
https://assets-cdn.github.com/images/modules/site/logos/ibm-logo.png
https://assets-cdn.github.com/images/modules/site/logos/google-logo.png
https://assets-cdn.github.com/images/modules/site/logos/paypal-logo.png
https://assets-cdn.github.com/images/modules/site/logos/bloomberg-logo.png
https://assets-cdn.github.com/images/modules/site/logos/spotify-logo.png
https://assets-cdn.github.com/images/modules/site/logos/swift-logo.png
https://assets-cdn.github.com/images/modules/site/logos/facebook-logo.png
https://assets-cdn.github.com/images/modules/site/logos/node-logo.png
https://assets-cdn.github.com/images/modules/site/logos/nasa-logo.png
https://assets-cdn.github.com/images/modules/site/logos/walmart-logo.png
https://assets-cdn.github.com/assets/compat-8a4318ffea09a0cdb8214b76cf2926b9f6a0ced318a317bed419db19214c690d.js
https://assets-cdn.github.com/assets/frameworks-6d109e75ad8471ba415082726c00c35fb929ceab975082492835f11eca8c07d9.js
https://assets-cdn.github.com/assets/github-5d29649478f4a2b05588bbd0d25cd56ff5445b21df31b4cccca942ad8687e1e8.js
https://assets-cdn.github.com/images/modules/site/heroes/home-code-bg-alt-01.svg
https://assets-cdn.github.com/static/fonts/roboto/roboto-light.woff
https://assets-cdn.github.com/static/fonts/roboto/roboto-regular.woff
https://assets-cdn.github.com/static/fonts/roboto/roboto-medium.woff
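A list like this gets long quickly, so you may want to summarize it rather than print every URL. Here is a small sketch that counts requests per hostname. `countByHost` is a hypothetical helper, and the `sample` array is just three URLs copied from the output above; in the real script you would collect `params.request.url` into an array inside the `requestWillBeSent` handler instead.

```javascript
// Count how many requests went to each hostname.
function countByHost(urls) {
  const counts = {};
  for (const u of urls) {
    const host = new URL(u).hostname; // URL is a global in current Node versions
    counts[host] = (counts[host] || 0) + 1;
  }
  return counts;
}

// Three URLs taken from the output above, used only as sample input.
const sample = [
  "https://github.com/",
  "https://assets-cdn.github.com/images/modules/site/home-illo-chaos.svg",
  "https://assets-cdn.github.com/static/fonts/roboto/roboto-light.woff"
];

console.log(countByHost(sample));
```

For the sample input this reports one request to github.com and two to assets-cdn.github.com.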

This is great for seeing the assets that might be loaded, but what if we want to walk the DOM for elements that exist in the page? We could use a script like this, which pulls out all of the image tags from GitHub.com:

const CDP = require("chrome-remote-interface");
CDP(chrome => {
  chrome.Page.enable()
    .then(() => {
      return chrome.Page.navigate({ url: "https://github.com" });
    })
    .then(() => {
      // fetch the root node of the document
      chrome.DOM.getDocument((error, params) => {
        if (error) {
          console.error(params);
          return;
        }
        // find every img element under the root
        const options = {
          nodeId: params.root.nodeId,
          selector: "img"
        };
        chrome.DOM.querySelectorAll(options, (error, params) => {
          if (error) {
            console.error(params);
            return;
          }
          // print the attributes of each matching node
          params.nodeIds.forEach(nodeId => {
            const options = { nodeId: nodeId };
            chrome.DOM.getAttributes(options, (error, params) => {
              if (error) {
                console.error(params);
                return;
              }
              console.log(params.attributes);
            });
          });
        });
      });
    });
}).on("error", err => {
  console.error(err);
});

You’ll see that we get the following data structure representing the image tags in the page, including the URLs of the images.

[ 'src', 'https://assets-cdn.github.com/images/modules/site/home-illo-conversation.svg', 'alt', '', 'width', '360', 'class', 'd-block width-fit mx-auto' ]
[ 'src', 'https://assets-cdn.github.com/images/modules/site/home-illo-chaos.svg', 'alt', '', 'class', 'd-block width-fit mx-auto' ]
[ 'src', 'https://assets-cdn.github.com/images/modules/site/home-illo-business.svg', 'alt', '', 'class', 'd-block width-fit mx-auto mb-4' ]
[ 'src', 'https://assets-cdn.github.com/images/modules/site/integrators/slackhq.png', 'alt', '', 'class', 'd-block integrations-collage-img width-fit mx-auto' ]
[ 'src', 'https://assets-cdn.github.com/images/modules/site/integrators/zenhubio.png', 'alt', '', 'class', 'd-block integrations-collage-img width-fit mx-auto' ]
[ 'src', 'https://assets-cdn.github.com/images/modules/site/integrators/travis-ci.png', 'alt', '', 'class', 'd-block integrations-collage-img width-fit mx-auto' ]
[ 'src', 'https://assets-cdn.github.com/images/modules/site/integrators/atom.png', 'alt', '', 'class', 'd-block integrations-collage-img width-fit mx-auto' ]
[ 'src', 'https://assets-cdn.github.com/images/modules/site/integrators/circleci.png', 'alt', '', 'class', 'd-block integrations-collage-img width-fit mx-auto' ]
[ 'src', 'https://assets-cdn.github.com/images/modules/site/integrators/codeship.png', 'alt', '', 'class', 'd-block integrations-collage-img width-fit mx-auto' ]
[ 'src', 'https://assets-cdn.github.com/images/modules/site/integrators/codeclimate.png', 'alt', '', 'class', 'd-block integrations-collage-img width-fit mx-auto' ]
[ 'src', 'https://assets-cdn.github.com/images/modules/site/integrators/gitterhq.png', 'alt', '', 'class', 'd-block integrations-collage-img width-fit mx-auto' ]
[ 'src', 'https://assets-cdn.github.com/images/modules/site/integrators/waffleio.png', 'alt', '', 'class', 'd-block integrations-collage-img width-fit mx-auto' ]
[ 'src', 'https://assets-cdn.github.com/images/modules/site/integrators/heroku.png', 'alt', '', 'class', 'd-block integrations-collage-img width-fit mx-auto' ]
[ 'src', 'https://assets-cdn.github.com/images/modules/site/logos/airbnb-logo.png', 'alt', 'Airbnb', 'class', 'logo-img px-2 px-sm-4 px-md-5 px-lg-0' ]
[ 'src', 'https://assets-cdn.github.com/images/modules/site/logos/sap-logo.png', 'alt', 'SAP', 'class', 'logo-img px-2 px-sm-4 px-md-5 px-lg-0' ]
[ 'src', 'https://assets-cdn.github.com/images/modules/site/logos/ibm-logo.png', 'alt', 'IBM', 'class', 'logo-img px-2 px-sm-4 px-md-5 px-lg-0' ]
[ 'src', 'https://assets-cdn.github.com/images/modules/site/logos/google-logo.png', 'alt', 'Google', 'class', 'logo-img px-2 px-sm-4 px-md-5 px-lg-0' ]
[ 'src', 'https://assets-cdn.github.com/images/modules/site/logos/paypal-logo.png', 'alt', 'PayPal', 'class', 'logo-img px-2 px-sm-4 px-md-5 px-lg-0' ]
[ 'src', 'https://assets-cdn.github.com/images/modules/site/logos/bloomberg-logo.png', 'alt', 'Bloomberg', 'class', 'logo-img px-2 px-sm-4 px-md-5 px-lg-0' ]
[ 'src', 'https://assets-cdn.github.com/images/modules/site/logos/spotify-logo.png', 'alt', 'Spotify', 'class', 'logo-img px-2 px-sm-4 px-md-5 px-lg-0' ]
[ 'src', 'https://assets-cdn.github.com/images/modules/site/logos/swift-logo.png', 'alt', 'Swift', 'class', 'logo-img px-2 px-sm-4 px-md-5 px-lg-0' ]
[ 'src', 'https://assets-cdn.github.com/images/modules/site/logos/facebook-logo.png', 'alt', 'Rails', 'class', 'logo-img px-2 px-sm-4 px-md-5 px-lg-0' ]
[ 'src', 'https://assets-cdn.github.com/images/modules/site/logos/node-logo.png', 'alt', 'Node', 'class', 'logo-img px-2 px-sm-4 px-md-5 px-lg-0' ]
[ 'src', 'https://assets-cdn.github.com/images/modules/site/logos/nasa-logo.png', 'alt', 'Nasa', 'class', 'logo-img px-2 px-sm-4 px-md-5 px-lg-0' ]
[ 'src', 'https://assets-cdn.github.com/images/modules/site/logos/walmart-logo.png', 'alt', 'Walmart', 'class', 'logo-img px-2 px-sm-4 px-md-5 px-lg-0' ]
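Each of these arrays interleaves attribute names and values, which is awkward to work with directly. A minimal sketch for pairing them up into a plain object might look like this. `attributesToObject` is a hypothetical helper, not part of chrome-remote-interface, and the sample input is one of the arrays printed above.

```javascript
// Turn a flat [name1, value1, name2, value2, ...] array into an object.
// A trailing name without a value is ignored.
function attributesToObject(attributes) {
  const result = {};
  for (let i = 0; i + 1 < attributes.length; i += 2) {
    result[attributes[i]] = attributes[i + 1];
  }
  return result;
}

// Sample input copied from the output above.
const img = attributesToObject([
  "src", "https://assets-cdn.github.com/images/modules/site/logos/node-logo.png",
  "alt", "Node",
  "class", "logo-img px-2 px-sm-4 px-md-5 px-lg-0"
]);

console.log(img.src, img.alt);
```

In the script above, you could call this on `params.attributes` before logging, so that fields like `src` and `alt` can be read by name.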

Happy scraping!