I’m currently working on a side project where I want to scrape and store the blog posts on certain pages. For this project I chose to use Node.js. I’ve been working more with JavaScript lately, so I figured this would be a fun thing to do with Node instead of Ruby, Python, or whatever.

The tooling

There are two really great tools to use when scraping websites with Node.js: Axios and Cheerio.

Using these two tools together, we can grab the HTML of a web page, load it into Cheerio (more on this later), and query the elements for the information we need.

Axios

Axios is a promise-based HTTP client for both the browser and Node.js. This is a well-known package used in tons and tons of projects. Most of the React and Ember projects I work on use Axios to make API calls.

We can use axios to get the HTML of a website:

```javascript
import axios from 'axios';

await axios.get('https://www.realtor.com/news/real-estate-news/');
```

☝️ resolves with a response object whose data property contains the HTML of the URL we request.

Cheerio

Cheerio is the most amazing package I had never heard of until now. Essentially, Cheerio gives you jQuery-like queries on the DOM structure of the HTML you load! It’s amazing and allows you to do things like this:

```javascript
const cheerio = require('cheerio')

const $ = cheerio.load('<h2 class="title">Hello world</h2>')
const titleText = $('h2.title').text();
```

If you’re at all familiar with JS development, this should feel very familiar to you.

The final script

With Axios and Cheerio, making our Node.js scraper is dead simple. We call a URL with Axios and load the returned HTML into Cheerio. Once our HTML is loaded into Cheerio, we can query the DOM for whatever information we want!

```javascript
import axios from 'axios';
import cheerio from 'cheerio';

export async function scrapeRealtor() {
  const response = await axios.get('https://www.realtor.com/news/real-estate-news/');
  // cheerio.load is synchronous, so no await is needed here.
  const $ = cheerio.load(response.data);

  let data = [];
  $('.site-main article').each((i, elem) => {
    // Only keep the first four articles.
    if (i <= 3) {
      data.push({
        image: $(elem).find('img.wp-post-image').attr('src'),
        title: $(elem).find('h2.entry-title').text(),
        excerpt: $(elem).find('p.hide_xxs').text().trim(),
        link: $(elem).find('h2.entry-title a').attr('href')
      });
    }
  });

  console.log(data);
}
```

The output

We now have our scraped information!