Reddit is one of the largest, and perhaps most notable, online communities today. Consistently ranked in the top 10 for overall readership worldwide, Reddit has become a cornerstone for endless areas of discourse, from engaging in political debate to sharing baby kitten pictures, and everything in between. Each Reddit page holds a wealth of data in its posts, comments, karma scores, and other identifiers, and with a simple JavaScript program, we can extract this information and manipulate the data however we want.


In this article, I will demonstrate a technique for using web scraping to retrieve the titles and karma scores of the top posts of a given subreddit. It is my hope that readers will gain a basic understanding of how to use Cheerio to scrape a page.

To do this, we will be using Cheerio, a light and flexible library for searching and manipulating HTML markup. Because it implements a subset of core jQuery, Cheerio lets users jump right into web scraping with familiar syntax.

We will be using two packages: Cheerio (of course) and request, a simple module for making HTTP calls. Given a URL, request can quickly return the raw HTML body of the page.

To begin, we pass a URL to our request call. Request also takes a callback function for handling the resulting body. Within this callback, we load the markup into Cheerio:

const request = require('request');
const cheerio = require('cheerio');

request('https://www.reddit.com/r/javascript/', (err, res, body) => {
  // Stop early if the HTTP call itself failed
  if (err) throw err;

  // Load HTML body into cheerio
  const $ = cheerio.load(body);

  // Cheerio functions go here
});
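The callback receives an error, a response object, and the body, so it is worth checking the first two before handing the body to Cheerio. Here is a small sketch of such a check; `extractBody` is a hypothetical helper name of my own, and treating anything other than status 200 as a failure is an assumption you may want to relax.

```javascript
// Hypothetical helper: decide whether a response is safe to pass to cheerio.load()
function extractBody(err, res, body) {
  // A network-level failure arrives as the first callback argument
  if (err) throw err;

  // Assumption for this sketch: only a 200 response carries the page we want
  if (!res || res.statusCode !== 200) {
    const status = res ? res.statusCode : 'none';
    throw new Error(`Unexpected status code: ${status}`);
  }

  return body;
}

// Intended use inside the request callback:
//   request(url, (err, res, body) => {
//     const $ = cheerio.load(extractBody(err, res, body));
//   });
```

Failing loudly here makes it much easier to tell an empty result caused by a wrong selector from one caused by a blocked or failed request.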

Now comes the fun part: scraping! Our goal is to analyze the HTML body of our site and work out which elements and attributes correspond to the data we are looking for. From here, we use Cheerio’s jQuery-like syntax to pinpoint the desired elements and pull their contents back into our JavaScript program.

For this example, we will dive into the JavaScript subreddit (https://www.reddit.com/r/javascript/). I will use Chrome’s inspect tool (shortcut Cmd + Shift + C on macOS) to find which elements hold the data we hope to retrieve. For example, to locate the element that holds a given post’s score, we find that each score is stored in a div element with the classes score and unvoted. In addition, we see that the score we are looking for is stored under the attribute title:

Note both the black comment bubble to the left and the highlighted div element on the right. The div includes a title attribute with the corresponding score.

With this information in hand, we tell Cheerio, through functions known as extractors, to traverse the HTML body and retrieve certain elements. Continuing our example, we add an extractor to pull every element with the class score (we don’t need unvoted; score is specific enough). From here, we chain on another type of function, known as an attribute function, to read the value of each title attribute and push it into a predefined array:

const request = require('request');
const cheerio = require('cheerio');

// Two separate arrays (chained assignment would make both names share one array)
const scoreArr = [];
const titleArr = [];

request('https://www.reddit.com/r/javascript/', (err, res, body) => {
  if (err) throw err;

  // Load HTML body into cheerio
  const $ = cheerio.load(body);

  // Scrape karma scores
  $('.score').attr('title', (i, val) => {
    scoreArr.push(val);
    return val; // hand the value back so the attribute is left as it was
  });

  // Log inside the callback, once the response has actually arrived
  console.log(scoreArr);
  // ['12', '134', ...] Scores of top posts of r/javascript at time of writing
});

Let’s continue by extracting the titles. From the picture below, we see that each title is an a (hyperlink) element with four classes. This time, the title we are seeking is located not within an attribute but inside the element itself, as its text content:

Here, we see titles are ‘a’ elements with 4 classes

To extract the text inside each element, we use the .text() function. Calling .text() on the whole selection returns the combined text of every match as a single string, so instead we use .each() to visit each matched element and call .text() on it individually (a Cheerio selection is not an array, so it has no .forEach() method), like so:

const request = require('request');
const cheerio = require('cheerio');

const scoreArr = [];
const titleArr = [];

request('https://www.reddit.com/r/javascript/', (err, res, body) => {
  if (err) throw err;

  // Load HTML body into cheerio
  const $ = cheerio.load(body);

  // Scrape karma scores
  $('.score').attr('title', (i, val) => {
    scoreArr.push(val);
    return val; // hand the value back so the attribute is left as it was
  });

  // Scrape post titles
  $('a.title').each((i, el) => {
    titleArr.push($(el).text());
  });

  // Log inside the callback, once the response has actually arrived
  console.log(scoreArr);
  // ['12', '134', ...] Scores of top posts of r/javascript at time of writing

  console.log(titleArr);
  // ["Showoff Saturday...", "Making the globe...", ...]
});
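With titles and scores in hand, a natural next step is to combine them into one list of posts. The sketch below pairs the two arrays by position, which assumes both were scraped in the same order and that every post produced exactly one entry in each; `pairPosts` and the sample values are hypothetical, not real scraped data.

```javascript
// Hypothetical helper: pair each title with the score at the same index
function pairPosts(titles, scores) {
  // Stop at the shorter array so no post ends up with a missing field
  const count = Math.min(titles.length, scores.length);
  const posts = [];
  for (let i = 0; i < count; i++) {
    posts.push({ title: titles[i], score: Number(scores[i]) });
  }
  return posts;
}

// Placeholder values standing in for scraped results
const posts = pairPosts(['First post', 'Second post'], ['12', '134']);

// Sort a copy from highest to lowest score, leaving `posts` untouched
const byScore = [...posts].sort((a, b) => b.score - a.score);

console.log(byScore[0]); // { title: 'Second post', score: 134 }
```

If the two arrays ever come back with different lengths, that is a hint that one of the selectors is matching more (or fewer) elements than intended.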

I personally liken scraping to detective work. Since no two pages are constructed the same, we need to employ a different set of extractors and attribute functions to piece together the information we want. This often involves trial and error before you land on the exact information you’re looking for.

This example provides a straightforward structure to practice basic Cheerio functions. Cheerio includes a wealth of more complex methods of traversing the DOM. Go ahead and try web scraping your favorite sites!