This article is an educational piece ONLY. If you enjoy the content provided by the New York Times, please SUBSCRIBE TO THEIR SERVICE. You should NOT use this knowledge to read the New York Times for free forever.

My friend group shares a lot of New York Times articles. Like, a lot of them. And like I’m sure other developers have done in the past, I often hit a paywall and then simply use developer tools (because the paywall is run completely through JavaScript... 😱) or the old Incognito Mode trick to get around it. But recently I’ve been wondering if there is a more interesting way to go about this. If Google can see the full article, and if the full article is on the page behind all that client-side JavaScript, then all it should really take is a scraper or web crawler to get the article, the same way you would manually. So I decided to test the theory.

Some basics of building a scraper

To start, I set up my basic scraper stack: Node.js with the request module and Cheerio.

In terminal:

npm init
npm i --save request cheerio

Request lets you make an HTTP request to another page, similar to how a browser would, and Cheerio acts as a server-side jQuery so you can easily query the retrieved document.

The plan? Simple. Request data from the article we want, and scrape out the contents of the article. I also wanted to present this in a readable format though. For that, it seemed appropriate for this micro-app to also run a web server.

Express seemed like the easiest choice, so I went that route – and threw EJS along with it for easy templating of my reader page.

In terminal:

npm i --save express ejs

With this, we now have the request module in our app for requesting from NYT. We also have Cheerio for querying the DOM. And we have Express and EJS to serve up our own version of the article. A little overkill, but super convenient for running a local web server to read articles when needed.

And one last thing before we jump into it: let’s create two files. The first is our main application file, index.js, which will run our server and the basic scrape logic. The second is our view file, article.ejs, which will house the HTML/CSS/etc. for the page where we will read the article.

You can create these files through your IDE, or in terminal – but since we’re already in Terminal – let’s just do it there.

In Terminal:

mkdir views
touch views/article.ejs
touch index.js

Article Structure & View File

If you look at any NYT article, you will notice they do a pretty good job of following solid SEO standards. This actually makes our job a little easier. First, we can assume the H1 on the article page is the article title. Every article I pulled up followed this same standard. We’ll come back to this discovery in a second.

Similarly, the article appears to be wrapped in its own parent element. While this one is less identifiable, it is still given a name attribute of articleBody.

New York Times H1 tag used for the article title.

New York Times Article Wrapper

This new understanding of NYT’s on-page structure lets us safely assume that the average article page needs two variables: a title, using the contents of the article’s H1 tag, and article data, using the inner HTML of the articleBody element. For readability, we’ll also add Bootstrap from a CDN.

Given this info, we can now structure out our view file with a title, and some data.

In views/article.ejs

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Bootstrap 101 Template</title>
    <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" crossorigin="anonymous">
  </head>
  <body>
    <div class="container">
      <h1><%- title %></h1>
      <article>
        <%-data%>
      </article>
    </div>
  </body>
</html>

Prepping the Node App

With our fancy new view file set up, let’s structure our main application file. We are going to want to load in our dependencies from above, and set up our view engine (EJS). We’ll also create a simple listener for our web server.

index.js

const express = require('express');
const request = require('request');
const cheerio = require('cheerio');

const app = express();
app.set('view engine', 'ejs');

app.get('/', function(req, res){
  res.send("I'm alive!");
});

/* LISTEN ON PORT */
app.listen('8081');
console.log('NYT Started on Port 8081');

exports = module.exports = app;

If you save this file, and then run it using node index.js – you should see the output “NYT Started on Port 8081”. Load up your web browser of choice, and navigate to http://localhost:8081/. You should see your server respond with “I’m alive!”. Good job!

Scraping the Data, and Serving the View

For the sake of simplicity, we want to be able to copy paste an article into this app as quickly as possible. Let’s make some modifications to our main route (The one currently stating that it is alive) – and make it instead fetch that NYT article for us.

The flow will go something like this:

Take a New York Times URL into the app, passed through a URL parameter.

Use the request module to get the contents of that URL.

Use the Cheerio module to create a queryable object based on the retrieved markup.

Query the data that we need and pass it to the view.

Serve the view over our HTTP server.

As mentioned above, we know the h1 is the title – and we know the named element “articleBody” is the content. So we can query these elements to retrieve our pieces of text required for our view.

To accomplish the above, I hacked together this quick script to replace our main route in index.js:

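app.get('/', function(req, res){
  // The article URL arrives as ?url=...
  const url = req.query.url ? req.query.url : null;
  if (!url) {
    return res.json({"error": "URL Param not defined."});
  }
  console.log("Scrape Request initiated: " + url);
  request(url, function(error, response, html){
    if (error) {
      // Don't render the view with undefined values when the fetch fails.
      console.log("Scrape failed for " + url + ": " + error.message);
      return res.status(502).json({"error": "Could not retrieve the article."});
    }
    console.log("Successfully retrieved data from " + url);
    const $ = cheerio.load(html);
    // Inner HTML of the first h1, so the view's own <h1> does not wrap a second <h1>.
    const title = $('h1').first().html();
    const data = $("[name=articleBody]").html();
    console.log("Outputting Scrape Data for " + url);
    res.render('article', {data, title});
  });
});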

While this code is not exactly production safe, it works for what we need to do. As you can see, we even added some VERY simple error handling to enforce a URL parameter, as well as dodge nasty errors from request (Though this error catch could be cleaned up a lot).
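One gap worth illustrating: the route will fetch whatever URL it is handed, including addresses on your own machine or network. A minimal sketch of a guard, assuming we only ever want http(s) pages on nytimes.com hosts, could look like the following. The helper name isAllowedArticleUrl is hypothetical, and it relies only on the URL class built into Node.

```javascript
// Hypothetical helper: accept only http(s) URLs whose host is nytimes.com
// or one of its subdomains. Anything unparseable is rejected.
function isAllowedArticleUrl(rawUrl) {
  if (typeof rawUrl !== 'string') {
    return false; // e.g. a missing or repeated ?url= parameter
  }
  let parsed;
  try {
    parsed = new URL(rawUrl);
  } catch (err) {
    return false; // not an absolute URL
  }
  const isHttp = parsed.protocol === 'https:' || parsed.protocol === 'http:';
  const host = parsed.hostname.toLowerCase();
  const isNyt = host === 'nytimes.com' || host.endsWith('.nytimes.com');
  return isHttp && isNyt;
}

console.log(isAllowedArticleUrl('https://www.nytimes.com/2020/02/29/health/coronavirus-flu.html')); // true
console.log(isAllowedArticleUrl('http://localhost:8081/'));              // false
console.log(isAllowedArticleUrl('https://nytimes.com.evil.example/x'));  // false
```

In the route, a check like this would sit right after reading the url parameter, answering with a 400-style error when it returns false.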

The final index.js file:

const express = require('express');
const request = require('request');
const cheerio = require('cheerio');

const app = express();
app.set('view engine', 'ejs');

app.get('/', function(req, res){
  // The article URL arrives as ?url=...
  const url = req.query.url ? req.query.url : null;
  if (!url) {
    return res.json({"error": "URL Param not defined."});
  }
  console.log("Scrape Request initiated: " + url);
  request(url, function(error, response, html){
    if (error) {
      // Don't render the view with undefined values when the fetch fails.
      console.log("Scrape failed for " + url + ": " + error.message);
      return res.status(502).json({"error": "Could not retrieve the article."});
    }
    console.log("Successfully retrieved data from " + url);
    const $ = cheerio.load(html);
    // Inner HTML of the first h1, so the view's own <h1> does not wrap a second <h1>.
    const title = $('h1').first().html();
    const data = $("[name=articleBody]").html();
    console.log("Outputting Scrape Data for " + url);
    res.render('article', {data, title});
  });
});

/* LISTEN ON PORT */
app.listen('8081');
console.log('NYT Started on Port 8081');

exports = module.exports = app;

Let’s try it out!

Run your application again using node index.js. Then browse to your article of choice using your url param.



Example: http://localhost:8081/?url=https://www.nytimes.com/2020/02/29/health/coronavirus-flu.html
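The example URL works as pasted because the article address has no query string of its own. If the address carries extra parameters (tracking tags, for instance), its & characters would be read as separators for our app's own parameters, and the url value would be cut short. A small sketch of building the link with proper encoding follows; the helper name buildReaderUrl and the default base address are assumptions for illustration.

```javascript
// Hypothetical helper: build the local reader link for an article,
// percent-encoding the article address so it survives as one parameter.
function buildReaderUrl(articleUrl, base = 'http://localhost:8081/') {
  const readerUrl = new URL(base);
  readerUrl.searchParams.set('url', articleUrl);
  return readerUrl.toString();
}

console.log(buildReaderUrl('https://www.nytimes.com/2020/02/29/health/coronavirus-flu.html'));
// http://localhost:8081/?url=https%3A%2F%2Fwww.nytimes.com%2F2020%2F02%2F29%2Fhealth%2Fcoronavirus-flu.html
```

Express decodes the parameter again before it reaches req.query.url, so the route itself needs no change.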

And that’s it! You should see your article rendered, ad free, using your local server as a proxy.

And because I can’t say it enough…

This is not intended to be an exploit to gain access to read the New York Times for free. This is meant purely as an educational example of a use case for scraper technology.

If you enjoy the New York Times, you should subscribe. And if you ARE the New York Times, I would love to work with you to find a way to enhance your paywall.