I'm working on a project that requires scraping recipes from the front end of a variety of sites. An incredible journey! And the friend I made along the way was right beside me all along!

Most of you probably know Nokogiri as the point in your Ruby on Rails install where you take a break to get a beverage. (Why is Nokogiri bundled with every flavor of Rails on Earth? I have no idea. It's quite large, and specialized.) But here you are, already on a first-name basis with a world-class web scraper. Want to keep tabs on something that doesn't have an RSS feed? Aggregate content from a whole category of sites? Amass training data for your latest NLP project? You've come to the right place.

Nokogiri comes with tools for parsing two main formats, HTML and XML, and for querying them with CSS selectors or XPath. I used a combination of the HTML parser and the CSS tools. If you like JavaScript, a Nokogiri object gives you the equivalent of document.querySelectorAll on documents not your own. Then you get to add as much Ruby sugar on top as you like. If that doesn't sound delicious, you are reading the wrong blog post.

fig 1. Ruby sugar, anyone?

The first thing you'll need is to request a copy of the page. I used a gem called HTTParty, for which I took the onerous step of adding gem 'httparty' to my Gemfile. Requesting with HTTParty looks like this:

Tricky. Bear with me here: loading this into a Noko object is kinda gnarly.

Oh wait, that was incredibly easy. Just to be clear, we told Nokogiri to parse the HTTParty output as HTML, and we stored the result in an instance variable. Why an instance variable? Let's take a moment to reflect on our project setup.

I recommend separating the scraper object from the model (or models) you use to store the results. In the scraper, you can keep:

scraping methods

the URL of interest

your Nokogiri page object

your scraped "seed data"

Then you can use the seed data to generate the result objects in your database. Keeping these separate means you're not wasting a bunch of database columns in either model, or giving your results methods they'll never use.
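The scraper side of that split can be sketched as a plain Ruby class; the names here (RecipeScraper, seed_data) are my own illustrations, not from any gem:

```ruby
# Rough sketch of the scraper object: it owns the URL, the
# Nokogiri page, and the scraped seed data, keeping all of
# that out of your result models.
class RecipeScraper
  attr_reader :url, :page, :seed_data

  def initialize(url)
    @url = url          # the URL of interest
    @seed_data = {}     # scraped results, waiting to seed the model
  end

  def scrape!
    # In the real thing:
    #   @page = Nokogiri::HTML(HTTParty.get(@url).body)
    # ...then the scraping methods fill in @seed_data.
    @seed_data
  end
end
```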

In my scraper DB objects, I kept the Noko page, recipe title, URL, yield/cook time, and arrays of strings for ingredients and instructions. Since I'm on Postgres, setting up the array attributes was a piece of cake:

$ rails g migration add_columns_to_scrapers

and in your new migration:



add_column :scrapers, :ingredients, :string, array: true, default: []
add_column :scrapers, :instructions, :string, array: true, default: []

fig 2. Cake.

If you're on another database, you might create join tables for these attributes, which is slightly less fun. But hey, you're not using Postgres, so . . . tough nuts? Please be advised, I won't be sparing on the food puns.

Anyway. Now the fun part: the scraping methods!

I was worried I'd have to construct monster conditional regex statements to parse ingredients lists. Fortunately I was working with well-designed commercial apps, with nice unique classnames. So, I used Nokogiri's CSS method to scrape my data instead.

Right after you capture your page, set a debugger so you can play with the results. OK, let's see what the whole thing looks like.

@page

fig 3. Send help pls.

Oof. This is the first "hard-looking" thing we've come across. It's really not so bad. What you're seeing is the Nokogiri object version of every single element and sub-element in the page. This is actually pretty awesome - it means you have very fine control over the results.

The recipe title was my first target. So, I opened up Chrome's Inspect Element, and got the class of the <h1> title.

There it is! The conveniently named page-title. Now we can use the .css method:

@page.css(".page-title")

Phew. Still with me? We added the . to .page-title to tell Nokogiri we want to search for that term as a class name. Well, it turns out Noko gives us a little more than we want. Here are the results of that query:

Fortunately, we can use the .text method on any of these to, well, get the text. This is a lot like using .textContent in JS.

@page.css(".page-title").text

We're getting close! All we need is Ruby's excellent .strip method, which gets rid of leading and trailing whitespace, newlines included.

@page.css(".page-title").text.strip

Ta-da! You've just scraper-Hello-Worlded.

For all you Ruby chainsmokers like me, getting the array of ingredients is a treat.

This is much the same as the above.

css("li .component-name") gets back a NodeSet of Noko objects, which we convert to one long string with .text. Then we split on newlines, and map strip onto each item from the split array. (If that syntax looks weird to you, you can write it as .map { |i| i.strip }.) That gives us a few empty strings as well, so we select the non-empties.

Result:

Win. Now I had to parse the ingredients strings for amount, measurement, and name. That is another blog post all its own, though. For now, enjoy your newfound skillz!

fig 4. Cake.

Thanks for reading. Let me know if anything needs to be cleared up or expanded on.