But there will be many instances when you'll need to parse raw HTML. The Ruby gem Nokogiri makes reading raw HTML as easy as reading crack-parsed XML and JSON.

In the previous chapter, we saw how to use the web inspector to intercept raw data files. This allows us to read from them directly rather than deal with the data in HTML format.

You can also use the RestClient gem as we've done before. All the Nokogiri::HTML constructor needs is raw HTML as a string.

What open-uri does for us is encapsulate all the work of making an HTTP request into the open method, making the operation as simple as opening a file on our own hard drive.

If the webpage is live on a remote site, like http://en.wikipedia.org/, then you'll want to include the open-uri module, which is part of the standard Ruby distribution but must be explicitly required:

The Nokogiri::HTML constructor takes in the opened file's contents and wraps them in a special Nokogiri data object.

If the webpage is stored as a file on your hard drive, you can pass it in like so:

Passing the contents of a webpage to the Nokogiri parser is not much different from opening a regular text file.

For the remainder of this section, assume that the first two lines of every script are:

Hopefully, this step is as painless as typing gem install nokogiri. If not, start Googling for the error message that you're getting. In a later edition of this book, I'll try to go into more detail on the installation process. But for now, I'm just going to wish you godspeed on this task.

Unfortunately, it can be a pain to install because it has various other dependencies, libxml2 among them, that may or may not have been correctly installed on your system.

The Nokogiri gem is a fantastic library that serves virtually all of our HTML scraping needs. Once you have it installed, you will likely use it for the remainder of your web-crawling career.

Nokogiri and CSS selectors

CSS – Cascading Style Sheets – is how web designers define the look of a group of HTML elements. It has its own syntax but can be mixed in with HTML (the typical use case, though, is to load CSS files externally from the HTML, so that web designers can work on the CSS separately).

Without CSS, this is how you would make all the <a> elements (i.e. the links) the color red on a given webpage:

    <p>
      You can <a color="red" href="http://apple.com">click here</a> to get to Apple's website.
      Click here to get to <a color="red" href="http://www.microsoft.com">Microsoft</a>.
      <br>
      Or you can visit w3.org to <a color="red" href="http://www.w3.org">learn more</a> about the World Wide Web.
    </p>

This is the resulting effect:

You can click here to get to Apple's website. Click here to get to Microsoft.

Or you can visit w3.org to learn more about the World Wide Web.

The use of CSS allows designers to apply a style across a group of elements, thus eliminating the need to define the styles of every HTML element. This example shows how a CSS selector targets all <a> elements in a single line:

    <style>
      a:link{ color:red }
    </style>
    <p>
      You can <a href="http://apple.com">click here</a> to get to Apple's website.
      Click here to get to <a href="http://www.microsoft.com">Microsoft</a>.
      <br>
      Or you can visit w3.org to <a href="http://www.w3.org">learn more</a> about the World Wide Web.
    </p>

What do style tags have to do with web scraping? Nokogiri's css method allows us to target individual HTML elements, or groups of them, using CSS selectors. No worries if you're not an expert on CSS. It's enough to recognize the basic syntax.

We'll be working with this simple example webpage. If you view its source, you'll see this markup:

    <html>
      <head><title>My webpage</title></head>
      <body>
        <h1>Hello Webpage!</h1>
        <div id="references">
          <p><a href="http://www.google.com">Click here</a> to go to the search engine Google</p>
          <p>Or you can <a href="http://www.bing.com">click here to go</a> to Microsoft Bing.</p>
          <p>Don't want to learn Ruby? Then give <a href="http://learnpythonthehardway.org/">Zed Shaw's Learn Python the Hard Way</a> a try</p>
        </div>
        <div id="funstuff">
          <p>Here are some entertaining links:</p>
          <ul>
            <li><a href="http://youtube.com">YouTube</a></li>
            <li><a data-category="news" href="http://reddit.com">Reddit</a></li>
            <li><a href="http://kathack.com/">Kathack</a></li>
            <li><a data-category="news" href="http://www.nytimes.com">New York Times</a></li>
          </ul>
        </div>
        <p>Thank you for reading my webpage!</p>
      </body>
    </html>

A table of syntax

Here's a convenient table that shows all the syntax I'll cover in this section. The columns give a description of the selection, the syntax that performs it, and the elements or values it selects from the example page.

Assume that this code has been run before each of the syntax calls:

    require 'rubygems'
    require 'nokogiri'
    require 'open-uri'

    PAGE_URL = "http://ruby.bastardsbook.com/files/hello-webpage.html"
    # Ruby 3.0+ requires URI.open; older versions also accept open(PAGE_URL)
    page = Nokogiri::HTML(URI.open(PAGE_URL))

| Description | Syntax | Selection results |
|---|---|---|
| The <title> element | page.css('title') | <title>My webpage</title> |
| All <li> elements | page.css('li') | The four <li> elements inside the <ul> of <div id="funstuff"> |
| The text of the first <li> element | page.css('li')[0].text | YouTube |
| The URL of the link inside the second <li> element | page.css('li a')[1]['href'] | http://reddit.com |
| The <a> elements inside an <li> that have a data-category of news | page.css("li a[data-category='news']") | The Reddit and New York Times links |
| The <div> element with an id of "funstuff" | page.css('div#funstuff')[0] | The whole <div id="funstuff"> element, including its <p> and <ul> |
| The <a> elements nested inside the <div> element that has an id of "references" | page.css('div#references a') | The Google, Bing and Learn Python the Hard Way links |

Note that the href and data-category attributes belong to the <a> elements, not to the <li> elements that contain them, so the selectors for those rows target the links.

The rest of this chapter explains the selectors in a little more detail. But feel free to refer back to this table. Knowing the CSS selectors is just a matter of a little memorization.

Selecting an element

Simply pass the name of the element you want into the Nokogiri document object's css method:

    # Ruby 3.0+ requires URI.open; older versions also accept open(PAGE_URL)
    page = Nokogiri::HTML(URI.open(PAGE_URL))
    puts page.css("title")[0].name   # => title
    puts page.css("title")[0].text   # => My webpage

The css method does not return the text of the target element, i.e. "My webpage". It returns an array-like object: more specifically, a Nokogiri::XML::NodeSet, which is a collection of Nokogiri::XML::Element objects. These Element objects have a variety of methods, including text, which does return the text contained in the element:

    puts page.css("title")[0].text   # => My webpage

The name method simply returns the name of the element, which we already know since we specified it in the css call: "title".

    puts page.css("title")[0].name   # => title

Note that even though there is only one title element, the method returns it as an array of one element, so we still need to specify the first element using array notation.

Get an attribute of an element

One of the most common web-scraping tasks is extracting URLs from links, i.e. anchor tags: <a>. The attributes of an element can be read using Hash-style notation:

    # set PAGE_URL to point to where the page exists
    # Ruby 3.0+ requires URI.open; older versions also accept open(PAGE_URL)
    page = Nokogiri::HTML(URI.open(PAGE_URL))
    links = page.css("a")
    puts links.length     # => 7
    puts links[0].text    # => Click here
    puts links[0]["href"] # => http://www.google.com

Here's what that first anchor tag looks like in markup form:

    <a href="http://www.google.com">Click here</a>