Scraping a Web Page or Writing Automated System Tests Using Ruby

I have been asked many times to develop a script that scrapes a specific Web page. It is a very useful technique when you want to get data from a Web page that offers neither an export functionality nor an API for fetching the data. It is equally useful when you want to write Quality Assurance or test automation scripts that make sure a Web page behaves as designed.

Disclaimer: Whether it is legal to extract data from a public Web page using automated scripts is out of the scope of this blog post. You can read many articles on the Web about this issue. I suggest that you first read the Terms and Conditions of Service of the entity that owns the Web page; these might tell you whether you are allowed to scrape it or not. If you don't find that information there, you might want to contact the entity directly. If you can't get an answer on whether they allow scraping, then you had better not do it. IMHO, here is a very good article on whether it is legal or not.

Source Code

The source code for this blog post can be found here.

Tools

Which tools can one use in order to write such a script? There are plenty out there. We are going to use Ruby and a couple of Ruby libraries that ease our work in parsing the DOM of the HTML page.

Since this is going to be a hands-on blog post, let's start with a new Ruby project.

Create a New Ruby Project

I will create a new Ruby project using RubyMine, which is my favourite IDE for Ruby development. In the new project, which is backed by rvm for Ruby version management, I will create a Gemfile with the gems that I will need for this project.

Let's start with Open URI

Open URI belongs to the Ruby standard library; it is a wrapper around Net::HTTP, another Ruby standard library. Open URI has a very easy interface that allows us to fetch the contents of a page. (Note that recent Ruby versions deprecate calling Kernel#open on a URL in favour of URI.open; the examples below use the classic open call.)

Create the file students_to_csv.rb and put the following content into it:

require 'open-uri'

PAGE_TO_VISIT = 'http://fakedata.techcareerbooster.com/exercises_and_code/students'

content = open(PAGE_TO_VISIT).read

The above code fetches the contents of the page http://fakedata.techcareerbooster.com/exercises_and_code/students and stores them as a string in the variable content.

Here is a screenshot of the page that this program fetches back:

As you can see, the page contains a table of student entries. What we want to do is to get this table into a CSV file.

If you try the above program and then inspect the contents of the content variable, you will see a long string like this:

1. <!DOCTYPE html>
2. <html lang='en'>
3. <head>
4. <meta content='text/html; charset=UTF-8' http-equiv='Content-Type'>
5. <title>TechcareerboosterFakeData</title>
6. <meta name="csrf-param" content="authenticity_token"/>
7. <meta name="csrf-token"
8. content="G/gyDp2jSF7k/4tKmD8EQbQEyekKEv0mlvtH0uBdj1ZI6BqNY0k6jNaj7QZZWEA50yGH/8b6BcSFRa9Fx35Ekg=="/>
9. <link rel="stylesheet" media="all"
10. href="/assets/application-40cc2c09383c6a6c3155e6c4611ff7d577685f2578263e33752a3c988738598c.css"/>
11. <script src="/assets/application-c1b87bfdb7d9880e1e46539da912553c4efd1966a6f93ccec006aafb63416815.js"></script>
12. </head>
13. <body>
14. <nav class='navbar navbar-inverse navbar-fixed-top'>
15.   <div class='container'>
16.     <div class='navbar-header'>
17.       <button class='navbar-toggle collapsed' data-aria-controls='navbar' data-aria-expanded='false'
18.         data-target='#navbar' data-toggle='collapse' type='button'>
19.         <span class='sr-only'>Toggle navigation</span>
20.         <span class='icon-bar'></span>
21.         <span class='icon-bar'></span>
22.         <span class='icon-bar'></span>
23.       </button>
24.       <a class="navbar-brand" href="/">Fake Data</a>
25.     </div>
26.     <div class='collapse navbar-collapse' id='navbar'>
27.       <ul class='nav navbar-nav'></ul>
28.     </div>
29.   </div>
30. </nav>
31.
32. <div class='container'>
33. <h1>Students</h1>
34. <ul class="pagination">
35.
36.
37.   <li class="page active">
38.     <a href="/exercises_and_code/students">1</a>
39.   </li>
40.
41.   <li class="page">
42.     <a rel="next" href="/exercises_and_code/students?page=2">2</a>
43.   </li>
44.
45.   <li class="page">
46.     <a href="/exercises_and_code/students?page=3">3</a>
47.   </li>
48.
49.   <li class="page">
50.     <a href="/exercises_and_code/students?page=4">4</a>
51.   </li>
52.
53.   <li class="page">
54.     <a href="/exercises_and_code/students?page=5">5</a>
55.   </li>
56.
57.   <li class="page gap disabled"><a href="#" onclick="return false;">…</a></li>
58.   <li class="next_page">
59.     <a rel="next" href="/exercises_and_code/students?page=2">Next ›</a>
60.   </li>
61.
62.   <li class="last next"><a href="/exercises_and_code/students?page=10">Last »</a>
63.   </li>
64.
65. </ul>
66.
67. <table class='table table-striped table-bordered table-hover table-condensed'>
68.   <thead>
69.     <tr>
70.       <th>ID</th>
71.       <th>Name</th>
72.       <th>Email</th>
73.       <th colspan='4'>Address</th>
74.     </tr>
75.   </thead>
76.   <tbody>
77.     <tr>
78.       <td>99a636da1e333a49c996512ca9a52929</td>
79.       <td>Mrs. Hilton Emmerich</td>
80.       <td>desmond.tillman@ernserhackett.info</td>
81.       <td>3640 Rudolph Rest</td>
82.       <td>South Halleside</td>
83.       <td>63062-1867</td>
84.       <td>Mongolia</td>
85.     </tr>
86.     <tr>
87.       <td>fac3f742b5ad0e8663df2920632af4e7</td>
88.       <td>Robb Wolf</td>
89.       <td>alexandre@schmeler.info</td>
90.       <td>40214 Anderson Island</td>
91.       <td>Hansenburgh</td>
92.       <td>30386-4403</td>
93.       <td>Romania</td>
94.     </tr>
95.     <tr>
96.       <td>7931a70d4f7573974a1138029e7fc1a0</td>
97.       <td>Marshall Windler Jr.</td>
98.       <td>dandre_feil@brekke.info</td>
99.       <td>2795 Shayne Crest</td>
100.       <td>Denesikburgh</td>
101.       <td>10464</td>
102.       <td>Anguilla</td>
103.     </tr>
104.     <tr>
105.       <td>9d26a9b347235d8723b9623972e614ee</td>
106.       <td>Ms. Fernando Hermann</td>
107.       <td>abigail@carroll.com</td>
108.       <td>777 Tara Springs</td>
109.       <td>Margaretberg</td>
110.       <td>34350</td>
111.       <td>Israel</td>
112.     </tr>
113.     <tr>
114.       <td>6d96472ebf067072342adf822ffc36cb</td>
115.       <td>Ms. Brandyn Jerde</td>
116.       <td>felicia_bosco@mullertromp.co</td>
117.       <td>759 Jed Locks</td>
118.       <td>Cierrahaven</td>
119.       <td>34068-0207</td>
120.       <td>Brazil</td>
121.     </tr>
122.     <tr>
123.       <td>b3a1b60c368cb589f5f3dda92136a9f6</td>
124.       <td>Leila McClure</td>
125.       <td>jonatan_wyman@bernier.info</td>
126.       <td>49623 Wiza Square</td>
127.       <td>Lake Felixburgh</td>
128.       <td>45008</td>
129.       <td>Albania</td>
130.     </tr>
131.     <tr>
132.       <td>bde0f9d0fb76a94b776bc9327735119c</td>
133.       <td>Mateo Hackett MD</td>
134.       <td>willis_botsford@oreilly.com</td>
135.       <td>86636 Johns Road</td>
136.       <td>Angelinafurt</td>
137.       <td>19919</td>
138.       <td>French Guiana</td>
139.     </tr>
140.     <tr>
141.       <td>6891494222a1d03b61b37964116e1796</td>
142.       <td>Mrs. Lionel Jerde</td>
143.       <td>maymie.prosacco@bauchhowe.com</td>
144.       <td>624 Glover Lights</td>
145.       <td>Kiehnville</td>
146.       <td>73039-2529</td>
147.       <td>Egypt</td>
148.     </tr>
149.     <tr>
150.       <td>d40c23d2007656cce22232ab0ac91f61</td>
151.       <td>Mr. Maynard Roberts</td>
152.       <td>carleton_koepp@heel.org</td>
153.       <td>15058 Winston Mission</td>
154.       <td>Berniermouth</td>
155.       <td>78170</td>
156.       <td>Paraguay</td>
157.     </tr>
158.     <tr>
159.       <td>60614ade94b3bb069e6c545a6c96ed26</td>
160.       <td>Nyah Daugherty</td>
161.       <td>emie.welch@christiansen.net</td>
162.       <td>8869 Destiney Crossroad</td>
163.       <td>North Rafael</td>
164.       <td>94839</td>
165.       <td>Netherlands</td>
166.     </tr>
167.   </tbody>
168. </table>
169.
170. <footer>
171.   <a href="https://www.techcareerbooster.com">Courtesy of Tech Career Booster</a>
172. </footer>
173. </div>
174. </body>
175. </html>

Now that we have the source of the page at hand, we can parse lines 67 to 168, which contain the students' data. We could definitely do that with custom string parsing code, but there is a much better alternative, called Nokogiri.
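For contrast, here is what custom string parsing might look like on a single, made-up table row. A regular expression works for this trivial case, but it breaks down quickly on nested markup, attributes, and whitespace, which is why a real parser is preferable:

```ruby
# A made-up table row, just for illustration.
row = '<tr><td>1</td><td>Alice</td></tr>'

# Scan for every <td>...</td> pair and capture the text inside.
cells = row.scan(%r{<td>(.*?)</td>}).flatten

puts cells.inspect   # ["1", "Alice"]
```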

Let's Install Nokogiri

Nokogiri is a very useful Ruby library that helps you parse HTML and XML content. Let's add it to our Gemfile and then run bundle.

# Gemfile
#
source 'https://rubygems.org'

gem 'nokogiri'

Parsing With Nokogiri

With Nokogiri installed, the parsing of the HTML string is an easy task. You only need to know some basic commands of the Nokogiri library, and use your CSS selector knowledge to select the elements of the DOM that you want to get content from.

Here is the students_to_csv.rb code that parses the content and generates the CSV file students.csv :

1.  require 'open-uri'
2.  require 'nokogiri'
3.  require 'csv'
4.
5.  PAGE_TO_VISIT = 'http://fakedata.techcareerbooster.com/exercises_and_code/students'
6.
7.  content = open(PAGE_TO_VISIT).read
8.
9.  html_doc = Nokogiri::HTML(content)
10.
11. csv_headers = %w[id name email street city zip_code]
12.
13. CSV.open('students.csv', 'wb') do |csv|
14.   csv << csv_headers
15.
16.   html_doc.css('table tbody tr').each do |student_row|
17.     id = student_row.at_css('td:nth-child(1)').content
18.
19.     name = student_row.at_css('td:nth-child(2)').content
20.
21.     email = student_row.at_css('td:nth-child(3)').content
22.
23.     street = student_row.at_css('td:nth-child(4)').content
24.
25.     city = student_row.at_css('td:nth-child(5)').content
26.
27.     zip_code = student_row.at_css('td:nth-child(6)').content
28.
29.     csv << [id, name, email, street, city, zip_code]
30.   end
31. end

Let's see how Nokogiri is helping us here to extract the students' details from the page with the table of students:

On line 9, we tell Nokogiri that we have an HTML document stored inside the content variable. On line 16, we use the method #css, which returns all the elements that match the given selector. With #css('table tbody tr') we get back a collection of Nokogiri elements that we can process one by one; these are the rows of the table with the students' data, and the each call on line 16 starts the loop over them. The method #at_css is similar to #css, but we use it when we expect a single matching element, so we have access to the returned element immediately. Calling #content on it returns its text. In the loop, we use this technique to get the content of each td element of each tr of the table. Finally, on line 29, we write the data to the CSV file.

Pretty simple, isn't it?

If you run the above program, you will get a students.csv file in the root folder of your project. The file will contain data like this:

id,name,email,street,city,zip_code
433c0bfbc9b3f1602b468e86f9028753,Claire Mante II,lucinda.marks@mohrcrooks.com,326 Tyrese Wall,Robertside,74642
c122449fc335cbbe72be4131d40e15ad,Rogers Morissette,dorian@smith.co,49920 Bruen Haven,Haneside,44515-3732
99cd0bc453d36844afb6ae634dbbf005,Quentin Tillman,elvis.powlowski@funk.biz,457 Roy Trace,Sporerland,39794-6014
ff28eaf8f19a07baa074017f83c896a3,Claudie Ondricka,sven@thompson.name,8771 Cecile Isle,McCluremouth,49402
cae2e353ee3b144fa04a91f957ab6d31,Eldora Volkman,vincent@osinskideckow.net,30992 Marcos Well,South Paigeburgh,88045-6077
514e9ee2b98c08f0a5f8e0d1187746bc,Elva Rowe PhD,louie.macgyver@ko.co,52534 Barrows Valley,New Cletastad,69315-7244
708a6635f35619b7bd4c738029991a76,Dr. Katlyn Bergstrom,weldon@mohrhane.io,988 Damian Union,Mariefurt,26071-5244
29e97bedd258c27a2b07e034488552d5,Hayden Hansen,derrick@boyerolson.org,6785 Becker Road,Port Allan,94317
fb5814f725d610e49231ca437b3b2e9d,Kelley Lebsack,amara@runolfsdottir.net,9005 Predovic Divide,West Sylviamouth,65209
33ce08d9083bb7b5432b84060762c203,Darren Sporer,kyleigh@hirthe.io,63481 Goodwin Meadows,Medhurstside,12432

Note: The data the page returns is random fake data.
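The CSV class used above comes from Ruby's standard library. Here is a minimal sketch of how it writes rows (the file name below is arbitrary, and a temporary directory is used so the sketch is self-contained):

```ruby
require 'csv'
require 'tmpdir'

# Write a header row and one data row into a temporary file.
path = File.join(Dir.mktmpdir, 'demo.csv')

CSV.open(path, 'wb') do |csv|
  csv << %w[id name]
  csv << ['1', 'Alice']
end

puts File.read(path)
# id,name
# 1,Alice
```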

The limitation of the students_to_csv.rb program, as it stands, is that we only download the first page of the data. Is there a way to download the other pages too?

This is not difficult, as long as we fetch the next pages with the same technique. We only have to amend the URL with the page query param and repeat until we get a page with no data.
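The loop can be sketched in isolation with a stand-in fetcher. The fetch_page lambda below is hypothetical; it simply stands in for the open(...) call and returns an empty array once we are past the last page:

```ruby
# Hypothetical pages of data; a real fetcher would hit the Web page.
DATA_PAGES = [%w[a b], %w[c d], %w[e]].freeze
fetch_page = ->(page) { DATA_PAGES[page - 1] || [] }

page = 1
all_rows = []

# Fetch page after page until a page comes back empty.
begin
  rows = fetch_page.call(page)
  all_rows.concat(rows)
  page += 1
end while rows.count > 0

puts all_rows.inspect   # ["a", "b", "c", "d", "e"]
```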

Here is the new version of the program:

require 'open-uri'
require 'nokogiri'
require 'csv'

PAGE_TO_VISIT = 'http://fakedata.techcareerbooster.com/exercises_and_code/students'

content = open(PAGE_TO_VISIT).read

html_doc = Nokogiri::HTML(content)

csv_headers = %w[id name email street city zip_code]

CSV.open('students.csv', 'wb') do |csv|
  csv << csv_headers

  page = 1

  begin
    student_rows = html_doc.css('table tbody tr')
    student_rows.each do |student_row|
      id = student_row.at_css('td:nth-child(1)').content
      name = student_row.at_css('td:nth-child(2)').content
      email = student_row.at_css('td:nth-child(3)').content
      street = student_row.at_css('td:nth-child(4)').content
      city = student_row.at_css('td:nth-child(5)').content
      zip_code = student_row.at_css('td:nth-child(6)').content

      csv << [id, name, email, street, city, zip_code]
    end

    page += 1

    content = open("#{PAGE_TO_VISIT}?page=#{page}").read
    html_doc = Nokogiri::HTML(content)
  end while student_rows.count > 0
end

Run the above program, and you will see that you get 100 students in your CSV file.

Form Submission Using Capybara

Visiting a page and scraping its content is very easy with Nokogiri. But what about form submissions? What if we want, for example, to automatically test that a Web page of ours, which includes a form for the user to submit, works as expected?

Here is where Capybara comes into play. It is a very useful tool that allows us to parse the content of a page, as well as to fill in and submit a form.

Let's work with Capybara to submit data on the following Student Registration Form.

Install Capybara and Selenium Webdriver

First, add the capybara and selenium-webdriver gem references to your Gemfile and run bundle. You will not need to have nokogiri explicitly referenced in your Gemfile any more; capybara depends on nokogiri, and hence nokogiri will be included in the bundle anyway.

# Gemfile
#
source 'https://rubygems.org'

gem 'capybara'
gem 'selenium-webdriver'

The gem selenium-webdriver is needed because Capybara's JavaScript driver uses Selenium to drive a real browser, which is how we will interact with remote pages like the one we are going to access.
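By default, Capybara's :selenium driver opens a visible browser window. If you would rather run headlessly, one common approach is to register a custom driver; the sketch below is only an example under assumptions (the driver name :headless_chrome is arbitrary, and the exact options API can vary between selenium-webdriver versions):

```ruby
require 'capybara/dsl'
require 'selenium-webdriver'

# Register a driver that runs Chrome in headless mode.
# The name :headless_chrome is arbitrary.
Capybara.register_driver :headless_chrome do |app|
  options = Selenium::WebDriver::Chrome::Options.new
  options.add_argument('--headless')

  Capybara::Selenium::Driver.new(app, browser: :chrome, options: options)
end

# Make it the driver that Capybara.javascript_driver resolves to.
Capybara.javascript_driver = :headless_chrome
```

This is a configuration fragment; it only takes effect once a script actually visits a page.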

Script To Fill In Form

Let's confirm that everything is set up correctly. We will start with a first version of the script as follows:

1.  require 'capybara/dsl'
2.
3.  include Capybara::DSL
4.
5.  Capybara.current_driver = Capybara.javascript_driver
6.  Capybara.app_host = 'http://fakedata.techcareerbooster.com'
7.
8.  visit '/exercises_and_code/students/new'
9.
10. save_screenshot 'screenshot.png'

Before you run the above script, here are some notes about it:

On line 1, we require the capybara/dsl file. This, together with line 3, on which we include Capybara::DSL, allows us to use the Capybara DSL, i.e. the Capybara methods for parsing the page and submitting form data. On line 5, we tell Capybara to use its default JavaScript driver in order to interact with remote pages; otherwise, Capybara uses the :rack_test driver, which is useful when writing automated tests for Rack applications. On line 6, we tell Capybara which host is the remote host.

That finishes the setup. Then we can start using the DSL to interact with the remote page.

On line 8, we call #visit, which takes the path we want to visit. And on line 10, we use, for debugging purposes, the #save_screenshot command, which takes a filename as an argument and generates a screenshot of the page the script is currently on.

Try to run the above program and you will see the screenshot file screenshot.png generated in the folder of the project. It will be this:

Everything seems to be working fine. Now, let's use Capybara DSL to fill in the student registration form and submit it. Here is the new version of the student_registration_form.rb file:

1.  require 'capybara/dsl'
2.
3.  include Capybara::DSL
4.
5.  Capybara.current_driver = Capybara.javascript_driver
6.  Capybara.app_host = 'http://fakedata.techcareerbooster.com'
7.
8.  visit '/exercises_and_code/students/new'
9.
10. fill_in 'Name', with: 'John'
11. fill_in 'Email', with: 'john@mailinator.com'
12. fill_in 'Street number', with: '46232 Connelly Neck'
13. fill_in 'City', with: 'North Maudie'
14. fill_in 'Zip code', with: '62178'
15. select 'Bolivia', from: 'Country'
16.
17. click_button 'Create Student'
18.
19. sleep 3
20.
21. save_screenshot 'screenshot.png'

Between lines 10 and 15, we use the Capybara DSL methods for filling in a form. When you have to fill in a text input, the #fill_in method is the one to use; when you need to select an option from a select box, use the #select method. You can read more about this group of commands here.

On line 17, we click the button Create Student with the Capybara method #click_button .

Before we save the screenshot, we sleep for 3 seconds, just to give the new page some time to appear.

If you run the program above, you will have a new screenshot.png file in your project folder. That file would be something like this:

The above confirms that we have done the work correctly!

But instead of relying on screenshots to confirm that everything is OK, we can use the Capybara DSL to confirm that the success message exists on the page displayed. Here is the new version of the script, which does not use a screenshot any more. Note that the sleep is not necessary here, because Capybara will wait for the success message to appear anyway.

require 'capybara/dsl'

include Capybara::DSL

Capybara.current_driver = Capybara.javascript_driver
Capybara.app_host = 'http://fakedata.techcareerbooster.com'

visit '/exercises_and_code/students/new'

fill_in 'Name', with: 'John'
fill_in 'Email', with: 'john@mailinator.com'
fill_in 'Street number', with: '46232 Connelly Neck'
fill_in 'City', with: 'North Maudie'
fill_in 'Zip code', with: '62178'
select 'Bolivia', from: 'Country'

click_button 'Create Student'

find('.alert-success', text: 'Right! Student has been registered')

The last command, find, tries to locate the flash message that appears on the page to confirm that the registration was successful. This is enough to verify that the submission worked. And, again, we don't have to include a sleep statement: find keeps retrying until the element appears or a timeout elapses.
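That retry timeout is configurable through a single Capybara setting; here is a small configuration fragment (the value 5 is just an example):

```ruby
require 'capybara'

# Capybara retries #find and the other finders until the element
# appears or this many seconds elapse; the default is 2 seconds.
Capybara.default_max_wait_time = 5
```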

Capybara For Scraping

Can I use Capybara instead of Nokogiri for scraping a Web page? Yes, and it is actually recommended when the Web page includes JavaScript-generated content, since Nokogiri only sees the raw HTML that the server returns.

Closing Note

Nokogiri, Capybara, and selenium-webdriver are useful tools for writing automated tests as well as for scraping Web pages (whenever you are allowed to do so, of course). Popular test automation tools like RSpec and Cucumber, and Web frameworks like Ruby on Rails, integrate seamlessly with Capybara, and you will definitely meet these tools if you happen to land a job in the Ruby world.

Please share your thoughts in the comments section below. Let me know, as I always learn a lot from you.

Finally, I want to mention that on our Full Stack Web Developer course we teach test automation with all these tools. This is a Mentor-supported course that you pay as you go. Your Mentor is assigned to you and evaluates your work and your progress, making sure that you improve with every step that you take.