Create an e-book from a website with Ruby. April 22, 2015, posted in ruby, pro tip.

I’m going to spend the next two weeks without internet and I want to catch up on some reading. There are a few websites with articles I’d like to read, so I decided to create an e-book out of them.

Those articles have one thing in common — an archive page that lists them. I could chuck them into Pocket, but that wouldn’t be much fun and it would involve a lot of clicking. Let’s use Ruby instead.

The first thing to do is to scrape the list of links we’re interested in. Why not do it for this blog?

    require 'nokogiri'
    require 'open-uri'

    archive_url = "https://chodounsky.com/archive/"
    link_selector = ".content .archive li a"
    domain = "https://chodounsky.com"

    archive = Nokogiri::HTML(open(archive_url))
    links = archive.css(link_selector).map { |a| domain + a["href"] }.reverse

We used Nokogiri to parse the archive page and selected all the links to articles with a simple CSS selector. Depending on the page structure and your needs you can get more creative — the archive might be spread across multiple pages, or you might want specific ordering or filtering — but for this example we’ll keep it simple and only reverse the list so that we start with the oldest articles.
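If the archive were paginated, the loop might be sketched like this. This is a hypothetical example — the page-numbering scheme is made up, a `fetch` lambda stands in for the HTTP call, and a crude regex stands in for Nokogiri so the sketch is self-contained:

```ruby
# Hypothetical sketch of scraping a paginated archive: keep asking the
# fetch lambda for the next page until a page yields no links.
def paginated_links(fetch, domain)
  links = []
  page = 1
  loop do
    # Crude href extraction; in real code you would use Nokogiri here.
    found = fetch.call(page).scan(/href="([^"]+)"/).flatten
    break if found.empty?
    links.concat(found.map { |href| domain + href })
    page += 1
  end
  links
end

# Fake two-page archive for demonstration:
pages = ['<li><a href="/a">a</a></li>', '<li><a href="/b">b</a></li>']
fetch = ->(n) { pages[n - 1] || "" }
puts paginated_links(fetch, "https://example.com").inspect
# => ["https://example.com/a", "https://example.com/b"]
```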

After that, we’ll create a simple data container for storing an article — our new book chapter. It can generate an id for itself and format itself as an HTML string, which we’ll save to a file later.

    class Chapter
      attr_accessor :title, :content

      def initialize(title, content)
        @title = title
        @content = content
      end

      def id
        title.downcase.gsub(" ", "_").gsub(/[^0-9a-z_]/i, '')
      end

      def to_s
        <<-eos
          <?xml version='1.0' encoding='utf-8'?>
          <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
          <head>
            <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
            <title>#{title}</title>
            <style>
              img { max-width: 95%; }
            </style>
          </head>
          <body>
            #{content}
          </body>
          </html>
        eos
      end
    end
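To make the id method concrete, here is the same slug transformation on its own (the title is just an example): downcase, spaces to underscores, then strip anything that isn’t alphanumeric or an underscore.

```ruby
# The transformation Chapter#id applies, shown standalone on a sample title.
title = "Create e-book from website with Ruby!"
id = title.downcase.gsub(" ", "_").gsub(/[^0-9a-z_]/i, '')
puts id
# => create_ebook_from_website_with_ruby
```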

The next step is to scrape the content we are interested in. It’s an action, so we’ll wrap it in a service object.

    class DownloadChapter
      attr_reader :index
      attr_accessor :article_selector, :title_selector, :configuration

      def initialize
        @index = 0
        yield(self)
      end

      def call(url)
        html = raw_html(url)
        title = html.css(title_selector).text
        article = html.css(article_selector).first
        images = download_images_from(article)
        content = replace_images(article, images).to_s
        save(Chapter.new(title, content))
        @index += 1
      end

      private

      def replace_images(article, images)
        article.css("img").each_with_index { |img, index| img["src"] = images[index] }
        article.css("a").each { |a| a.replace(a.children) }
        article
      end

      def domain
        @domain ||= configuration.fetch(:domain, "")
      end

      def raw_html(url)
        html = Nokogiri::HTML(open(url))
        configuration[:normalize].call(html) if configuration[:normalize]
        html
      end

      def download_images_from(html)
        html.css("img").map do |img|
          url = domain + img["src"]
          filename = filename_prefix + "_" + url.split("/").last
          open(filename, 'wb') { |file| file << open(url).read }
          filename
        end
      end

      def filename_prefix
        '%03d' % @index
      end

      def save(chapter)
        File.open("#{filename_prefix}-#{chapter.id}.html", "w") { |f| f.write(chapter.to_s) }
      end
    end

This service is the most complex piece of this small Ruby script. We want to download the appropriate content — that means text and images. On the other hand, we want to skip comments, ads, sidebars and other distracting elements; that’s where the normalization method kicks in, but more about that later. The article content and title are defined with CSS selectors, and we have to provide the URL to scrape from.

This service has multiple responsibilities, ranging from downloading the images to saving the output to a file, but we are going to keep it in one class for the sake of simplicity. If you intend to do some serious programming, I recommend splitting the responsibilities into separate classes.
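One such split might extract image downloading into its own collaborator. This is only a sketch — the class and method names are made up for illustration, and the actual network write is left as a comment:

```ruby
# Hypothetical refactoring: pull image handling out of DownloadChapter.
class ImageDownloader
  def initialize(domain)
    @domain = domain
  end

  # Derive a local filename for the image; the real open-uri download
  # would happen where the commented line is.
  def call(src, prefix)
    filename = "#{prefix}_#{src.split('/').last}"
    # File.binwrite(filename, URI.open(@domain + src).read)
    filename
  end
end

puts ImageDownloader.new("https://chodounsky.com").call("/images/cover.png", "001")
# => 001_cover.png
```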

Let’s move to the final step and tie everything together.

    download_chapter = DownloadChapter.new do |d|
      d.article_selector = ".post"
      d.title_selector = ".post h1"
      d.configuration = {
        domain: domain,
        normalize: -> (article) { article.css("footer").each { |node| node.remove } }
      }
    end

    links.each do |url|
      download_chapter.call(url)
    end

First, we created the service and passed it a configuration. The configuration contains the domain name and the normalization method. This method is important for stripping out content we are not interested in — in this case it removes the footer with the comment section — and you can remove anything else by matching CSS selectors.

The last three lines call the service for each link we scraped from the archive page. The whole script generates HTML files, with properly linked images, inside your folder.

You might wonder where we create the final product — an actual e-book. I have a Nook, so my preferred format is EPUB. It is a zipped archive of HTML pages with a bunch of metadata files in a certain hierarchy. There are a few Ruby gems that export content into it, but I didn’t find any of them convenient enough to produce nice results compatible with my reader.
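For reference, the inside of an EPUB archive looks roughly like this (the directory and content file names vary between books; this is the common EPUB 2 convention):

```text
mimetype            # plain file containing "application/epub+zip",
                    # stored first in the zip and uncompressed
META-INF/
  container.xml     # points at the .opf package file
OEBPS/
  content.opf       # manifest of all files plus reading order (spine)
  toc.ncx           # table of contents
  001-chapter.xhtml # the chapter pages...
  images/           # ...and their images
```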

But there is an excellent e-book editor called Sigil, with which you can produce beautiful e-books with tables of contents and title pages really easily. I highly recommend it, and it is available for all major operating systems.

Oh, and if you own a Kindle and are after the MOBI format, don’t despair. EPUB and MOBI are convertible between each other, and you can use Calibre for that job.
