June 9, 2014

Web Scraping With Nokogiri and Mechanize

Ever since I can remember I have loved mountain biking. From doing Ride the Rockies with my family at the age of 8 to riding gnarly, technical downhill trails up at Whistler Bike Park, in British Columbia, Canada, I have spent a huge chunk of my life cruising on two wheels.

Near the end of last year I decided to go from being a pure consumer of bike parts to a retailer in the industry. Within a couple of weeks I had tackled all the necessary prerequisites for a wholesale account, and my dream of being a retailer was a reality.

For the first couple of months sales were going well, the majority of them online. Unfortunately, like most small businesses, I hit my first pain point. Instead of spending my time building my brand and advertising, I was bogged down with the menial task of updating the stock of every item in my store and trying to add as many of the 23,000 available SKUs as I could. I knew there had to be a better way.

Unlike most of their competitors, my distributor had no API. Their site is a simple ecommerce site that requires a page load to see each individual product's information (e.g. description, MSRP, stock…).

It was time to automate this task.

I began with a simple Ruby on Rails app hosted on [Heroku](https://heroku.com) to reduce web request latency. Site navigation and scraping were handled by the Mechanize and Nokogiri gems. The scraper required two steps to acquire all the data needed.

In order to get to each individual product variation's page (to scrape it), I first needed the Item # of each variant from the search results page. Through some trickery, I was able to get all product groups to list in the search results.

My scraper would manually enter the URL for each of the 621 product group pages, parse the page, and insert the results into the database.

For each product group above, my script would create one product group record with an associated product for each item within it. Below is the code that tackles this task.

```ruby
def self.scrape_product_groups(pages)
  a = Mechanize.new
  # Logs into the backend to make sure I can view all details
  page = BTI.login(a)
  # Pages get passed in as an array, which allows background jobs
  # to process different pages concurrently
  #
  # Ex: BTI.scrape_product_groups([1, 2, 3]) would scrape pages 1, 2, 3
  pages.to_a.each do |page_num|
    puts "Scraping page #{page_num}"
    # Loads up the search-results page
    page = a.get("https://bti-usa.com/public/quicksearch/+/?page=#{page_num}")
    # Parses page from Mechanize to Nokogiri
    raw_xml = page.parser
    # Grabs all product rows on the page
    groupRows = raw_xml.css('.groupRow')
    # Grabs product group 'bti_id' from css and parses out junk
    groupRows.each do |item|
      bti_id = item.attributes.first.last.value
                   .gsub('groupItemsDiv__num_', '')
                   .gsub('groupItemsDiv_', '')
      # Creates or finds product group based on 'bti_id'
      pg = ProductGroup.live.where(bti_id: bti_id).first
      pg ||= ProductGroup.create(bti_id: bti_id)
      # Parses product group name
      pg.name = item.css('.groupTitleOpen').text
      puts "Updating #{pg.name} product group"
      # Builds up product group description from all bullet points
      pg.description = ""
      item.css('.groupBullets').css('li').each do |li|
        pg.description += li.text + '. '
      end
      # Iterates through every item number in the product group
      item.css('.itemNo').each do |itemNo|
        # Finds and cleans up 'bti_id'
        bti_id = itemNo.css('a').text.gsub('-', '')
        # Creates a new product for the product group if none is found
        product = Product.live
                         .where(bti_id: bti_id, product_group_id: pg.id).first
        product ||= Product.create(bti_id: bti_id, product_group_id: pg.id)
      end
      pg.save
    end
  end
end
```
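The ID cleanup in that loop is easy to sanity-check in isolation. A standalone version of the same string munging (the sample div ids are invented, but follow the prefixes the scraper strips):

```ruby
# Strips the known div-id prefixes, leaving the bare group id,
# mirroring the gsub chain in the scraper above.
def extract_bti_id(div_id)
  div_id.gsub('groupItemsDiv__num_', '').gsub('groupItemsDiv_', '')
end

extract_bti_id('groupItemsDiv_1234')       # => "1234"
extract_bti_id('groupItemsDiv__num_5678')  # => "5678"
```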

Once all these ‘item numbers’ are collected, the second stage of the scraper kicks in.

The scraper then goes to each individual product page and scrapes all the info on it. The different prices (cost and MSRP) and stock are only displayed when logged in.

The code that tackled this challenge is below:

```ruby
def BTI.parse_product_info(a, product)
  # Passes in 'a' = Mechanize.new and a product record
  page = BTI.login(a)
  # Navigates to the individual product page
  page = a.get("https://bti-usa.com/public/item/#{product.bti_id}")
  # Converts mechanize to nokogiri data
  raw_xml = page.parser
  # Loads associated product group
  pg = product.product_group
  # If the product is no longer on the site, archive both its
  # product group and product so that it will not be scraped again
  if raw_xml.css("#errorCell").any?
    pg.archive
    product.archive
    return
  end
  # Parses category from bread crumbs in header
  category_parent_name = raw_xml.css('.crumbs')
                                .css('a').first(2).last.try(:text)
  category_child_name = raw_xml.css('.crumbs')
                               .css('a').first(4).last.try(:text)
  # Finds or creates the parent and child category
  category_parent = Category.where(name: category_parent_name,
                                   parent: true).first_or_create
  category_child = Category.where(name: category_child_name)
                           .first_or_create
  # Moves the product group and/or product to an activated state
  # if they were marked as needing to be scraped
  pg.activate if pg.scraped?
  product.activate if product.scraped?
  # Adds the parent and/or child category to the
  # product group if not already categorized in it
  pg.categories << category_parent unless pg.categories.include?(category_parent)
  pg.categories << category_child unless pg.categories.include?(category_child)
  # Parses the brand out of the page
  pg.brand = raw_xml.css('.headline').css('span').text
  # Updates the product group in the database
  pg.save
  # Grabs the image record if one exists
  images = raw_xml.css(".itemTable").css("img")[1]
  # If an image exists, change the url to the largest
  # image stored on the server
  if images
    image_url = images.attributes["src"].value
                      .gsub('thumbnails/large', 'pictures')
    product.photo_url = "https://bti-usa.com" + image_url
  end
  # If the product requires special authorization to sell, mark it as such
  product.authorization_required =
    !(!!page.form_with(:action => '/public/add_to_cart') or
      !!raw_xml.search('//img/@src')
               .to_s.match('/images/stockalert.gif'))
  # Finds the model of the product by parsing out the brand name
  product.model = pg.name.gsub(pg.brand, '')
  product.save
  # Parses the different product prices (featured below)
  parse_product_price(raw_xml, product)
  # Parses out all product variations
  raw_xml.css('.itemSpecTable').css('tr').each do |variation|
    # Grabs key and value of each bullet point
    key = variation.css('.specLabel').text
    value = variation.css('.specData').text
    # Saves mpn in product MPN field
    if key == "vendor part #:"
      product.mpn = value
      product.save if product.changed?
    end
    # Parses out junk
    unless key == "" or value == "" or key == "BTI part #:" or
           key == "vendor part #:" or key == "UPC:"
      variation = Variation.where(key: key.gsub(':', ''),
                                  value: value.gsub('/', ' / ').titleize,
                                  product_id: product.id)
                           .first_or_create
    end
  end
end

def BTI.parse_product_price(raw_xml, item)
  # Grabs item name and all html containing price info
  title_bar = raw_xml.css("h3")
  name = parse_noko(title_bar).gsub("\"", "")
  tds = raw_xml.css("div#bodyDiv").css("td")
  # Resets product prices and stock
  price = 0.0
  msrp = 0.0
  sale = 0.0
  stock = 0
  # Loops through the html table and parses out price and stock info
  # Have to loop due to BTI not believing in css classes
  (0..100).to_a.each do |i|
    unless tds[i].nil?
      parsed_item = parse_noko(tds[i])
      case parsed_item
      when "price:"
        price = parse_noko(tds[i + 1], true).to_f
      when "onsale!"
        sale = parse_noko(tds[i + 1], true).to_f
      when "MSRP:"
        msrp = parse_noko(tds[i + 1], true).to_f
      when "remaining:"
        stock = parse_noko(tds[i + 1], true).to_i
      end
    end
  end
  # Updates product data and commits it to the database
  item.name = name
  item.msrp_price = msrp
  item.sale_price = sale
  item.regular_price = price
  item.stock = stock
  item.save
  # Outputs to screen
  puts " * #{name}"
  puts " *** Price - #{price}"
  puts " *** Stock - #{stock}"
  puts ""
end

# Cleans up nokogiri newline and carriage-return junk
def BTI.parse_noko(raw, with_spaces = false)
  raw_text = raw.text
  if with_spaces
    raw_text = raw_text.gsub(" ", "")
  end
  raw_text.gsub("\r", "").gsub("\n", "")
          .gsub("\t", "").gsub("$", "").gsub(",", "")
end
```

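The parse_noko helper is really just whitespace-and-currency scrubbing. The same cleanup can be sketched as a single regex (the sample input is invented):

```ruby
# Scrub currency formatting and whitespace junk from a scraped cell,
# mirroring what parse_noko does to price strings.
def clean_price(text)
  text.gsub(/[\r\n\t$, ]/, '')
end

clean_price("  $1,234.50\r\n")  # => "1234.50"
```

From there, a plain `.to_f` or `.to_i` yields the numeric price or stock count.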
This second stage parsed over 23,000 items in just under two hours.

In order to achieve this speed, I set up my app to process the scraping through multi-threaded background jobs using Sidekiq. This allowed me to make 25 concurrent page requests at a time. If one of these requests failed due to a server error on my distributor's side, the job would be re-queued and processed a few minutes later.

By using Heroku Scheduler, this task would scale up the background worker dynos when it started, scrape all the data, and turn the workers off when complete.

When it comes to a lean business model, this scraper was it. The Heroku server cost me only $9 a month, the price of upgrading from the free Postgres database. I didn't have to pay for worker dynos, since Heroku gives 750 free dyno-hours every month.

I had all the data I needed; now it was just a matter of getting the products in front of customers!

Shopify and their API

I began by trying to push to my existing WordPress shop, but after difficulties with the WooCommerce API, I knew there was a better solution.

When you need a cutting-edge ecommerce store with a simple API, Shopify is definitely the solution. Within 15 minutes I had the store set up with the correct sales tax amounts and merchant payment accounts (processed through Stripe).

I went back to my app and built a rake task to manipulate my data into Shopify’s schema.
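That translation is mostly hash-building. Here's a sketch of what mapping one scraped product group into Shopify's schema can look like; the variant field names follow Shopify's Product REST API, while the record attributes are assumed to match the scraper models above:

```ruby
# Sketch: shape one product group plus its products into the JSON
# structure Shopify's Product API accepts. Illustrative, not the
# app's actual rake task.
def to_shopify_product(group, products)
  {
    product: {
      title:     group[:name],
      body_html: group[:description],
      vendor:    group[:brand],
      variants:  products.map do |p|
        {
          sku:                  p[:bti_id],
          price:                p[:regular_price],
          compare_at_price:     p[:msrp_price],
          inventory_quantity:   p[:stock],
          inventory_management: 'shopify'
        }
      end
    }
  }
end

group = { name: 'InSet 7 Headset', description: 'Sealed bearings. ', brand: 'Chris King' }
items = [{ bti_id: 'HD1201', regular_price: 99.0, msrp_price: 155.0, stock: 3 }]
payload = to_shopify_product(group, items)
```

A POST of that payload to the store's `/admin/products.json` endpoint (e.g. via the shopify_api gem) creates the product with all of its variants in one request.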

Within a day the task was done and all 23,000 SKUs were being pushed to my store.

The store was up and the products were live. For a few months I was able to concentrate on the important things and let one rake task take care of updating stock and uploading new products.

I ended up shutting down the site and shop this month (June, 2014) due to it not being worth my time. It was incredibly difficult to turn a substantial profit due to a huge amount of competition in the industry and not being able to lower my prices below MAP.

It was a fun ride but it’s time for the next venture.

Check out the entire app: GitHub

Also, I am always looking for new ideas to work on, so let me know if you need a Ruby dev to kick ass on your project.

Wanna chat? Shoot me an email. Follow me at @matteleonard to see what I'm up to next, and check out my personal site at Mattl.co.
