04 Sep 2014

A Scraper's Toolkit: Redis

In my opinion, Redis is now the Swiss Army knife for any developer writing a scraper. I can't remember a sizeable scraping project I started in the past year that didn't involve Redis somehow.

Queueing

Queueing isn't particularly tied to web scraping; it's just a necessary part of it. Chances are you want to parallelize your scraping, and Redis provides a great way to do this. You can have multiple worker processes blocking on a list with BLPOP, each retrieving and performing work as it arrives. Redis serves the worker that has been blocking the longest, which means all your worker processes get their fair share of jobs. RQ is a great job queue / worker system for this.
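As a rough illustration of the BLPOP pattern, here is a minimal sketch. It assumes `client` is a redis-py `redis.Redis` instance (or anything exposing the same `rpush` / `blpop` methods); the key name `scrape:jobs` and the JSON job format are made up for the example, and RQ handles all of this for you in practice.

```python
import json

QUEUE_KEY = "scrape:jobs"  # hypothetical key name, pick your own


def enqueue(client, url):
    """Producer side: append a job to the tail of the list."""
    client.rpush(QUEUE_KEY, json.dumps({"url": url}))


def work_once(client, handler, timeout=5):
    """Worker side: block for up to `timeout` seconds waiting for a job.

    Returns True if a job was handled, False if the wait timed out.
    """
    item = client.blpop([QUEUE_KEY], timeout=timeout)
    if item is None:  # BLPOP returns nothing when the timeout expires
        return False
    _key, payload = item  # BLPOP replies with (list name, popped value)
    handler(json.loads(payload))
    return True
```

Each worker process would simply call `work_once(client, scrape_page)` in a loop, where `scrape_page` is whatever function does your actual fetching.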

There are downsides to consider, though. Keeping everything in-process is faster for basic operations (even if Redis is really quick). On the flip side, depending on the implementation language you are using, an in-process approach may limit you to a single CPU, whereas a multi-process approach wouldn't. Overall, I find it unlikely that Redis would be your bottleneck.

Priority Crawling

Sometimes your crawl isn't about speed, it's about quality / relevance. If you only want to scrape 500 pages of a website, then which 500 do you choose to get the highest quality crawl?

You could opt to just go for a breadth-first crawl; this isn't a bad method [PDF]. However, logic suggests that the most important pages have the most links pointing to them. So why don't we crawl in that order?

Redis makes this easy for us: we can use a sorted set to keep track of how many times we've found a link pointing to a particular URL (use the URL as the member and increment its score with ZINCRBY every time you see a link to it), then choose which page to crawl next by selecting the member with the highest score. When a page has been crawled, simply set its score to -inf to place it at the bottom of the set so you don't crawl it again.

Get the item with the highest score (next to crawl):

ZREVRANGEBYSCORE crawling_priorities +inf -inf WITHSCORES LIMIT 0 1

If the score that comes back is -inf, every page you know about has already been crawled.
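The sorted-set approach could look something like this in Python. This is only a sketch: it assumes `client` is a redis-py 3.0+ `redis.Redis` instance created with `decode_responses=True` (so members come back as strings), and the helper names are invented for the example.

```python
PRIORITY_KEY = "crawling_priorities"
NEG_INF = float("-inf")


def record_link(client, url):
    """Bump the score of `url` each time a link to it is found."""
    client.zincrby(PRIORITY_KEY, 1, url)


def next_url(client):
    """Return the uncrawled URL with the most inbound links, or None."""
    top = client.zrevrange(PRIORITY_KEY, 0, 0, withscores=True)
    if not top:
        return None  # nothing discovered yet
    url, score = top[0]
    if score == NEG_INF:
        return None  # the best remaining entry is already crawled
    return url


def mark_crawled(client, url):
    """Push a crawled URL to the bottom of the set."""
    client.zadd(PRIORITY_KEY, {url: NEG_INF})
```

A nice side effect of using -inf is that later increments leave the score at -inf, so a crawled page never climbs back up the set when more links to it turn up.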

Proxy Management / Limiting

Sooner or later, if you're writing scrapers, you're going to need to use proxies. There's a lot of hype around using Tor, so I'll save you the trouble: it will suck. The connections are slow and unreliable, and will generally make your life hell. Just invest in some private proxies; I highly recommend MyPrivateProxy. I have been using them for years and have referred well over 20 clients to them without even a mention of a problem.

Now, back to Redis. It's no good using proxies if you end up using the same IP over and over in quick succession, and this will happen eventually if you just pick proxies at random. We can utilize sorted sets again to store the timestamp at which each proxy was last used, and then grab the proxy that has gone unused for the longest period. Simple, but effective.
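Here is one way the proxy rotation might be sketched. It assumes `client` is a redis-py 3.0+ `redis.Redis` instance with `decode_responses=True`; the key name and function names are hypothetical. Note that the read and the update are two separate commands, so two workers can occasionally grab the same proxy at the same moment. If that matters to you, wrap the pair in a Lua script or a transaction.

```python
import time

PROXY_KEY = "proxies:last_used"  # hypothetical key name


def add_proxy(client, proxy):
    """Register a proxy with score 0 so it counts as 'never used'.

    nx=True leaves the timestamp alone if the proxy is already known.
    """
    client.zadd(PROXY_KEY, {proxy: 0}, nx=True)


def checkout_proxy(client, now=None):
    """Return the least recently used proxy and stamp it with the current time."""
    oldest = client.zrange(PROXY_KEY, 0, 0)  # lowest score = oldest timestamp
    if not oldest:
        return None
    proxy = oldest[0]
    stamp = time.time() if now is None else now
    client.zadd(PROXY_KEY, {proxy: stamp})
    return proxy
```

The `now` parameter is only there to make the function easy to test; in normal use you would call `checkout_proxy(client)` right before each request.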

Probabilistic Data Structures


Probabilistic data structures can be very useful in large-scale scraping, because keeping resource usage down is often a big concern, whereas the occasional counting error or incorrect membership query isn't a big issue.

Bloom Filters

By using a Bloom filter, you can quickly perform membership queries to check whether you have already scraped a page. This can save a lot of memory by not having to store the URL itself. The downside is a small chance of a false positive, where the filter reports a page as already scraped when it hasn't been; a Bloom filter never gives false negatives. This is often a worthwhile tradeoff. There are libraries in every major language that help you with implementing this on top of Redis. I recommend pyreBloom for Python users.
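To give a feel for what such a library does under the hood, here is a toy Bloom filter built on Redis bitmaps with SETBIT and GETBIT. This is not pyreBloom's API, just a sketch: it assumes `client` is a redis-py `redis.Redis` instance, and the class name, key name, and default sizes are illustrative rather than tuned. It derives its bit positions from a single SHA-256 digest using the common double-hashing trick.

```python
import hashlib


class RedisBloom:
    """Toy Bloom filter stored in a single Redis bitmap."""

    def __init__(self, client, key="scraped:bloom", size_bits=1 << 20, num_hashes=7):
        self.client = client
        self.key = key
        self.size_bits = size_bits
        self.num_hashes = num_hashes

    def _offsets(self, item):
        # Derive num_hashes bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so steps vary
        return [(h1 + i * h2) % self.size_bits for i in range(self.num_hashes)]

    def add(self, item):
        for offset in self._offsets(item):
            self.client.setbit(self.key, offset, 1)

    def __contains__(self, item):
        # True means "probably seen"; False means "definitely not seen".
        return all(self.client.getbit(self.key, o) for o in self._offsets(item))
```

Usage is then `bloom.add(url)` after scraping a page and `if url in bloom:` before scraping one. A real deployment would pipeline the bit operations rather than make one round trip per bit.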

HyperLogLog

If you want to count something but don't need to check for membership of items, then you should seriously check out the new data structure in Redis: the HyperLogLog. It lets you count the number of unique things you place into it using a very small amount of memory. The catch is that the count is an approximation and you can't get back the things you're counting. It's perfect for things such as counting the number of unique URLs you come across on a particular website, though.
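Counting unique URLs per site with PFADD and PFCOUNT takes only a few lines. As before, this is a sketch that assumes `client` is a redis-py `redis.Redis` instance; the key naming scheme and function names are made up for the example.

```python
def hll_key(site):
    """Hypothetical key naming scheme: one HyperLogLog per site."""
    return f"urls:seen:{site}"


def record_url(client, site, url):
    """Add a URL to the site's HyperLogLog; duplicates don't change the count."""
    client.pfadd(hll_key(site), url)


def unique_url_count(client, site):
    """Return the approximate number of distinct URLs seen for the site."""
    return client.pfcount(hll_key(site))
```

Keep in mind that the number PFCOUNT returns is an estimate, so treat it as "roughly this many" rather than an exact figure.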

Extras

I use Redis for more than the above in my web scraping tasks; these are just the most popular use cases that apply to most projects. I'd love for you guys to comment below on what other things you have come up with!

If you haven't already checked it out, I highly recommend giving my Python Web Scraping Resource a read. It covers a lot of scraping principles and ideas that I will build upon in this series.

Shameless Plug

I'm available to hire for scraping projects. Use the contact link in the sidebar if you have some data you want collecting and drop me a few details :)