If you've ever wanted to build your own "must read" list of Python-related articles, or to search through posts published a while ago, here is a simple approach to tackling the problem:

Crawl reddit's r/Python:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector


class RedditItem(Item):
    title = Field()
    url = Field()


class RpythonSpider(CrawlSpider):
    name = "rpython"
    allowed_domains = ["reddit.com"]
    start_urls = (
        'http://www.reddit.com/r/Python/new/',
    )
    # Keep track of what has already been emitted/followed.
    crawled = set()
    crawled_prev_next = set()

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        # Follow the "next" pagination link, but only once per URL.
        prev_next = hxs.select('//span[@class="nextprev"]//a').select('@href').extract()
        if len(prev_next) > 0:
            url = prev_next[-1]
            if url not in self.crawled_prev_next:
                self.crawled_prev_next.add(url)
                yield self.make_requests_from_url(prev_next[-1])

        # Emit title/url pairs for external links, skipping self-posts and duplicates.
        for link in hxs.select('//*[@id="siteTable"]//div//p[1]/a'):
            url = link.select('@href').extract()[0]
            title = link.select('text()').extract()[0]
            if len(url) and len(title) and url not in self.crawled and '/r/Python/' not in url:
                item = RedditItem()
                item['title'] = title
                item['url'] = url
                self.crawled.add(url)
                yield item

That's a very basic Scrapy spider, which you can invoke, for instance, like this:

scrapy crawl rpython -o file.json -t json
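
The dump is just a JSON list of {"title": ..., "url": ...} objects, so searching it later (the "articles published a while ago" part) is trivial. A minimal sketch, assuming the file.json from the command above; the keyword is a hypothetical example:

import json

# Load the dump produced by "scrapy crawl rpython -o file.json -t json".
with open('file.json') as f:
    posts = json.load(f)

# Print every crawled article whose title mentions a given keyword.
keyword = 'asyncio'  # hypothetical search term
for post in posts:
    if keyword.lower() in post['title'].lower():
        print("%s -> %s" % (post['title'], post['url']))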

and store all the links, with titles, that lead to external blogs/services. That way you won't miss a single article :-). Hook this script up to a database, add a published_at field, and get notified only when there is new stuff.
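
One way to do the database part is a small Scrapy item pipeline. The sketch below is my own assumption about the wiring, not part of the spider above: it stores items in an SQLite file, stamps each new row with a published_at value (here simply the time the link was first seen, since the spider doesn't scrape reddit's own timestamps), and prints only links it has never seen before; swap the print for an e-mail or whatever notification you prefer. You'd enable it via ITEM_PIPELINES in the project settings.

# pipelines.py -- a minimal sketch, assuming an SQLite file next to the project
import sqlite3
from datetime import datetime


class SqlitePipeline(object):

    def open_spider(self, spider):
        self.conn = sqlite3.connect('rpython.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS posts ('
            ' url TEXT PRIMARY KEY,'
            ' title TEXT,'
            ' published_at TEXT)')

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Insert only unseen URLs; rows that already exist are left untouched.
        cur = self.conn.execute(
            'INSERT OR IGNORE INTO posts (url, title, published_at)'
            ' VALUES (?, ?, ?)',
            (item['url'], item['title'], datetime.utcnow().isoformat()))
        if cur.rowcount:
            # A genuinely new article -- this is where a notification could go.
            print("New post: %s -> %s" % (item['title'], item['url']))
        return item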

Enjoy!