Speeding up a scikit-learn workflow using a high-performance Go proxy¶

Up until now I’ve been using vcrpy to cache my requests during the data mining phase of my scikit-learn work, but I was recently introduced to an ultra-high-performance Go caching proxy, and wanted to see if I could use it for more speed-ups. I was so impressed that I wrote a Python wrapper for it.

```
pip install hoverpy --user --upgrade
```

Offlining readthedocs:

```python
from hoverpy import capture
import requests
import time


@capture("readthedocs.db", recordMode="once")
def getLinks(limit):
    start = time.time()
    sites = requests.get(
        "http://readthedocs.org/api/v1/project/?limit=%d&offset=0&format=json" % int(limit))
    objects = sites.json()['objects']
    for link in ["http://readthedocs.org" + x['resource_uri'] for x in objects]:
        response = requests.get(link)
        print("url: %s, status code: %s" % (link, response.status_code))
    print("Time taken: %f" % (time.time() - start))

getLinks(50)
```

Output:

[...] Time taken: 9.418862

Upon second invocation:

[...] Time taken: 0.093463

That’s much better: 100.78x faster than hitting the real endpoint.
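For the record, that multiplier is just the ratio of the two wall-clock times:

```python
# Speed-up factor: uncached time divided by cached (hoverpy) time.
uncached = 9.418862  # seconds, first run against the live endpoint
cached = 0.093463    # seconds, second run served from readthedocs.db
speedup = uncached / cached
print("%.2fx" % speedup)  # → 100.78x
```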

Not surprising, really. My issue with caching proxies, however, is that it’s usually the HTTPS handshake that takes the time, not fetching the data, and one of my many annoyances with vcrpy is that it won’t let me remap HTTPS requests to HTTP.

I was therefore very pleased to see remapping work perfectly in hoverpy (code below), with hoverpy wiping the floor with vcrpy: over 13x faster.

```python
import time
import hoverpy
import requests
import os

prot = "http" if os.path.isfile("hn.db") else "https"
hnApi = "%s://hacker-news.firebaseio.com/v0" % prot

with hoverpy.HoverPy(recordMode='once', dbpath='hn.db') as hp:
    print("started hoverpy in %s mode" % hp.mode())
    start = time.time()
    r = requests.get("%s/topstories.json" % (hnApi))
    for item in r.json():
        article = requests.get("%s/item/%i.json" % (hnApi, item)).json()
        print(article["title"])
    print("got articles in %f seconds" % (time.time() - start))
```

Once again, on the second run, Hoverfly steps in with a very significant speed-up. I’m very impressed with Hoverfly’s performance.

Data mining HN¶

Before we start, please note you can find the final script here. You’ll also need the data. What I also really like about Hoverfly is how fast it starts, and how fast it loads up its BoltDB database. I also like the fact that it’s configuration-free. Here’s a function you can use to offline titles for various HN sections:

```python
def getHNData(verbose=False, limit=100, sub="showstories"):
    from hackernews import HackerNews
    from hackernews import settings
    import hoverpy, time, os

    dbpath = "data/hn.%s.db" % sub
    with hoverpy.HoverPy(recordMode="once", dbpath=dbpath) as hp:
        if not hp.mode() == "capture":
            settings.supported_api_versions["v0"] = "http://hacker-news.firebaseio.com/v0/"
        hn = HackerNews()
        titles = []
        print("GETTING HACKERNEWS %s DATA" % sub)
        subs = {"showstories": hn.show_stories,
                "askstories": hn.ask_stories,
                "jobstories": hn.job_stories,
                "topstories": hn.top_stories}
        start = time.time()
        for story_id in subs[sub](limit=limit):
            story = hn.get_item(story_id)
            if verbose:
                print(story.title.lower())
            titles.append(story.title.lower())
        print("got %i hackernews titles in %f seconds" % (len(titles), time.time() - start))
        return titles
```

Data mining Reddit¶

While we’re at it, let’s put a function here for offlining subreddits. This one also includes comments:

```python
def getRedditData(verbose=False, comments=True, limit=100, sub="all"):
    import hoverpy, praw, time

    dbpath = "data/reddit.%s.db" % sub
    with hoverpy.HoverPy(recordMode='once', dbpath=dbpath, httpsToHttp=True) as hp:
        titles = []
        print("GETTING REDDIT r/%s DATA" % sub)
        r = praw.Reddit(user_agent="Karma breakdown 1.0 by /u/_Daimon_",
                        http_proxy=hp.httpProxy(),
                        https_proxy=hp.httpProxy(),
                        validate_certs="off")
        if not hp.mode() == "capture":
            r.config.api_request_delay = 0
        subreddit = r.get_subreddit(sub)
        for submission in subreddit.get_hot(limit=limit):
            text = submission.title.lower()
            if comments:
                flat_comments = praw.helpers.flatten_tree(submission.comments)
                for comment in flat_comments:
                    text += comment.body + " " if hasattr(comment, 'body') else ''
            if verbose:
                print(text)
            titles.append(text)
        return titles
```

Organising our datamines¶

Rather than sitting around hitting these endpoints, you may as well download these datasets to save yourself the time:

```
wget https://github.com/shyal/hoverpy-scikitlearn/raw/master/data.tar
tar xvf data.tar
```

And the code:

```python
subs = [('hn', 'showstories'), ('hn', 'askstories'), ('hn', 'jobstories'),
        ('reddit', 'republican'), ('reddit', 'democrat'), ('reddit', 'linux'),
        ('reddit', 'python'), ('reddit', 'music'), ('reddit', 'movies'),
        ('reddit', 'literature'), ('reddit', 'books')]


def doMining():
    titles = []
    target = []
    getter = {'hn': getHNData, 'reddit': getRedditData}
    for i in range(len(subs)):
        subTitles = getter[subs[i][0]](sub=subs[i][1])
        titles += subTitles
        target += [i] * len(subTitles)
    return (titles, target)
```

Calling doMining() caches everything, which takes a while. If you’ve downloaded and extracted data.tar, however, it shouldn’t take more than a few seconds. That’s all our data mining done. I think this is a good time to remind ourselves that a big part of machine learning is, in fact, data sanitisation and mining.

```
GETTING HACKERNEWS showstories DATA
got 54 hackernews titles in 0.099983 seconds
GETTING HACKERNEWS askstories DATA
got 92 hackernews titles in 0.160661 seconds
GETTING HACKERNEWS jobstories DATA
got 12 hackernews titles in 0.024908 seconds
GETTING REDDIT r/republican DATA
GETTING REDDIT r/democrat DATA
GETTING REDDIT r/linux DATA
GETTING REDDIT r/python DATA
GETTING REDDIT r/music DATA
GETTING REDDIT r/movies DATA
GETTING REDDIT r/literature DATA
GETTING REDDIT r/books DATA

real    0m9.425s
```
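The labelling scheme in doMining() deserves a sentence: each title’s target is simply the index of its (source, sub) pair in subs, so the classifier later predicts an index back into that list. Here is a stdlib-only miniature of that pairing, where mini_subs and fetched are hypothetical stand-ins for subs and the real fetcher functions:

```python
mini_subs = [('hn', 'showstories'), ('reddit', 'python')]

# Hypothetical stand-ins for the results of getHNData/getRedditData.
fetched = {
    ('hn', 'showstories'): ["show hn: my side project", "show hn: a tiny db"],
    ('reddit', 'python'): ["list comprehensions explained"],
}

titles, target = [], []
for i, key in enumerate(mini_subs):
    subTitles = fetched[key]
    titles += subTitles
    # every title from this sub gets the sub's index as its label
    target += [i] * len(subTitles)

print(list(zip(titles, target)))
```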

Building an HN or Reddit classifier¶

OK, time to play. Let’s build a naive Bayesian text classifier. You’ll be able to type in some text, and it’ll tell you which sub it thinks the text could have originated from. For this part, you’ll need scikit-learn:

```
pip install numpy
pip install scikit-learn
```

Test sentences:

```python
sentences = [
    "powershell and openssl compatibility testing",
    "compiling source code on ubuntu",
    "wifi drivers keep crashing",
    "cron jobs",
    "training day was a great movie with a legendary director",
    "michael bay should remake lord of the rings, set in the future",
    "hillary clinton may win voters' hearts",
    "donald trump may dominate the presidency",
    "reading dead wood gives me far more pleasure than using kindles",
    "hiring a back end engineer",
    "guitar is louder than the piano although electronic is best",
    "drum solo and singer from the rolling stones",
    "hiring a back end engineer",
    "javascript loader",
    "dostoevsky's existentialism"
]
```

Running the classifier:

```python
def main():
    titles, target = doMining()

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB

    # build our count vectoriser
    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(titles)

    # build tfidf transformer
    tfidf_transformer = TfidfTransformer()
    X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

    # classifier
    clf = MultinomialNB().fit(X_train_tfidf, target)

    print("*" * 30 + "\nTEST CLASSIFIER\n" + "*" * 30)

    # predict function
    def predict(sentences):
        X_new_counts = count_vect.transform(sentences)
        X_new_tfidf = tfidf_transformer.transform(X_new_counts)
        predicted = clf.predict(X_new_tfidf)
        for doc, category in zip(sentences, predicted):
            print('%r => %s' % (doc, subs[category]))

    predict(sentences)

    while True:
        predict([input("Enter title: ").strip()])
```

In case you’re not familiar with tokenizing, tf-idf, classification and so on, I’ve provided a link at the end of this tutorial that’ll demystify the block above.

Wrapping things up¶

You can find hoverpy’s and hoverfly’s extensive documentation here and here. This excellent and lightweight proxy was developed by the very smart people at SpectoLabs, so I strongly suggest you show them some love (I could not, however, find a donations link).

Repository for this post, with code: https://github.com/shyal/hoverpy-scikitlearn

http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

Edit: as jackschultz rightly pointed out, this article doesn’t compute the precision of the classifier. You can check his article here. I strongly recommend readers who are interested to split the dataset into training and testing sets and compute the classifier’s precision; you’ll find all the information required to do so in the scikit-learn article above. Finally, it’s also worth noting that my solution only caches the HN titles, not the HN comments, while it does cache the Reddit comments. This leads to a strong bias towards Reddit and away from HN, despite using tf-idf. So another good exercise for the reader is to cache the HN comments too.
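On the precision point: once you hold out a test set, per-class precision is just TP / (TP + FP). A stdlib-only sketch on made-up true and predicted labels (the labels are hypothetical, using the same integer-index-into-subs scheme as doMining(); in practice sklearn.metrics.precision_score does this for you):

```python
# Hypothetical ground-truth and predicted labels for a held-out test set.
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]

def precision(y_true, y_pred, label):
    # true positives: predicted as `label` and actually `label`
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    # all positives: everything predicted as `label`
    predicted = sum(p == label for p in y_pred)
    return tp / predicted if predicted else 0.0

for label in (0, 1, 2):
    print("class %d precision: %.2f" % (label, precision(y_true, y_pred, label)))
```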