Scraping the thread: Hacker News API

The first step is getting the data. Luckily, Hacker News provides a very nice API for freely retrieving all of its content. The API has endpoints for posts, users, top posts and a few others. For this article we will use the one for posts. It’s very simple to use; the basic syntax is v0/item/{id}.json, where id is the id of the item we are interested in. In this case the thread’s id is 18661546, so here is an example of how to get the main page data:



import requests

main_page = requests.request('GET', 'https://hacker-news.firebaseio.com/v0/item/18661546.json').json()

The same API call is also used for the sub posts of a thread or a post, whose ids can be found in the kids key of the parent post. Looping over the kids we can get the text of every post in the thread.
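The loop over the kids could look roughly like the sketch below. The helper names fetch_item and collect_texts are hypothetical, and the sketch uses only the standard library so that it runs without extra packages; requests would work equally well. It walks the whole comment tree, so replies to replies are included too.

```python
import json
import urllib.request

HN_ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"


def fetch_item(item_id):
    """Fetch one item (story or comment) as a dict; missing items come back as None."""
    with urllib.request.urlopen(HN_ITEM_URL.format(item_id), timeout=10) as resp:
        return json.load(resp)


def collect_texts(item_id, fetch=fetch_item):
    """Walk the thread depth-first and return the text of every post that has one."""
    texts = []
    stack = [item_id]
    while stack:
        item = fetch(stack.pop())
        if not item:
            continue  # the API returns null for unknown ids
        if item.get("text"):
            texts.append(item["text"])  # deleted posts have no text and are skipped
        # push children in reverse so they are visited in thread order
        stack.extend(reversed(item.get("kids", [])))
    return texts
```

Calling collect_texts(18661546) would then return the list of post texts used in the rest of the article. Note that the text field contains HTML, and that this makes one request per post, so a large thread takes a while to download.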

Cleaning the data

Now that we have the text data we want to extract book titles from it. One possible approach would be to look for all Amazon or Goodreads links in the thread and just group by those. This is a clean approach because it doesn’t depend on any text processing. However, a quick look at the thread makes it clear that the vast majority of suggestions do not have any link associated with them. So I decided to go for the more difficult route: grouping ngrams together and matching those ngrams with possible books.

So, after eliminating special characters from the text, I grouped together bigrams, trigrams, 4-grams and 5-grams and counted their occurrences. Here is an example that counts bigrams:

import re
from collections import Counter
import operator

# clean special characters
text_clean = [re.sub(r"[^a-zA-Z0-9]+", ' ', k) for t in text for k in t.split("\n")]

# count occurrences of bigrams in different posts
countsb = Counter()
words = re.compile(r'\w+')
for t in text_clean:
    w = words.findall(t.lower())
    countsb.update(zip(w, w[1:]))

# sort results
bigrams = sorted(
    countsb.items(),
    key=operator.itemgetter(1),
    reverse=True
)
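The same counting extends to trigrams, 4-grams and 5-grams. Here is a minimal sketch of one way to generalize it; count_ngrams is a hypothetical helper name that does not appear in the original code, and it expects lines that have already been cleaned as above.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"\w+")


def count_ngrams(lines, n):
    """Count every run of n consecutive words in each cleaned line of text."""
    counts = Counter()
    for line in lines:
        w = WORD_RE.findall(line.lower())
        # zip(w, w[1:], ..., w[n-1:]) yields the consecutive n-word tuples
        counts.update(zip(*(w[i:] for i in range(n))))
    return counts


# one Counter per ngram size, from bigrams up to 5-grams
sample = ["the lord of the rings", "The Lord of the Flies"]
all_counts = {n: count_ngrams(sample, n) for n in range(2, 6)}
```

Lines with fewer than n words simply contribute nothing, so no special case is needed for short posts.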

In text applications, one of the first steps in processing the data is usually to eliminate stopwords, i.e. the most common words in a language, like articles and prepositions. We have not eliminated stopwords from our text yet, so the most frequent ngrams are composed almost exclusively of stopwords. In fact, here are the 10 most common bigrams in our data:

[((u'of', u'the'), 147),
 ((u'in', u'the'), 76),
 ((u'it', u's'), 67),
 ((u'this', u'book'), 52),
 ((u'this', u'year'), 49),
 ((u'if', u'you'), 45),
 ((u'and', u'the'), 44),
 ((u'i', u've'), 44),
 ((u'to', u'the'), 40),
 ((u'i', u'read'), 37)]

Having stopwords in our data is fine: most book titles contain stopwords, so we want to keep them. However, to avoid looking up too many combinations, we eliminate the ngrams that are composed solely of stopwords and keep all the others.
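This filter can be sketched as follows. The stopword set here is a tiny hand-picked one used only for illustration; the article does not say which list was actually used, and in practice a fuller list (for example the one shipped with NLTK) would be a better choice. The function name drop_stopword_only is likewise hypothetical.

```python
from collections import Counter

# Illustrative stopword set only; an assumption, not the list used in the analysis.
# "s", "t" and "ve" are the fragments left over from contractions like "it's".
STOPWORDS = frozenset({
    "a", "an", "and", "the", "of", "in", "to", "it", "if", "you",
    "i", "this", "is", "for", "on", "s", "t", "ve",
})


def drop_stopword_only(counts, stopwords=STOPWORDS):
    """Keep an ngram unless every one of its words is a stopword."""
    return Counter({
        ngram: c
        for ngram, c in counts.items()
        if not all(word in stopwords for word in ngram)
    })


counts = Counter({
    ("of", "the"): 147,
    ("it", "s"): 67,
    ("this", "book"): 52,
    ("the", "hobbit"): 3,
})
kept = drop_stopword_only(counts)
```

Here ("of", "the") and ("it", "s") are dropped, while ("this", "book") and ("the", "hobbit") survive because each contains at least one non-stopword.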

Checking book titles: the Goodreads API

Now that we have a list of candidate ngrams, we will use the Goodreads API to check whether they correspond to book titles. When a search returns multiple matches, I take the most recent publication as the result, on the assumption that the most recent book is the most likely match in this context. This assumption can of course lead to errors.

The Goodreads API is a bit less straightforward to use than the Hacker News one, as it returns results in XML, which is less friendly to work with than JSON. In this analysis I used the xmltodict Python package to convert the XML into a JSON-like dictionary. The API method we need is search.books, which searches for books by title, author or ISBN. Here is a code sample that gets the title and author of the most recently published search result:

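import json
import numpy as np
import requests
import xmltodict

# grkey holds the Goodreads API key
res = requests.get(
    "https://www.goodreads.com/search/index.xml",
    params={"key": grkey, "q": 'some book title'}
)

# parse the XML response into nested dictionaries
xpars = xmltodict.parse(res.text)
json1 = json.dumps(xpars)
d = json.loads(json1)
lst = d['GoodreadsResponse']['search']['results']['work']

# publication year of every search result
ys = [int(lst[j]['original_publication_year']['#text']) for j in range(len(lst))]

# title and author of the most recently published result
title = lst[np.argmax(ys)]['best_book']['title']
author = lst[np.argmax(ys)]['best_book']['author']['name']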
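The snippet above assumes that every search returns several results and that each one has a publication year. A slightly more defensive version of the selection step might look like the sketch below. It works on the already parsed 'work' value, the function name most_recent_book is hypothetical, and the handling of missing years is an assumption about how such entries appear in the parsed response (no '#text' key), not something confirmed by the API documentation.

```python
def most_recent_book(works):
    """Return (title, author) of the most recently published work, or None.

    `works` is the value found under ['results']['work'] after parsing.
    xmltodict gives a single dict rather than a list when there is only
    one result, so both shapes are accepted.
    """
    if not works:
        return None
    if isinstance(works, dict):
        works = [works]

    best, best_year = None, None
    for work in works:
        year_field = work.get("original_publication_year") or {}
        # assumed shape: {'#text': '2011', ...}; the key is absent when the year is unknown
        year_text = year_field.get("#text") if isinstance(year_field, dict) else year_field
        try:
            year = int(year_text)
        except (TypeError, ValueError):
            continue  # skip results without a usable year
        if best_year is None or year > best_year:
            best, best_year = work, year

    if best is None:
        return None
    book = best["best_book"]
    return book["title"], book["author"]["name"]
```

Like np.argmax, this keeps the first result when two works share the same latest year. Returning None for an empty or unusable result set lets the calling loop skip ngrams that do not match any book.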

This method allows us to associate ngrams with possible books. We then check the list of books obtained by matching all ngrams against the Goodreads API against the full text data. Before performing the actual check we shorten the book names by eliminating punctuation (particularly semicolons) and subtitles. We only consider the main title, on the assumption that most of the time only this part of the title is used (some of the full titles in the list are actually really long!). Ranking the results by number of occurrences in the thread, we get this list: