Hacking Hacker News

Hacker News data has recently been released on Google’s BigQuery data engine. BigQuery is a giant clustered SQL engine that can query enormous amounts of data very quickly.

Quick Links If You Just Want to Play

This article talks about building out a data model in LookML to be able to explore this data.

If you are in a rush, you can:

Start with the Raw Data

Navigating to the Hacker News data in BigQuery, we see there are two tables, stories and comments. Both tables are relatively simple in structure.

Table Stories

Each story contains a story id, the author that made the post, when it was written, the score the story achieved. I believe this is ‘points’ on news.ycombinator.com Each story also has a title and the URL containing the content.

Getting Started

Let’s first start with Stories. First we run Looker’s generator to create a LookML model for stories. Each field in the table will have an associated LookML dimension. These dimensions are used both in Looker’s Explorer and to write SQL statements and run them in BigQuery.

Without any changing any of this code, we can explore the data immediately.

- explore: stories - view: stories sql_table_name: | [fh-bigquery:hackernews.stories] fields: - measure: count type: count drill_fields: detail* - dimension: id type: int sql: ${TABLE}.id - dimension: by type: string sql: ${TABLE}.[by] - dimension: score type: int sql: ${TABLE}.score - dimension: time type: int sql: ${TABLE}.[time] - dimension_group: post type: time timeframes: [time, date, week, month, year] sql: ${TABLE}.time_ts - dimension: title type: string sql: ${TABLE}.title - dimension: url type: string sql: ${TABLE}.url - dimension: text type: string sql: ${TABLE}.text - dimension_group: deleted type: time timeframes: [time, date, week, month] sql: ${TABLE}.deleted - dimension: dead type: yesno sql: ${TABLE}.dead - dimension: descendants type: int sql: ${TABLE}.descendants - dimension: author type: string sql: ${TABLE}.author sets: detail: [id, post_time, author, title]

Let’s Start Exploring

The first and most obvious question is, ‘How many stories are there?’

[Explore From Here](http://looker.com/publicdata/looks/39?show=fields,data)

Is the number of stories going up or down? Let’s look by year.

[Explore From Here](http://looker.com/publicdata/looks/107?show=fields,vis,data) [Explore From Here](http://looker.com/publicdata/looks/108?show=fields,vis,data)

Looks like stories peaked in 2013.

How Do I Get My Story to the Top of Hacker News?

I’ve tried to post a couple of times, but my stories never seem to go anywhere. Let’s find my stories. I’m going to filter by ‘lloydt’ (my Hacker News name) and let’s take a look.

[Explore From Here](http://looker.com/publicdata/looks/144?show=fields,data)

Clicking on any of the counts will lead to my stories. Clicking on the 4 in the 2013 row’s count column shows my stories for that year.

[Explore From Here](http://looker.com/publicdata/looks/102?show=fields,data)

My best story scored a 4. Looking on the front page of Hacker news at the moment, we see posts with a variety of scores, but the lowest looks to be 7.

How Are the Scores Distributed?

If use the score as a dimension (group by score, in SQL) and count the number of posts with each score, we should get an idea about how likely a story is to get a given score. Looking at the table and graph below we can see that many stories have a score in the 1-4 range like my stories.

Lucky 7

[Explore From Here](http://looker.com/publicdata/looks/14?show=fields,data) [Explore From Here](http://looker.com/publicdata/looks/99?show=fields,vis,data)

The goal here is to try and figure out if a story made it to the front page of Hacker News. Many stories each day are posted and most don’t get there. It is pretty obvious that the distribution is heavily bifurcated; there are some stories that make it, but most stories don’t.

Unfortunately, there is no obvious way to see the split in the data.

Sometimes, just picking a somewhat arbitrary threshold will help us find it. In this case, I’m going to pick 7 as a threshold for an interesting story. Later, we can investigate different thresholds, but for now, I’m going to say if a story has a score of 7 or more, it’s interesting.

Let’s build a new dimension. In LookML. Note we can reuse the score definition (${score}) in the main model to create score_7_plus.

- dimension: score_7_plus type: yesno sql: ${score} >= 7

Using the new score_7_plus dimension, we run a query and we can see that about 15% (300K/1959K) of the stories have a score of 7 or above.

[Explore From Here](http://looker.com/publicdata/looks/105?show=fields,data)

Looker, using the LookML model, is writing all the SQL for us behind the scenes and sending it to BigQuery. The query it sent on our behalf was:

SELECT CASE WHEN stories.score >= 7 THEN 'Yes' ELSE 'No' END AS stories_score_7_plus, COUNT(*) AS stories_count FROM [fh-bigquery:hackernews.stories] AS stories GROUP EACH BY 1 ORDER BY 2 DESC LIMIT 500

Who is the King of Hacker News?

Getting a front page story is no easy feat. Let’s see if we can figure out of someone out there does it consistently. We’re going to use our lucky 7 as our threshold. To examine this, we going to hold our grouping by Score 7 Plus and additionally group by Author. Looker lets us pivot the results. We’re also going to sort by the Yes Count column to find the person with the most posts with a score of 7 or more.

[Explore From Here](http://looker.com/publicdata/looks/16?show=fields,data)

It looks like an author cwan has had the most posts that have made it to the front page. To see what cwan posts about, we just click on his story count. All looker counts drill into the detail behind them. Let’s look at cwan’s posts.

Finding Top Posters about a Particular Subject.

[Explore From Here](http://looker.com/publicdata/looks/88?show=fields,data)

Filtering data to a smaller set can help us find trends about a particular subject.

We can go back to our original “top poster” query and research whose posts contain the word ‘Facebook’. We’ll see a different set of people.

Notice that the top poster ssclanfani has had 122 posts with ‘Facebook’ in the title and 65 of them have a score 7 or higher (about 50%).

iProject has had 323 posts about Facebook and only 29 have scored 7 or higher (about 10%).

[Explore From Here](http://looker.com/publicdata/looks/95?show=fields,data)

Often, the devil is in the details. Many times, I’ve clicked into a number and looked at the underlying data records and seen some pattern. Let’s look at ssclafani’s Facebook posts and see if we can find something interesting. By clicking into the 57, we can see his posts.

Generalizing Hit Rate

[Explore From Here](http://looker.com/publicdata/looks/106?show=fields,data)

Pivoting the data is helpful, but we’re still doing some calculations by hand. We can create a couple of custom measures that will help us understand the data more readily.

We’ll create a count of just the posts that scored 7 and above.

Then we’ll create a measure that is the percentage of all posts that scored 7 and above.

LookML makes creating these measures easy.

Notice that we reuse the definition of score_7_plus.

- measure: count_score_7_plus type: count drill_fields: detail* filters: score_7_plus: Yes

And we reuse the definition of count_score_7_plus in the following definition.

- measure: percent_7_plus type: number sql: 100.0 * ${count_score_7_plus} / ${count} decimals: 2

With the new measures, we can rebuild and run our previous query. The percentage measure really helps us see that the author Slimy is quite good at placing stories; 65.22% of his stories score 7 or higher.

Are There People Better than the Author Called ‘Slimy’?

[Explore From Here](http://looker.com/publicdata/looks/96?show=fields,data)

Another advantage of creating a new measure is we can now sort by it. Let’s sort by Percent 7 Plus and look at people that have posted more than 5 stories (again, an arbitrary number).

Where Do the Stories Live?

[Explore From Here](http://looker.com/publicdata/looks/84?show=fields,data)

Hacker News only contains titles and urls that point places (and comments). Let’s take a look where the stories that are posted live. In order to do this, we’ll have to parse out the host from the URL. We’ll build a dimension in LookML that does this. BigQuery’s SQL has a regular expression extractor that makes it pretty easy. LookML also has a way that we can write the html for the thing we are displaying.

We add the dimension to our model:

- dimension: url_host sql: REGEXP_EXTRACT(${url},'http://([^/]+)/')

And now we can look at stories by the host they were posted to. Let’s sort by Score 7 Plus.

[Explore From Here](http://looker.com/publicdata/looks/110?show=fields,data)

And a peek at the sql that Looker wrote for us:

SELECT REGEXP_EXTRACT(stories.url,'http://([^/]+)/') AS stories_url_host, COUNT(*) AS stories_count, COUNT(CASE WHEN stories.score >= 7 THEN 1 ELSE NULL END) AS stories_count_score_7_plus, 100.0 * (COUNT(CASE WHEN stories.score >= 7 THEN 1 ELSE NULL END)) / (COUNT(*)) AS stories_percent_7_plus FROM [fh-bigquery:hackernews.stories] AS stories GROUP EACH BY 1 ORDER BY 3 DESC LIMIT 500

Domains Are Better

Domains are probably more interesting then hosts. After all, www.techcrunch.com and techcrunch.com both appear in this list. So let’s build up another field that parses domain out of the host. We have to be careful to deal with hosts like ‘bbc.co.uk’, so we look for domains that end in two letters and grab more data.

- dimension: url_domain sql: REGEXP_EXTRACT(${url_host},'([^\\.]+\\.[^\\.]+(?:\\.[a-zA-Z].)?)$')

[Explore From Here](http://looker.com/publicdata/looks/15?show=fields,data)

Are there domains that are more successful than others? Lets look at hosts by Percent 7 Plus.

[Explore From Here](http://looker.com/publicdata/looks/89?show=fields,data)

Whoops, looks like a bunch of one-hit-wonders. Let’s eliminate hosts that have had less than 20 successful posts.

Building a Better Indication that a Post Was on the Front Page

[Explore From Here](http://looker.com/publicdata/looks/97?show=fields,data)

There is an old joke about a group of people that encounter a bear in the woods. They all start running from the bear. The joke is that you don’t have to outrun the bear, you have to outrun the other people.

Hacker News scores are like that. We probably don’t care what the actual score is, we just care that its better then the other scores being posted on the same day.

We’ll rank the the score for each day starting with 1 as the best score for the day and moving down.

In order to compute the daily rank, we’ll need to use SQL’s window function and a derived table in LookML. The output is a two column table with the id of the story and the rank of the story on the day it was posted.

- view: daily_rank derived_table: sql: | SELECT id , RANK() OVER (PARTITION BY post_date ORDER BY score DESC) as daily_rank FROM ( SELECT id , DATE(time_ts) as post_date , score FROM [fh-bigquery:hackernews.stories] WHERE score > 0 ) fields: - dimension: id primary_key: true hidden: true - dimension: daily_rank type: number

We can join this table into our stories explore.

- explore: stories joins: - join: daily_rank sql_on: ${stories.id} = ${daily_rank.id} relationship: one_to_one

We can then look at our data by daily_rank and see the number of stories that match this. The data looks right. There are some 3000 days and a story for each rank for each day.

[Explore From Here](http://looker.com/publicdata/looks/85?show=fields,data)

The SQL that Looker wrote for this query is below. As the model gets more and more complex, so do the queries, but asking the question remains simple.

SELECT daily_rank.daily_rank AS daily_rank_daily_rank, COUNT(*) AS stories_count FROM [fh-bigquery:hackernews.stories] AS stories LEFT JOIN (SELECT id , RANK() OVER (PARTITION BY post_date ORDER BY score DESC) as daily_rank FROM ( SELECT id , DATE(time_ts) as post_date , score FROM [fh-bigquery:hackernews.stories] WHERE score > 0 ) ) AS daily_rank ON stories.id = daily_rank.id GROUP EACH BY 1 ORDER BY 1 LIMIT 500

Let’s Build a New Top 25 Set of Dimensions and Measures

Like we did before, building dimensions and measures into the model will allow us to think in these terms.

We build them in a very similar way that we built our Score 7 measures. Notice we simply reference ${daily_rank.rank} and Looker figures out how to write the SQL to make it all fit together.

# Was this post in the top 25 on a given day? - dimension: rank_25_or_less type: yesno sql: ${daily_rank.rank} <= 25 # How many posts were in the top 25 out of this group of posts? - measure: count_rank_25_or_less type: count drill_fields: detail* filters: rank_25_or_less: Yes # What Percentage of posts were in the top 25 in group set of posts? - measure: percent_rank_25_or_less type: number sql: 100.0 * ${count_rank_25_or_less} / ${count} decimals: 2

And the simple output. Looks like about 4% of posts make it to the top 25 on a given day.

[Explore From Here](http://looker.com/publicdata/looks/100?show=fields,data)

Now let’s look at it by poster. Looks like Paul Graham (whose author name is “pg”)has had lots of top 25 posts and a very high hit rate.

Wow. Looking by Domain Is an Amazing List

[Explore From Here](http://looker.com/publicdata/looks/103?show=fields,data)

Rerunning the query, this time by target domain with high story counts with rank 25 or less gives us a fascinating list of domains. The obvious ones are there, nytimes, bbc.co.uk, but scrolling down a little, I find domains I don’t know about. Following the links (we’ll talk about how to make these later) usualy takes me to an interesting place.

Common Words in Top Posts

[Explore From Here](http://looker.com/publicdata/looks/104?show=fields,data)

We can now find top posts. Let’s see if we can figure out why some posts are top. Are top posts talking about something in particular? Let’s see if we can find common words in posts.

First, we’re going to build a derived table that has two columns, the story id, and a word that appeared in the title.

view: story_words derived_table: sql: | SELECT id, SPLIT(title," ") as word FROM [fh-bigquery:hackernews.stories] stories fields: - dimension: id primary_key: true hidden: true - dimension: word

Next we’ll build an explore definition (the join relationships). We’ll reuse both our prior views (the story view and the daily_rank view). The basis of this explore is a word, not a story.

- explore: story_words joins: - join: stories sql_on: ${story_words.id}=${stories.id} relationship: many_to_one type: left_outer_each - join: daily_rank sql_on: ${stories.id} = ${daily_rank.id} relationship: one_to_one type: left_outer_each

Now we can explore by word. Let’s look at the words in the posts with a rank of 25 or less. Scroll down a little to look past some of the small common words

[Explore From Here](http://looker.com/publicdata/looks/86?show=fields,data)

and again, the SQL Looker is writing for us:

SELECT story_words.word AS story_words_word, COUNT(DISTINCT CASE WHEN daily_rank.daily_rank <= 25 THEN stories.id ELSE NULL END, 1000) AS stories_count_rank_25_or_less FROM (SELECT id, SPLIT(title," ") as word FROM [fh-bigquery:hackernews.stories] stories ) AS story_words LEFT JOIN EACH [fh-bigquery:hackernews.stories] AS stories ON story_words.id=stories.id LEFT JOIN EACH (SELECT id , RANK() OVER (PARTITION BY post_date ORDER BY score DESC) as daily_rank FROM ( SELECT id , DATE(time_ts) as post_date , score FROM [fh-bigquery:hackernews.stories] WHERE score > 0 ) ) AS daily_rank ON stories.id = daily_rank.id GROUP EACH BY 1 ORDER BY 2 DESC LIMIT 500

Of course clicking on any of the numbers will drill in and show us of the stories.

Eliminating the Common Words with a Shakespeare

The common words are a problem. It would be great to eliminate or at least flag them.

To do this, we’re going to use an inspired little hack.

BigQuery provides a nice little table of all the words in Shakespeare. The table consists of the word, the corpus it appeared in and what year the corpus was written.

We are going to find these 1000 words and then flag the words that we encounter that appear in the 1000 word list.

First, we write a little query to find the 1000 most common words in Shakespeare.

SELECT lower(word) as ssword , count(distinct corpus) as c FROM [publicdata:samples.shakespeare] GROUP BY 1 ORDER BY 2 DESC LIMIT 1000

With this word list, we can modify our derived table to have a new column, ssword, which if NOT NULL, means the word appears in Shakespeare (and we would consider it common).

- view: story_words derived_table: sql: | SELECT a.id as id, a.word as word, b.ssword as ssword FROM FLATTEN(( SELECT id, LOWER(SPLIT(title," ")) as word FROM [fh-bigquery:hackernews.stories] stories ), word) as a LEFT JOIN ( SELECT lower(word) as ssword , count(distinct corpus) as c FROM [publicdata:samples.shakespeare] GROUP BY 1 ORDER BY 2 DESC LIMIT 1000) as b ON a.word = b.ssword fields: - dimension: id primary_key: true hidden: true - dimension: word - dimension: is_comon_word type: yesno sql: ${TABLE}.ssword IS NOT NULL

Now rerunning our query, with the common field, we can see what we’ve isolated some of the more common words.

[Explore From Here](http://looker.com/publicdata/looks/92?show=fields,data)

And now without the common words:

Finally … Which Words, If in the Title of the Story, Are Most Likely to Get You on the Front Page

Comparing

[Explore From Here](http://looker.com/publicdata/looks/94?show=fields,data)[Explore From Here](http://looker.com/publicdata/looks/111?show=fields,data)

Now with a few clicks we can start comparing by filtering words to Microsoft, Google and Facebook.

Let’s compare front page posts by year.

Wiring This into an Application

[Explore From Here](http://looker.com/publicdata/looks/87?show=fields,data)

The next step is to make a data discovery application and cross wire all the research we’ve done so far. We easily build a dashboard that shows posts, over time, by domain, by author, by word, and success rates into making to a score of 7, and from a score of 7 into the top 25.

We wire up filters for author, domain and word, so that any of these will change all the data on the dashboard.

For example, let’s look at Paul Graham (author name “pg”). He is posting a little less over time, likes to post about ycombinator, talks about yc, applications, hn and startups. His posts look very successful.

One of the nice things we can do in Looker is to create links when we render the result cells. Dimensions have an html: parameter that is rendered with liquid templating.

Using this mechanism, we can cross link everywhere we display author, domain and word to point to a dashboard.

For example, we link author to both the dashboard and the profile page on Hacker News. We use emoji’s to make it all work.

- dimension: author type: string sql: ${TABLE}.author html: | {{ linked_value }} <a href="/dashboards/169?author={{value}}" title="Goto Dashboard" target=new>⚡</a> <a href="https://news.ycombinator.com/user?id={{value}}" title="Goto news.ycombinator.com" target=new>➚</a>

Other Ideas to Research

There is lots more to investigate.