Hacker News on BigQuery: Now with daily updates — So what are the top domains?

We published a copy of all Hacker News contents in BigQuery back in 2015. It was time for an update and, even better, how about daily updates? In this post, let’s look at HN’s favorite sources and at changes in the best times to post.

Thanks to the BigQuery Public Datasets Program, we now update Hacker News in BigQuery daily. Nothing better than having fresh data to analyze every day. To celebrate the occasion, I want to look at the top domains that Hacker News uses as sources. I recently did something similar for reddit, which was full of interesting surprises.

So what were the top domains shared on Hacker News during 2017?

Most frequent domains on HN, 2017. Source: Hacker News copy shared in BigQuery

#standardSQL
SELECT REGEXP_EXTRACT(url, '//([^/]*)/?') domain, COUNT(*) c
FROM `bigquery-public-data.hacker_news.full`
WHERE url!='' AND EXTRACT(YEAR FROM timestamp)=2017
GROUP BY domain ORDER BY c DESC LIMIT 10
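
To see what the regular expression in that query captures, here is a minimal Python sketch that applies the same pattern to a handful of URLs and counts the results. The sample URLs and the helper name `extract_domain` are made up for illustration; only the pattern itself comes from the query. Note that the pattern keeps any `www.` prefix, so `www.nytimes.com` and `nytimes.com` would be counted as separate domains.

```python
import re
from collections import Counter

# Same pattern as the REGEXP_EXTRACT call in the query: capture everything
# after "//" up to the next "/" (or the end of the string).
DOMAIN_RE = re.compile(r"//([^/]*)/?")


def extract_domain(url):
    """Return the host part of a URL, or None when the URL has no '//'."""
    match = DOMAIN_RE.search(url)
    return match.group(1) if match else None


# Hypothetical sample URLs, not taken from the dataset.
sample_urls = [
    "https://github.com/torvalds/linux",
    "https://www.nytimes.com/2017/01/01/technology/example.html",
    "http://github.com",
    "https://medium.com/@someone/a-post",
]

# Rough equivalent of GROUP BY domain ORDER BY c DESC.
domain_counts = Counter(extract_domain(u) for u in sample_urls)
for domain, c in domain_counts.most_common():
    print(domain, c)
```

The query does the same thing at scale: one extracted domain per row, grouped and sorted by count.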

That’s interesting, but not what most users see. Let’s rank by the number of posts that scored more than 40 upvotes, the ones that make it to the front page:

Most frequent domains with score>40 on HN, 2017. Source: Hacker News copy shared in BigQuery

#standardSQL
SELECT REGEXP_EXTRACT(url, '//([^/]*)/?') top_domains_2017, COUNT(*) count, COUNTIF(score>40) score_gt_40
FROM `bigquery-public-data.hacker_news.full`
WHERE url!='' AND EXTRACT(YEAR FROM timestamp)=2017
GROUP BY 1 ORDER BY 3 DESC LIMIT 10

Certainly Hacker News likes content hosted on sites like github.com and the NYTimes. But some of those ratios look abysmal. Which domains have the best chance of getting more than 40 upvotes?

Domains that landed the highest percentage of their submissions on the HN front page, 2017 (score>40, count>30). Source: Hacker News copy shared in BigQuery

#standardSQL
SELECT REGEXP_EXTRACT(url, '//([^/]*)/?') top_domains_2017, COUNT(*) count, COUNTIF(score>40) score_gt_40
, ROUND(100*COUNTIF(score>40)/COUNT(*),2) chances_of_homepage
FROM `bigquery-public-data.hacker_news.full`
WHERE url!='' AND EXTRACT(YEAR FROM timestamp)=2017
GROUP BY 1
HAVING count>30
ORDER BY 4 DESC LIMIT 20

Oh, that’s cool! As a Googler I’m also proud to see that 3 of our blogs are in the top 10 for Hacker News-worthy content. And a shoutout to Gwern Branwen, the only individual author in the top 20.

Let’s look at all the domains with more than 300 posts submitted so far this year:

All domains with > 300 posts on HN, 2017, ranked by chances of getting >40 upvotes. Source: Hacker News copy shared in BigQuery

#standardSQL
SELECT REGEXP_EXTRACT(url, '//([^/]*)/?') top_domains_2017, COUNT(*) count, COUNTIF(score>40) score_gt_40
, ROUND(100*COUNTIF(score>40)/COUNT(*),2) chances_of_homepage
FROM `bigquery-public-data.hacker_news.full`
WHERE url!='' AND EXTRACT(YEAR FROM timestamp)=2017
GROUP BY 1
HAVING count>300
ORDER BY 4 DESC
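
The chances_of_homepage column in the last two queries is simply the share of a domain's submissions that scored more than 40, expressed as a percentage. A small Python sketch on made-up rows shows the arithmetic and why the HAVING filter matters. The domains, the scores, and the function signature (`min_count`, `threshold`) are all hypothetical; and Python's `round` may treat exact ties differently from BigQuery's ROUND, so treat this as an illustration rather than a drop-in replacement.

```python
from collections import defaultdict

# Hypothetical (domain, score) rows standing in for the real table.
rows = [
    ("example-news.com", 120), ("example-news.com", 3),
    ("example-news.com", 55), ("example-news.com", 1),
    ("example-blog.io", 41), ("example-blog.io", 40), ("example-blog.io", 2),
    ("rare-site.org", 500),
]


def chances_of_homepage(rows, min_count=0, threshold=40):
    """Percentage of posts per domain with score > threshold.

    Domains with min_count posts or fewer are dropped, mirroring
    the HAVING count>N clause in the queries.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for domain, score in rows:
        totals[domain] += 1
        if score > threshold:  # strictly greater, like COUNTIF(score>40)
            hits[domain] += 1
    result = {
        d: round(100 * hits[d] / totals[d], 2)
        for d in totals
        if totals[d] > min_count
    }
    # Rough equivalent of ORDER BY chances_of_homepage DESC.
    return sorted(result.items(), key=lambda kv: kv[1], reverse=True)


for domain, pct in chances_of_homepage(rows, min_count=1):
    print(domain, pct)
```

Without the minimum-count filter, `rare-site.org` would top the list at 100% on the strength of a single post, which is why the queries only rank domains with enough submissions. Note also that a score of exactly 40 does not count as a hit.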

Lessons learned:

Mainstream sites like washingtonpost.com, bbc.com, bloomberg.com, and nytimes.com have the greatest chance of producing front-page-worthy content: all above 10%.

TechCrunch continues to be one of HN’s favorite tech sources (>9%).

For self-hosting, the best platform seems to be github.com (>7%). Meanwhile, self-hosted sites like medium.com and hackernoon.com (both on the Medium platform) have a <3% chance of reaching the front page. So should I move this post out of Medium and into a GitHub project for HN to read it?

Wikipedia and YouTube videos have a low chance of reaching the front page (~2%).

Don’t use link shorteners (goo.gl, youtu.be): they show a 0% chance of reaching the front page.

Looks consistent with the 2016 results:

All domains with > 1350 posts to HN in 2016, ranked by chances of getting >40 upvotes. Source: Hacker News copy shared in BigQuery

And who remembers these top domains from 2008 and 2009?