Last week, my team at Google announced that we’d be hosting all of Stack Overflow’s Q&A data on BigQuery.

Here are some of the most interesting insights about Stack Overflow that we’ve uncovered so far.

Setting up the data dump

Nick Craver at Stack Overflow announced a new dataset dump on Friday:

We quickly loaded the full data dump into BigQuery:

Stack Overflow in BigQuery updated to 2016–12–11

If you want an answer, use a question mark

Sara Robinson discovered that only 22% of Stack Overflow questions end with a question mark.

So I thought — hm… that’s interesting. But does adding a “?” actually help you get answers?

I counted how many questions got an “accepted answer,” then grouped them by whether or not the title ended with a question mark.

It turns out that in 2016, 78% of questions ending in “?” got an accepted answer, versus only 73% of questions that didn’t end in “?”. And this pattern remains consistent if you look back through the years.

So if you want people to actually answer your Stack Overflow questions, end them with a question mark.

What about the number of answers a given question gets? Do questions that end with a “?” get more replies?

Yes, they do:

Questions ending in a “?” received at least 7% more answers in 2015 and 2016. The gap is even wider in 2008 and 2009, when questions with a “?” received 23% more answers than questions without one.
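The two comparisons above use different yardsticks: the accepted-answer figures differ by a number of percentage points, while "7% more answers" is a relative increase. As a small sketch of the difference, the hypothetical helpers below (they are not part of the original analysis) apply both measures to the 78% and 73% figures quoted earlier.

```python
def point_gap(with_mark, without_mark):
    """Absolute difference between two percentages, in percentage points."""
    return with_mark - without_mark


def relative_lift(with_mark, without_mark):
    """Relative increase of with_mark over without_mark, as a percentage."""
    return (with_mark - without_mark) * 100 / without_mark


if __name__ == "__main__":
    # 78 and 73 are the 2016 accepted-answer rates quoted in the text.
    print(point_gap(78, 73))
    print(round(relative_lift(78, 73), 1))
```

So the same gap reads as 5 percentage points on one scale and as a relative lift of a little under 7% on the other.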

Here’s the query I ran to get these results:

#standardSQL
-- Share of questions with an accepted answer, and average answer count,
-- by year and by whether the title ends with "?".
SELECT
  EXTRACT(YEAR FROM creation_date) year,
  IF(title LIKE '%?', 'ends with ?', 'does not') ends_with_question,
  ROUND(COUNT(accepted_answer_id) * 100 / COUNT(*), 2) AS answered,
  ROUND(AVG(answer_count), 3) AS avg_answers
FROM `bigquery-public-data.stackoverflow.posts_questions`
-- Exclude questions from the last 90 days of the dataset.
WHERE creation_date < (
  SELECT TIMESTAMP_SUB(MAX(creation_date), INTERVAL 24*90 HOUR)
  FROM `bigquery-public-data.stackoverflow.posts_questions`
)
GROUP BY 1, 2
ORDER BY 1, 2
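To make the grouping logic in the query above concrete, here is a minimal Python sketch that applies the same two aggregations to a handful of made-up rows. The sample data and the function name `summarize` are assumptions for illustration only. They are not part of the dataset or of BigQuery, and the sketch leaves out the 90-day cutoff from the WHERE clause.

```python
from collections import defaultdict

# Hypothetical sample rows: (year, title, accepted_answer_id, answer_count).
# These are made-up values for illustration, not real Stack Overflow data.
SAMPLE_QUESTIONS = [
    (2016, "How do I reverse a list?", 101, 3),
    (2016, "Why is my loop slow?", None, 1),
    (2016, "Reverse a list in Python", 102, 2),
    (2016, "Segfault in C", None, 0),
]


def summarize(rows):
    """Group rows by (year, ends-with-?) like the query's GROUP BY 1,2."""
    groups = defaultdict(list)
    for year, title, accepted_id, answer_count in rows:
        # Same check as: title LIKE '%?'
        label = "ends with ?" if title.endswith("?") else "does not"
        groups[(year, label)].append((accepted_id, answer_count))

    result = {}
    for key in sorted(groups):
        members = groups[key]
        # COUNT(accepted_answer_id) counts only the non-NULL values.
        accepted = sum(1 for accepted_id, _ in members if accepted_id is not None)
        answered_pct = round(accepted * 100 / len(members), 2)
        avg_answers = round(sum(n for _, n in members) / len(members), 3)
        result[key] = (answered_pct, avg_answers)
    return result


if __name__ == "__main__":
    for (year, label), (answered, avg_answers) in summarize(SAMPLE_QUESTIONS).items():
        print(year, label, answered, avg_answers)
```

Each output row corresponds to one (year, ends_with_question) group, matching the shape of the query result that feeds the charts.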

I built the above visualizations using re:dash.

Here’s a bonus visualization I did of how long it takes to get an accepted answer, depending on which programming language you’re asking about, along with the total volume of questions and answers for each language:

Here’s an interactive version.

And here’s the query I ran to get these results:



#standardSQL
-- For each tag: number of questions with an accepted answer, distinct answerers,
-- and average minutes between the question and its accepted answer.
SELECT
  tag,
  COUNT(*) c,
  COUNT(DISTINCT b.owner_user_id) answerers,
  AVG(TIMESTAMP_DIFF(b.creation_date, a.creation_date, MINUTE)) time_to_answer
FROM (
  SELECT *
  FROM (
    -- Tags are stored as one pipe-delimited string; split them into an array.
    SELECT id, EXTRACT(YEAR FROM creation_date) year, SPLIT(tags, '|') tags, accepted_answer_id, creation_date
    FROM `bigquery-public-data.stackoverflow.posts_questions`
  ), UNNEST(tags) tag  -- one row per (question, tag) pair
  WHERE accepted_answer_id IS NOT NULL
) a
LEFT JOIN `bigquery-public-data.stackoverflow.posts_answers` b
ON a.accepted_answer_id = b.id
GROUP BY 1
HAVING c > 300
ORDER BY 2 DESC
LIMIT 1000
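The tag query above does three things: it splits the pipe-delimited tags string, emits one row per tag, and joins each question to its accepted answer to measure the elapsed minutes. A rough Python equivalent on made-up rows might look like the sketch below. The sample data and the `time_to_answer_by_tag` helper are hypothetical, and the HAVING, ORDER BY, and LIMIT steps are left out.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical questions: id -> (tags, accepted_answer_id, creation_date).
QUESTIONS = {
    1: ("python|pandas", 10, datetime(2016, 1, 1, 12, 0)),
    2: ("python", 11, datetime(2016, 1, 2, 9, 0)),
    3: ("java", None, datetime(2016, 1, 3, 8, 0)),  # no accepted answer
}

# Hypothetical answers: id -> (owner_user_id, creation_date).
ANSWERS = {
    10: (500, datetime(2016, 1, 1, 12, 30)),
    11: (500, datetime(2016, 1, 2, 10, 30)),
}


def time_to_answer_by_tag(questions, answers):
    """Return {tag: (question_count, distinct_answerers, avg_minutes)}."""
    per_tag = defaultdict(list)
    for tags, accepted_id, asked_at in questions.values():
        if accepted_id is None:
            continue  # mirrors WHERE accepted_answer_id IS NOT NULL
        answer = answers.get(accepted_id)  # like the LEFT JOIN: may be missing
        for tag in tags.split("|"):  # SPLIT + UNNEST: one row per tag
            per_tag[tag].append((asked_at, answer))

    result = {}
    for tag, rows in per_tag.items():
        matched = [(asked, ans) for asked, ans in rows if ans is not None]
        answerers = {ans[0] for _, ans in matched}
        # Whole minutes between the question and its accepted answer.
        minutes = [int((ans[1] - asked).total_seconds() / 60) for asked, ans in matched]
        avg_minutes = sum(minutes) / len(minutes) if minutes else None
        result[tag] = (len(rows), len(answerers), avg_minutes)
    return result


if __name__ == "__main__":
    for tag, stats in time_to_answer_by_tag(QUESTIONS, ANSWERS).items():
        print(tag, stats)
```

A question with several tags contributes to every one of them, which is why per-tag counts can add up to more than the number of questions.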

Here’s Stack Overflow’s CEO announcing the fully query-able dataset:

One final interesting study: Graham Polley wrote a great post showing how to take Stack Overflow comments from BigQuery, run a sentiment analysis process on them with our Natural Language API and Dataflow, then bring them back to BigQuery to discover the most positive/negative communities.

His conclusion:

“Well, it turns out that Python developers post the lowest percent of negative comments overall, followed by Java, and then it’s JavaScript developers that are the (according to the NL-API) most unwelcoming to new users on Stack Overflow.” — Graham Polley

Want to learn more?

Check the GCP Big Data blog post, which includes queries on how to JOIN Stack Overflow’s data with other datasets like Hacker News and GitHub.

Want more stories? Check my Medium, follow me on Twitter, and subscribe to reddit.com/r/bigquery. And try BigQuery: every month you get a full terabyte of analysis for free.

Also, here’s: