When will Stack Overflow reply: How to predict with BigQuery

When you finally post a question on Stack Overflow, a long wait starts. Will someone answer your question? Will the community downvote what you asked, judging your question as bad? Will someone ever give you an answer? Have they answered it yet? Can we predict any of these? Let's do that.

Stack Overflow stats per month

I love Stack Overflow. It has answers to most of my programming questions, and when it doesn't I can post a new one. In a normal month more than 120,000 users contribute new questions, and more than 75,000 community members are ready to help. I know, because I've been on both sides.

BigQuery offers many public datasets, and one of these is a quarterly updated copy of Stack Overflow. We can use this data and the recently announced BigQuery ML features to predict how long it will take to answer your Stack Overflow questions.

Did you just post a question on Stack Overflow? Let’s review how these predictions are run — while you wait.

The science behind it

The first step is to measure how long a user has to wait before someone on Stack Overflow gives them an answer. We'll get this from the questions table in BigQuery, using a correlated subquery to pull in the answers:

Average time to get an answer on Stack Overflow, per tag. Keep reading, because we’ll debunk these numbers in a few.

The basic query that joins answers with questions, producing the time between a question and its first answer:

WITH question_answers_join AS (
  SELECT *
    , GREATEST(1, TIMESTAMP_DIFF(answers.first, creation_date, MINUTE)) minutes_2_answer
  FROM (
    SELECT id, creation_date, title
      , (SELECT AS STRUCT MIN(creation_date) first, COUNT(*) c
         FROM `bigquery-public-data.stackoverflow.posts_answers`
         WHERE a.id = parent_id
        ) answers
      , SPLIT(tags, '|') tags
    FROM `bigquery-public-data.stackoverflow.posts_questions` a
    WHERE EXTRACT(YEAR FROM creation_date) > 2016
  )
)
SELECT *
FROM question_answers_join
WHERE 'google-bigquery' IN UNNEST(tags)
AND answers.c > 0
ORDER BY RAND()
LIMIT 10

Some sample questions: when they were posted, and how long it took to get an answer.

That’s a good start — let’s get the average per tag, for some popular tags:

SELECT COUNT(*) questions, tag
  , ROUND(AVG(minutes_2_answer), 2) avg_minutes
FROM question_answers_join, UNNEST(tags) tag
WHERE tag IN ('javascript', 'python', 'rust', 'java', 'php', 'ruby', 'perl', 'ios', 'android', 'angularjs')
AND answers.c > 0
GROUP BY tag
ORDER BY avg_minutes

Naive average time per tag to get an answer.

Turns out Perl has the most amazing community: it's the quickest to reply! Followed by Rust, Ruby, and PHP. The slowest communities are Android and iOS. In between you have… STOP RIGHT THERE.

Warning: Plain averages are no good here. Instead, we shall calculate the geometric mean.

I won't go into a full explanation here of why plain averages aren't a good fit, but the first thing you should notice is that these numbers are telling us we have to wait for days before getting a reply on Stack Overflow, which doesn't reflect our normal experience. To reduce the impact of outliers, we can ask for the geometric mean and median instead:
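To see why outliers skew the plain average, here's a tiny standalone BigQuery query on made-up wait times (the values are purely illustrative): one extreme wait drags the arithmetic mean far above the typical value, while the geometric mean stays close to it.

```sql
-- Toy data: five wait times in minutes, one of them an extreme outlier.
SELECT
  AVG(minutes) plain_average            -- dominated by the outlier
  , EXP(AVG(LOG(minutes))) geometric_mean  -- stays near the typical value
FROM UNNEST([10, 20, 30, 40, 10000]) minutes
```

Here the plain average comes out at 2,020 minutes, while the geometric mean is roughly 75 — much closer to what a typical asker experiences.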

SELECT COUNT(*) questions, tag
  , ROUND(EXP(AVG(LOG(minutes_2_answer))), 2) mean_geo_minutes
  , APPROX_QUANTILES(minutes_2_answer, 100)[SAFE_OFFSET(50)] median
FROM question_answers_join, UNNEST(tags) tag
WHERE tag IN ('javascript', 'python', 'rust', 'java', 'php', 'ruby', 'perl', 'ios', 'android', 'angularjs')
AND answers.c > 0
GROUP BY tag
ORDER BY mean_geo_minutes

Geometric mean and median time to get an answer on Stack Overflow, per tag.

That makes more sense: You’ll get your JavaScript questions answered in ~35 minutes, while Rust, Android, and iOS questions will take more than an hour to show up.

We can also calculate the chance of getting an answer:

SELECT COUNT(*) questions, tag
  , ROUND(EXP(AVG(LOG(minutes_2_answer))), 2) avg_minutes
  , FORMAT('%.2f%%', 100*COUNT(minutes_2_answer)/COUNT(*)) chance_of_answer
FROM question_answers_join, UNNEST(tags) tag
WHERE tag IN ('javascript', 'python', 'rust', 'java', 'php', 'ruby', 'perl', 'ios', 'android', 'angularjs')
GROUP BY tag
ORDER BY avg_minutes

Perl, Ruby, and Rust look good here: from our selection, these are the only tags with a reply rate above 80%. Interestingly enough, these tags also get far fewer questions than the others in this selection.

Time to answer vs chance of answer: iOS and Android are the slowest to reply, and they also have the lowest chance of getting a reply. Perl has a high probability of getting a reply. Meanwhile JavaScript, Python, and PHP have more questions than Perl, and are quicker to reply. (DS source)

Now, there are a lot of other variables we could explore: Does it take more time to get answers on a Monday or a Saturday? Do new Stack Overflow users get faster replies than veteran users? Does it matter which word you start a question with? Does it matter if you don't end a question with a question mark? Longer questions or shorter ones? Time of day? Let's answer all of these questions:

First word: Most people ask 'how' questions. 'Can' questions take longer to answer than 'why' questions.

Day of week (UTC): More questions are asked during the week, and they get faster answers, but with a lower probability of getting one.

Hour of the day (UTC): The fastest answers, but with the lowest chance of getting one, come around 14:00 UTC, which is 7am in California. The opposite happens at 00:00 UTC, 5pm in California.

Account creation year: New users get quick replies. Veteran users ask questions that take longer to answer, but more of their questions get answered. Note there seems to be a linear relationship between account age and time to answer, but a non-linear one with the probability of an answer.

Length of question: Shorter questions get answered faster, and with a higher chance.

Did they use a '?': Questions that end with a question mark get quicker answers, with a higher chance of getting one.

My basic query to get all of these dimensions:

WITH answers AS (
  SELECT *
    , REGEXP_EXTRACT(LOWER(title), '[a-z]+') first_word
    , '?' = SUBSTR(title, LENGTH(title)) ends_question
    , GREATEST(1, TIMESTAMP_DIFF(answers.first, creation_date, MINUTE)) minutes_first_answer
    , answers.c > 0 was_answered
  FROM (
    SELECT creation_date, title
      , CASE WHEN LENGTH(body) < 700 THEN 'short'
             WHEN LENGTH(body) < 1700 THEN 'medium'
             ELSE 'long' END question_length
      , (SELECT AS STRUCT MIN(creation_date) first, COUNT(*) c
         FROM `bigquery-public-data.stackoverflow.posts_answers`
         WHERE a.id = parent_id
        ) answers
      , (SELECT AS STRUCT EXTRACT(YEAR FROM creation_date) account_creation_year
         FROM `bigquery-public-data.stackoverflow.users`
         WHERE a.owner_user_id = id
        ) user
      , SPLIT(tags, '|') tags
    FROM `bigquery-public-data.stackoverflow.posts_questions` a
    WHERE EXTRACT(YEAR FROM creation_date) > 2016
  )
)
SELECT tag, 60*EXP(AVG(LOG(minutes_first_answer))) time_to_answer
  , COUNT(minutes_first_answer)/COUNT(*) chance_of_answer
  , COUNT(*) questions
  -- first_word, ends_question, tag
  -- , FORMAT_TIMESTAMP('%H', creation_date) hour_utc
  -- , FORMAT_TIMESTAMP('%A', creation_date) weekday_utc
  -- , user.account_creation_year
  -- , question_length
FROM answers, UNNEST(tags) tag
WHERE tag IN ('javascript', 'python', 'rust', 'java', 'php', 'ruby', 'perl', 'ios', 'android', 'angularjs')
-- AND first_word IN UNNEST(['why','what','how','is','can', 'i'])
GROUP BY 1

Combining everything: Regressions!

So how do we combine all of these averages? What if I have a short SQL question at 3pm on a Thursday, and I've been a Stack Overflow user since 2009? My question starts with 'what', and I'll make sure to ask it in the form of a question. Can we create a formula that encapsulates all of these metrics, even if no one has ever attempted this combination before?

Yes! That’s why BigQuery now supports linear and logistic regressions (expect more). Say hello to my friend, BigQuery ML.

To create a linear regression model that combines all of these features into one prediction of how long you have to wait for an answer, just do this:

CREATE MODEL `dataset.stacko_predicted_time`
OPTIONS (model_type='linear_reg') AS
SELECT first_word, ends_question, tag
  , FORMAT_TIMESTAMP('%H', creation_date) hour_utc
  , FORMAT_TIMESTAMP('%A', creation_date) weekday_utc
  , user.account_creation_year
  , question_length
  , LOG(minutes_first_answer) label
FROM answers, UNNEST(tags) tag
WHERE was_answered

This query goes over 3,784,571 questions and their replies, across 46,178 different tags. You might notice that I duplicate each question for each of its tags (when it has multiple tags), which gives me a total of 11,297,337 rows. This might give some tags an unfair weight in the equation, but we can discuss that later. How long did it take to train this model? Six minutes, the last time I tried.
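Once the model is trained, getting a prediction for a fresh question is just one more query with ML.PREDICT. A sketch of what that could look like for the hypothetical question above — the feature values here are made up for illustration. One detail to keep in mind: since the label we trained on was LOG(minutes_first_answer), the model predicts in log space, so we wrap the output in EXP() to get back to minutes:

```sql
-- Hypothetical input: a short 'what' question tagged sql,
-- asked on a Thursday by an account created in 2009.
SELECT EXP(predicted_label) predicted_minutes
FROM ML.PREDICT(MODEL `dataset.stacko_predicted_time`, (
  SELECT 'what' first_word, true ends_question, 'sql' tag
    , '22' hour_utc, 'Thursday' weekday_utc
    , 2009 account_creation_year, 'short' question_length
))
```

Before trusting its output, you can also inspect the model's fit with `SELECT * FROM ML.EVALUATE(MODEL `dataset.stacko_predicted_time`)`.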