Sparks have been flying between my favorite data analysis language and my favorite programmer’s Q & A site for a long time now: R flirted with StackOverflow on September 10, 2008, five days before StackOverflow was even open to the public. R still hesitates to leave its original suitor, the loud and lively R-help mailing list, where many of the heavyweights of the community still devote their selfless hours. Yet StackOverflow’s oh-so-smooth user interface, its combed-back hair, and its determined propositions of trinkets, upvotes, memes, and prestige cast an irresistible spell of romance.

But most alluring of all, StackOverflow’s user data, like that of all its StackExchange peers, is licensed under Creative Commons and released as a big XML data dump over BitTorrent. I dug in to explore the relationship between questions and answers:

After a huge initial spike of interest, the rate of asking and answering questions has continued to grow over the years. As I expected, there are more answers than questions, since people give multiple competing answers for the same question. You can also see an unsurprising dip in activity around the late-December holiday season.

Let’s dive deeper and look at the ratio of new answers to new questions per week:

The number of answers per question has plummeted from over 4 to around 2 now! For each new question, there are about half as many new answers in a week as there were in 2008. What happened?

To explore further, let’s look at how the questions and answers were distributed across the active population in any given week. How equally distributed are questions and answers across users, and how do they compare over time? The go-to tool for visualizing inequality, and the basis of the video, is the Lorenz curve. Here is the Lorenz curve from a few weeks after StackOverflow opened to the public:

The x-axis shows the proportion of those actively posting, and the y-axis shows the proportion of StackOverflow posts. Each point of the Questions curve is the cumulative proportion of questions (on the y-axis) asked by the bottom x proportion of the population (defined here as non-anonymous people posting questions or answers in a given week). For instance, you can see above that the bottom 50 percent of StackOverflowers actually asked no questions at all that week. The curve then rises in the region between 50 and 85 percent of the population, but still to less than 50 percent of the total questions; in other words, the bottom 85 percent of the users asked less than 50 percent of the questions, which means the top 15 percent account for 50 percent of the new questions that week.
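The post computes these curves with R’s ineq package; as a rough illustration of the construction itself, here is a minimal sketch in Python. The per-user counts are made up for the example — they are not StackOverflow data.

```python
# Minimal sketch of a Lorenz curve: sort per-user question counts
# ascending, then take cumulative shares. Point i is
# (share of users up to i, share of questions they asked).

def lorenz_points(counts):
    counts = sorted(counts)
    total = sum(counts)
    n = len(counts)
    points = [(0.0, 0.0)]
    running = 0
    for i, c in enumerate(counts, start=1):
        running += c
        points.append((i / n, running / total))
    return points

# Ten hypothetical active posters; six asked no questions at all,
# so the curve stays flat at y = 0 until x = 0.6, the same kind of
# flat start described in the plot above.
counts = [0, 0, 0, 0, 0, 0, 1, 2, 3, 14]
for x, y in lorenz_points(counts):
    print(f"{x:.1f} {y:.2f}")
```

The heavy skew in the toy data (one poster asking 14 of 20 questions) is what bows the curve away from the diagonal.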

Note the diagonal line: that is the line of perfect equality. If every StackOverflower asked the same number of questions, the Questions curve would lie along this line. If the Questions curve stayed at zero and then shot up right at 100, that would mean that one or a very few people asked all the questions. Thus, the more bowed the curve is away from the line of perfect equality, the more unequal the distribution is. The Questions curve here is clearly more bowed away from that line than the Answers curve — flat at the beginning and only catching up at the end — so questions are more unequally distributed than answers. How does this change over time? Now, on to the video, which steps through the last three years week by week:

Keep your eye on the Questions line, especially at the lower end of the population. While Answers keeps roughly the same shape over time, Questions shows a trend toward more equality. Where over 50 percent of active posters once asked no questions at all, that number now hovers below 40 percent, and the Questions and Answers curves now cross; which of the two is more unequally distributed has become ambiguous.

These results show that questions constitute a growing share of StackOverflow user activity and that questions are becoming more equally spread among active members. Why? I can only speculate for now, but one thought is that StackOverflow users are realizing that it’s easier to get ahead in StackOverflow’s rating system (called “reputation”) by asking questions than by answering them (see this post). Reputation points matter a lot to some people, not least because some apparently quote their reputation scores as part of the job search process. Another possibility is that certain cumulative effects of StackOverflow make answering harder: since StackOverflow has been around a while, many of the easy questions may already have been answered, and duplicate questions are quickly flagged and removed. Finally, this analysis completely ignores StackOverflow’s comment system: a lot of important discussion about both questions and answers happens in the comments, and that could have an effect not accounted for here.

What do you think?

More Resources

Other analysis

There’s a cool analysis on the CrossValidated blog on the effect of reputation on upvotes, which is related to the intriguing but currently dormant Polystats project that was kicking around the CrossValidated community.

For more general information, there’s a bunch of interesting and fairly up-to-date graphs on the various StackExchange sites here, including of course information on StackOverflow itself.

Every so often, people post on the data analysis tag on Meta StackOverflow.

Exploring the data

As mentioned above, you can download the whole data dump (around 9 gigs) of all the StackExchange sites here via BitTorrent. Each dump is cumulative, so just get the latest one. To analyze it, you’ll probably want to get it into some kind of database; I used this Python script to get it into SQLite, where I did some basic aggregation before importing into R. Also useful is this summary of the layout.
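For a sense of what that loading step looks like, here is a minimal sketch, not the script linked above. It assumes the dump’s posts file holds one <row .../> element per post with attributes such as Id, PostTypeId (1 for questions, 2 for answers), OwnerUserId, and CreationDate; check the layout summary against the dump you actually download.

```python
# Sketch: stream a StackExchange posts.xml dump into SQLite.
# iterparse keeps memory flat, which matters for a multi-gig file.

import sqlite3
import xml.etree.ElementTree as ET

def load_posts(xml_path, db_path):
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS posts
                    (id INTEGER PRIMARY KEY, post_type INTEGER,
                     owner INTEGER, created TEXT)""")
    for _, elem in ET.iterparse(xml_path):
        if elem.tag == "row":
            conn.execute("INSERT INTO posts VALUES (?, ?, ?, ?)",
                         (int(elem.get("Id")),
                          int(elem.get("PostTypeId")),
                          int(elem.get("OwnerUserId") or -1),  # anonymous -> -1
                          elem.get("CreationDate")))
            elem.clear()  # free each element once it's stored
    conn.commit()
    return conn

# A basic aggregation before handing off to R might then be:
# conn.execute("SELECT post_type, COUNT(*) FROM posts GROUP BY post_type")
```

From there, per-week question and answer counts are a GROUP BY over the creation date.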

Another awesome resource is the StackExchange Data Explorer, which allows you to run arbitrary SQL queries on the data (as long as you end up with less than 2000 results).

Another strategy is to use the API, which even has a front-end package in R. Unfortunately, you’re limited in the number of API requests you can make per day, and I found it more convenient simply to download the data myself. The API does provide up-to-date information, though, whereas the other options only go as far as the last data dump.

Credits

The graphs are of course created in R with ggplot2. The Lorenz curves are calculated with the help of the ineq package.

Animation

The animation itself was made with ffmpeg, with great help from this quick guide to making animations using ffmpeg. I’d also recommend checking out Yihui Xie’s work, especially his AniWiki, which contains many examples of creating animations in R.
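The general recipe behind such guides: have the plotting loop write one numbered PNG per week, then stitch the frames with ffmpeg. A minimal sketch of driving that step — the frame pattern, output name, and 10 fps rate are all placeholder assumptions, not the settings used for the video above:

```python
# Sketch: stitch numbered PNG frames into a video with ffmpeg.
# "plot%04d.png" means frames named plot0001.png, plot0002.png, ...;
# -r sets the frame rate and -y overwrites any existing output file.

import shutil
import subprocess

def ffmpeg_command(pattern="plot%04d.png", out="lorenz.mp4", fps=10):
    return ["ffmpeg", "-y", "-r", str(fps), "-i", pattern, out]

cmd = ffmpeg_command()
print(" ".join(cmd))
if shutil.which("ffmpeg"):  # only attempt the encode if ffmpeg is installed
    subprocess.run(cmd, check=False, capture_output=True)
```

You could just as easily run the printed command by hand in a shell; the wrapper only makes the frame rate and naming pattern explicit.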