Wikipedia is capable of covering news like any news agency. Photo by Kai Mörk, freely licensed under CC BY 3.0 (Germany).

For almost fifteen years, the scope of topics that Wikipedia covers has grown steadily. Now, the free online encyclopedia covers everything from music, film and video games to geography, history, and the sciences. It also contains articles on topics trending in the news, updated by tens of thousands of volunteer editors as swiftly as the news breaks.

To investigate aspects of this phenomenon, such as the speed with which breaking news is covered on Wikipedia, the verifiability of information added over time, and the distribution of edits among Wikipedia’s editors, I selected an article for further analysis in the form of a dissertation.[1]

Comparing page views and daily edit counts for the article highlights key elements in the story’s development. Image by Joe Sutherland, freely licensed under CC BY-SA 4.0.

The article selected was “Shooting of Michael Brown”, which covers the killing of 18-year-old Michael Brown in Ferguson, Missouri, by police officer Darren Wilson. The incident attracted much press attention, fuelled by local protests in the St. Louis suburb. I observed the article’s history until January 12, 2015.

The resulting data was split into two “peaks” in the development of this story: the initial media scramble after protests began in mid-August, and the Ferguson grand jury’s decision not to indict Wilson for the teenager’s death in late November.[2] Each “peak” represented 500 individual “revisions” of the article in question. The use of peaks in this case allowed for cross-case analysis—that is, a direct comparison between two case studies.
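As a rough illustration of how different the two windows are, the sketch below computes the length of each peak from the timestamps given in note [2]. The names `PEAKS` and `peak_hours` are hypothetical, chosen only for this example.

```python
from datetime import datetime, timezone

# Window boundaries as given in note [2] (all times UTC).
PEAKS = {
    "first": (datetime(2014, 8, 16, 9, 38, tzinfo=timezone.utc),
              datetime(2014, 8, 18, 17, 54, tzinfo=timezone.utc)),
    "second": (datetime(2014, 11, 23, 0, 57, tzinfo=timezone.utc),
               datetime(2014, 12, 1, 22, 36, tzinfo=timezone.utc)),
}


def peak_hours(name):
    """Return the length of a peak's observation window in hours."""
    start, end = PEAKS[name]
    return (end - start).total_seconds() / 3600


for name in PEAKS:
    print(f"{name} peak: {peak_hours(name):.2f} hours")
```

The second window is roughly 3.8 times as long as the first, even though both contain the same 500 revisions.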

Speed of editing

Graphing the speed of editing across both peaks of development. Image by Joe Sutherland, freely licensed under CC BY-SA 4.0.

Notably, pageviews and edit rates didn’t line up as one might expect. Instead, there was a great flurry of edits a few days after the article was created, presumably as the editing community learned of the article’s existence or heard about the event. The speed of editing was incredibly fast during this initial period of rioting and press attention, though these speeds were highly inconsistent. The mean editing rate across this period was 18.57 edits per hour, more than eleven times the overall average for the article.

Media coverage, however, seems to have a much more acute impact on pageviews: when the grand jury’s decision on whether to indict Darren Wilson was announced in November, almost half a million people visited the article in just one day. A somewhat surprising observation was that this second peak produced much slower rates of editing. The mean for this period was just 7.21 edits per hour, roughly two and a half times slower than in the first. The rate was also very inconsistent, mirroring the first peak: editing speeds varied widely throughout both peaks and were largely unpredictable.

In terms of text added to the article, the first peak (which was observed over a much shorter period of time) saw an average of 501.02 bytes of text added per hour, some 3.6 times quicker than the rate of the second peak. By then, however, the article was much longer, and the likely explanation is that there was not much left to add by that point.

Use of sources

Judging the article’s accuracy is a very difficult task, one that would by its very nature be subjective and require in-depth knowledge of what happened in Ferguson that afternoon. I therefore looked instead at the verifiability of the article: specifically, the number of sources per kilobyte of text, referred to in this study as the article’s “reference density”.

“Reference densities” over each peak. Image by Joe Sutherland, freely licensed under CC BY-SA 4.0.

Ten samples were taken systematically for this research from each peak, and their references tallied. This was used in conjunction with the page’s size in kilobytes to find the reference density.
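The calculation described above can be sketched in a few lines. This is only an illustration under assumptions: the function name `reference_density` is hypothetical, a kilobyte is taken to be 1,000 bytes (the study may have used 1,024), and the sample values are placeholders rather than figures from the thesis.

```python
def reference_density(reference_count, page_size_bytes):
    """Return the number of references per kilobyte of article text.

    Assumes 1 kilobyte = 1,000 bytes; this is an assumption, not a
    detail taken from the study.
    """
    if page_size_bytes <= 0:
        raise ValueError("page size must be positive")
    return reference_count / (page_size_bytes / 1000)


# Hypothetical placeholder samples: (reference count, page size in bytes).
samples = [(40, 20_000), (90, 36_000)]
densities = [reference_density(refs, size) for refs, size in samples]
print(densities)
```

Dividing by page size is what lets a short early revision be compared with a much longer later one.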

In both peaks, the reference density steadily increased over time. It was significantly higher overall in the earlier peak, when the article was shorter and rapidly changing information required more verification. This rise in reference density over time likely indicates Wikipedia editors’ desire to ensure that the information they add is not removed as unverifiable.

The majority of sources used in the article were from publications which focus on print media. This is more obvious in the second peak than in the first: in the second, the local newspaper St. Louis Post-Dispatch became much more common among the article’s sources.

Origins of sources used within the article per peak. Image by Joe Sutherland, freely licensed under CC BY-SA 4.0.

Relatedly, it was discovered that a high volume of the sources were from media based in the state of Missouri, obviously local to the shooting location itself. The proportion falling into this category actually increased by the second peak, from just over 18 percent to just over a fifth of all sources. Other local sources which were regularly used in the article included the St. Louis American and broadcasters KTVI and KMOV.

It was the state of New York which provided the majority of sources, however; this seems to indicate that editors tend towards big-name, reputable sources such as the New York Times and USA Today, which both placed highly on ranking lists. Notably, the state of Georgia was almost exclusively represented by national broadcaster CNN, yet still made up around 10 percent of all sources used.

Range of contributors

Finally, the editing patterns of users were examined to judge the distribution of edits among a number of groups. To do this, users were placed into categories based on their rates of editing—which, for the purposes of this study, was defined as their mean edits per day. Categories were selected to divide editors as evenly as possible for the analysis, and six bots were excluded to prevent the skewing of results.

| Edits/day | Category | Users | % of users | Users with status | % with status |
|---|---|---|---|---|---|
| 40+ | Power users | 27 | 4.49% | 20 | 74.07% |
| 10–40 | Highly active users | 73 | 12.15% | 38 | 52.05% |
| 5–10 | Very active users | 67 | 11.15% | 26 | 38.81% |
| 1–5 | Active users | 105 | 17.47% | 19 | 18.10% |
| 0.1–1 | Casual users | 92 | 15.31% | 4 | 4.35% |
| 0.01–0.1 | Infrequent users | 62 | 10.32% | 0 | 0% |
| <0.01 | Very infrequent users | 13 | 2.16% | 0 | 0% |
| IPs | Anonymous users | 162 | 26.96% | 0 | 0% |
| Total/average | | 601 | 100% | 107 | 17.80% |
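The edits-per-day brackets listed above amount to a simple lookup. The following is a minimal sketch, not the method used in the thesis: the names `BRACKETS` and `categorise` are hypothetical, and treating each lower bound as inclusive is an assumption, since the source does not say how boundary values were handled.

```python
# (lower bound in mean edits per day, category name), highest bracket first.
# Lower bounds are treated as inclusive, which is an assumption.
BRACKETS = [
    (40, "Power users"),
    (10, "Highly active users"),
    (5, "Very active users"),
    (1, "Active users"),
    (0.1, "Casual users"),
    (0.01, "Infrequent users"),
]


def categorise(mean_edits_per_day, is_anonymous=False):
    """Map an editor's mean edits per day to one of the study's categories."""
    if is_anonymous:
        # IP editors form their own category regardless of editing rate.
        return "Anonymous users"
    for lower_bound, name in BRACKETS:
        if mean_edits_per_day >= lower_bound:
            return name
    return "Very infrequent users"


print(categorise(55))      # falls in the 40+ bracket
print(categorise(0.5))     # falls in the 0.1–1 bracket
```

Checking the brackets from highest to lowest means each editor lands in exactly one category.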

Clearly, the majority of users in the highly active and power users brackets hold some kind of status, whether that be the “rollback” tool given out by administrators, or elected roles such as administrator or bureaucrat. This at least implies that more daily edits can translate roughly into experience or trust on the project.

Looking at data added per category, highly active users were responsible for the largest share of the content added to the article, at over half of the total. However, breaking it down into mean content added per edit for each category provided some intriguing results.

Mean content added per edit, in bytes, per experience category. Image by Joe Sutherland, freely licensed under CC BY-SA 4.0.

While the highly active users take this crown too, it is a much closer race. Perhaps counterintuitively, “casual” editors (those with fewer than one edit per day, but more than 0.1) added an average of 95.81 bytes per edit, and the category directly below that added 93.70 bytes per edit. This suggests that article editing is not just done by the heavily active users on Wikipedia, but by a wide range of users with vastly different editing styles and experience.

Edits to the article were most commonly made by a very small group of users. Indeed, 58 percent of edits made to the article were by the top ten contributors, while over half of contributors made just one edit. Text added to the article followed the same pattern, though more pronounced: the same top ten contributed more than two-thirds of the article’s content. This lends weight to theories that Wikipedia articles tend to be worked on by a core “team”, while other individual editors contribute with more minor edits and vandalism reversion.
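Concentration figures of this kind can be derived from a list of revision authors. The sketch below is illustrative only: the function names `top_n_share` and `single_edit_fraction` are hypothetical, and the editor names are made-up placeholders rather than data from the article’s history.

```python
from collections import Counter


def top_n_share(editors, n=10):
    """Return the fraction of all edits made by the n most active editors."""
    counts = Counter(editors)
    if not counts:
        return 0.0
    top_edits = sum(count for _, count in counts.most_common(n))
    return top_edits / sum(counts.values())


def single_edit_fraction(editors):
    """Return the fraction of contributors who made exactly one edit."""
    counts = Counter(editors)
    if not counts:
        return 0.0
    one_timers = sum(1 for count in counts.values() if count == 1)
    return one_timers / len(counts)


# Placeholder history: one entry per revision, naming its (made-up) author.
revisions = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
print(top_n_share(revisions, n=2))
print(single_edit_fraction(revisions))
```

The same approach works for text added rather than edit counts if each revision’s byte change is summed per author instead of counted.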

Overall, the study shows that Wikipedia works on breaking news much like a traditional newsroom: verifiability is held in high regard, and a “core group” of editors tends to contribute the vast majority of the content. Editing rates, however, do not match up as obviously with peaks of media activity, which is worth investigating more qualitatively in future.

If you’re interested in reading the full thesis, it’s available from my website. For more academic research into Wikipedia, consider subscribing to the monthly Wikimedia Research newsletter.

Joe Sutherland, Wikimedia Foundation communications intern

Notes

[1] Others have done research into this area; their work, methods and outcomes heavily influenced this study. In particular, Brian Keegan’s work was instrumental in guiding the direction for this research. His 2013 study into breaking news, co-authored with Darren Gergle and Noshir Contractor, covers a far wider range than this thesis did.

[2] The first peak depicted is the 500 edits made between 09:38 UTC on 16 August 2014 and 17:54 UTC on 18 August 2014 (a period of 2 days, 8 hours and 16 minutes); the second is between 00:57 UTC on 23 November 2014 and 22:36 UTC on 01 December 2014 (a period of 8 days, 21 hours and 39 minutes).