Why does this data matter?

First off, it allows us to analyze, at scale, where Wikipedia gets its information from. Understanding the provenance of information used by Wikipedians also allows us to lift the veil on its gaps (the types of sources, languages, and perspectives that are not represented), which in turn can inform community efforts to improve coverage in underserved content areas. The data can also be reused by partners such as publishers, scholarly societies, and research projects to better understand how their works are used and found by the public.

Because the dataset is freely licensed under CC0, we hope researchers and partners will reuse and analyze this corpus for trends of interest to their fields and research projects. Critically, a list of the most cited sources also enables partners to try to make as many of them as possible accessible to readers. We can help drive digitization and open access efforts geared towards making the most commonly cited sources free to access online.

Finally, with citations as an indicator of factual currency, knowing what works are supporting our shared knowledge gives us a glimpse into popular understanding — both how we know what we know, and what we know most about.

Reactions from friends and partners

Since this dataset should empower others to do their own analysis and incorporate insights from Wikipedia’s citations, we asked some of our friends and partners what they thought.

“Wikipedia plays a crucial role in democratizing access to knowledge and enriching our understanding of the world,” said Heather Joseph, Executive Director of SPARC (the Scholarly Publishing and Academic Resources Coalition), and continued:

This new citation dataset provides a deeper level of transparency and trustworthiness to its content, and opens exciting new paths for people to learn, innovate and follow their curiosity.

Geoffrey Bilder, Director of Strategic Initiatives at Crossref, said:

We are delighted to see the Wikimedia Foundation release this dataset that shows which research is most often cited in Wikipedia articles. Over the past ten years, we’ve been monitoring the rapid growth of links between Wikipedia and research outputs. It appears that Wikipedia is increasingly taking on the role of the “review article” and is set to become the de facto starting place for the researchers exploring subjects they are unfamiliar with. This means Wikipedia has become a vital gateway that drives users to published research articles and, as such, it has become one of the top referrers of DOIs in the world.

Brewster Kahle, digital librarian, looked ahead:

At the Internet Archive, we believe in the value of verifiable information. We plan to use the citation data that Wikimedia Foundation has released to inform our digitization priorities, making the most important books available to researchers worldwide. We envision a future where every citation and reference in Wikipedia is a live link into a trusted repository like the Internet Archive, empowering every Wikipedia user to fact-check and verify the information they encounter online.

“This dataset is a powerful new way to track how knowledge moves from the leading edge of scientific research into the broader collective minds of humankind as a whole,” said Jason Priem, co-founder of ImpactStory. “We’ll use it to help us fine-tune our efforts with Unpaywall in our goal to make scholarly papers open and accessible to everyone.”

What’s next?

This work extends and complements data first released in 2015, created with a Python library designed by Aaron Halfaker and extended by Bahodir Mansurov. If you are planning to use this dataset, we encourage you to cite it using its canonical reference, “Citations with identifiers in Wikipedia” (hosted on FigShare).

This data release is only the first of many steps towards understanding how citations are used on Wikipedia. In the next few months, we’ll focus on additional analyses of citations in Wikimedia projects to understand how they are accessed by readers, since we care about the public being able to verify information that Wikipedia cites. We’ll also continue to work with partners to promote the use of this data and deepen our research into citation practices on Wikipedia.

As Wikipedia, a resource that aims to provide fact-based, neutral information that people can trust, becomes ever more ingrained in the fabric of the world’s knowledge, we need to understand and cultivate our citation culture and make sure we can constantly vet it for biases, gaps and omissions.