Sci-Hub’s cache of pirated papers is so big, subscription journals are doomed, data analyst suggests

There is no doubt that Sci-Hub, the infamous—and, according to a U.S. court, illegal—online repository of pirated research papers, is enormously popular. (See Science’s investigation last year of who is downloading papers from Sci-Hub.) But just how enormous is its repository? That is the question biodata scientist Daniel Himmelstein at the University of Pennsylvania and colleagues recently set out to answer, after an assist from Sci-Hub.

Their findings, published in a preprint on the PeerJ journal site on 20 July, indicate that Sci-Hub can instantly provide access to more than two-thirds of all scholarly articles, an amount that Himmelstein says is “even higher” than he anticipated. For research papers protected by a paywall, the study found Sci-Hub’s reach is greater still, with instant access to 85% of all papers published in subscription journals. For some major publishers, such as Elsevier, more than 97% of their catalog of journal articles is being stored on Sci-Hub’s servers—meaning they can be accessed there for free.

Given that Sci-Hub has access to almost every paper a scientist would ever want to read, and can quickly obtain requested papers it doesn’t have, could the website truly topple traditional publishing? In a chat with Science Insider, Himmelstein concludes that the results of his study could mark “the beginning of the end” for paywalled research. This interview has been edited for clarity and brevity.

Q: What made you want to look at the size of Sci-Hub’s coverage?

A: It all started when Sci-Hub tweeted the list of all the articles that they had stored in their repositories on 19 March. I thought: “Wow, we can learn so much about their operations and coverage that we couldn’t before.” Most people knew that Sci-Hub provided access to some of the scholarly literature, but the question was how much.

Q: How did you approach this calculation?

A: The main step was figuring out how many scholarly articles exist. For that we used data from Crossref, which has a database of journal identifiers, or DOIs [digital object identifiers]. It’s not the only one, but it is by far the most common one in scholarly publishing. After making some exclusions, we compiled a list of 81.6 million articles. This step was important because it gave us the denominator for the equation. Previous analyses of Sci-Hub’s coverage didn’t really get this step right: to see what percentage of the literature Sci-Hub has, you need to know the total amount.
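The calculation Himmelstein describes reduces to a set comparison: count how many of Crossref’s DOIs also appear in Sci-Hub’s list, then divide by the Crossref total (the denominator). A minimal sketch in Python, with made-up DOIs purely for illustration; the actual study worked over the full 81.6-million-item Crossref list:

```python
def coverage(scihub_dois, crossref_dois):
    """Fraction of the literature (per Crossref) held by Sci-Hub."""
    # Intersect first, so DOIs in the Sci-Hub dump that are not in the
    # Crossref denominator don't inflate the numerator.
    covered = scihub_dois & crossref_dois
    return len(covered) / len(crossref_dois)

# Tiny worked example (fabricated identifiers):
crossref = {"10.1/a", "10.1/b", "10.1/c", "10.1/d"}   # the denominator
scihub = {"10.1/a", "10.1/b", "10.1/c", "10.9/other"}  # the dump
print(f"{coverage(scihub, crossref):.0%}")  # 3 of 4 articles → 75%
```

The same ratio computed per publisher or per journal gives the breakdowns mentioned below, such as Elsevier’s more-than-97% figure.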

Q: What were the main findings of your study?

A: The simplest result was that Sci-Hub contains 69% of all scholarly articles. We also found that the site preferentially covers articles from closed-access publishers and high-impact journals. [Editor’s Note: A breakdown can be found here.] I think it’s interesting that Elsevier and the American Chemical Society had some of the highest coverage, and those are the publishers that have sued Sci-Hub. Maybe they realized that basically their entire corpus was in Sci-Hub. There were a lot of journals for which Sci-Hub has every single article.

Q: What about the other 31%?

A: Just because an article isn’t in Sci-Hub’s database, that doesn’t mean Sci-Hub can’t get it for you. We estimated that Sci-Hub was able to fulfill requests 99% of the time, which suggests the 31% of articles it doesn’t already have are ones that people really aren’t requesting.

Q: Did you look at how coverage varied by academic discipline?

A: Yes. There was some variation between fields, but I think it’s probably less than people have speculated in the past. At the top was chemistry, with 93% coverage, and at the low end was computer science, at 76%. The results could be linked to publishing practices in those fields: we found that closed-access journals had more coverage than open-access ones.

Q: Sci-Hub has faced a number of legal challenges—do you think these will stop it?

A: In our paper we have a graph plotting the history of Sci-Hub against Google Trends—each legal challenge resulted in a spike in Google searches [for the site], which suggests the challenges are basically generating free advertising for Sci-Hub. I think the suits are not going to stop Sci-Hub.

Q: How do you think Sci-Hub will evolve in the future?

A: In the paper we mentioned that there are technologies coming that would allow you to host files without any central point of failure, so going forward Sci-Hub, or a service like it, could still provide access to all these papers, but there wouldn’t be any domain or one person behind it. Right now, if the servers for Sci-Hub were found they could be seized and destroyed.

Q: Do you really foresee a time when librarians would endorse Sci-Hub over paying for journal access?

A: I don’t think librarians would ever endorse it, given the legal risk of instructing someone to do something illegal. But in a way they already do. Many libraries nowadays can’t provide 100% access to the scholarly literature. Globally, it’s a pretty small percentage of universities that offer full access.

Q: Is there anything publishers could do to stop new papers being added to Sci-Hub’s repository?

A: There are things they could do, but those could backfire terribly. The issue is that the more protective publishers are, the more difficult they make legitimate access, and that could drive people to use Sci-Hub.

Q: What do you hope the impact of this study will be?

A: I think the larger picture of this study is that this is the beginning of the end for subscription scholarly publishing. At this point, I think it is inevitable that the subscription model will fail and that more open models will become necessary. One motivation for doing the study was that I want to bring that eventuality about more quickly.