Preprints have arrived. In increasing numbers, researchers across the life sciences are embracing the once-niche practice, shaking off decades of reluctance and posting hundreds of papers per week to preprint servers, sharing their findings with the community before embarking on the weary march through peer review. However, there are limited methods for individuals sifting through this avalanche of research to identify the preprints that are most relevant to their interests. Here, we describe Rxivist.org, a website that indexes all preprints posted to bioRxiv.org, the largest preprint server in the life sciences, and allows users to filter and sort papers based on download metrics and Twitter activity over a variety of categories and time periods. In this work, we hope to make it easier for readers to find relevant research on bioRxiv and to improve the visibility of preprints currently being read and discussed online.

Funding: RB is supported by the National Institutes of General Medicine (R35-GM128716) and a McKnight Land-Grant Professorship from the University of Minnesota. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2019 Abdill, Blekhman. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

A preprint is a publicly available academic paper that has not yet been published in a peer-reviewed journal. Though preprints took longer to take root in the life sciences than in fields such as physics and mathematics [1], more than 215,000 authors have posted preprints to bioRxiv.org [2], the website that now houses more biology preprints than all other major preprint servers combined [3]. The exponential growth of biology preprints has far outstripped even the largest pre-internet attempts [4] at circulating unrefereed publications: In the 1960s, the National Institutes of Health operated one such program, which mailed 2,561 different “memos” over the course of 6 years [5]. BioRxiv publishes that number of preprints every 5 weeks and now houses more than 47,000 papers across 27 disciplines [2], available not just to a rarefied cohort of academic subscribers but to everyone with access to the web. Although fewer than 200 preprints were posted per month in late 2015, a quick glance at a dozen titles per day is no longer sufficient to keep up with all of the new research appearing online.

BioRxiv (pronounced "Bio Archive") offers a conservative set of options for viewing these submissions: A standard text search includes the option to view the latest papers matching a search term, and preprints are broken down into 27 "subject areas" (e.g., "cancer biology," "bioinformatics," "immunology," and so on) that can be listed in reverse-chronological order. Email alerts also offer the option to receive notifications about new preprints matching search criteria. Despite these conveniences, the rapid (and expanding) rate of submissions is making the task of parsing these papers increasingly impractical: In the neuroscience category, bioRxiv's largest, 447 preprints were posted in March 2019 alone [2]. Although bioRxiv provides download data about each preprint, there is no way to use that information when searching.

The evaluation of these download counts and other “altmetrics” [6] is difficult to contextualize across and within fields (see Discussion). New metrics can also reinforce new incarnations of the “Matthew Effect,” a “rich get richer” dynamic in which famous scientists receive more attention (and citations) for their work [7]. Still, download metrics and Twitter activity present an interesting opportunity to organize preprints using metrics less arbitrary than the chronological ordering that bioRxiv offers. Using metadata continually collected about the full corpus of bioRxiv preprints, we built Rxivist.org (pronounced "Archivist"), a website enabling users to search, filter, and sort preprints based on download metrics and the number of Twitter messages linking to them. We hope this tool will be useful for researchers throughout the life sciences who either have too many preprints to read or are unfamiliar with the medium and are looking for somewhere to start.

Results

The Rxivist application is made of 2 pieces: an application programming interface (API) that provides preprint data in JavaScript Object Notation (JSON) format, and a Python-based website that uses this data to build lists of preprints that conform to a user's search parameters. These services provide human- or machine-readable access to browsable data on preprint altmetrics and a list of the preprints currently being discussed on Twitter.

Preprint listings

Users visiting the homepage will find the default search parameters already filled in, displaying the 20 preprints most discussed on Twitter.com since the beginning of the previous day. New preprints are pulled from bioRxiv 6 times per day, along with updated Twitter activity reflecting which ones are currently being discussed (see Methods). This means a visitor who checks Rxivist.org once per day should always find new content. In the default view, preprints with more than 110 tweets in the current day are marked with a "fire" icon to signify an exceptional number of tweets in that day. This level was selected by sorting recent daily Twitter data going back to September 2018 and determining what value would have resulted in 35 percent of nonweekend days having a “hot” paper. The search box (Fig 1) provides several options for modifying this search. Results can be restricted by category, a parameter that can be combined with a modified timeframe: Twitter data can be limited to the previous 1, 7, or 30 days or viewed without any time restrictions, which incorporates tweet counts dating back to early 2017. For example, a user could request the most discussed microbiology preprints of the last week (https://rxivist.org/?category=microbiology&timeframe=week).
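The selection of the "fire" icon cutoff can be illustrated with a minimal sketch. It assumes, hypothetically, that the daily data have already been reduced to the tweet count of each day's most-tweeted preprint; the function name, the input layout, and the sample numbers are illustrative and are not taken from the Rxivist code base.

```python
import math
from datetime import date, timedelta


def hot_threshold(daily_peaks, target_share=0.35):
    """Pick a cutoff that roughly `target_share` of weekdays would exceed.

    daily_peaks maps a date to the tweet count of that day's
    most-tweeted preprint (a hypothetical input format).
    """
    # Keep Monday (0) through Friday (4) only, highest peaks first.
    weekday_peaks = sorted(
        (peak for day, peak in daily_peaks.items() if day.weekday() < 5),
        reverse=True,
    )
    if not weekday_peaks:
        raise ValueError("no weekday data available")
    # Number of weekdays that should end up with a "hot" paper.
    hot_days = max(1, math.floor(target_share * len(weekday_peaks)))
    # A paper is "hot" when it has MORE tweets than the cutoff, so the
    # cutoff sits just below the peak of the last qualifying day.
    return weekday_peaks[hot_days - 1] - 1


# Invented sample: two weeks starting Monday, 4 March 2019.
values = [10, 20, 30, 40, 50, 500, 500, 60, 70, 80, 90, 100, 500, 500]
start = date(2019, 3, 4)
peaks = {start + timedelta(days=i): v for i, v in enumerate(values)}
cutoff = hot_threshold(peaks)
```

With this invented sample, the weekend peaks are ignored and 3 of the 10 weekdays exceed the returned cutoff. Ties at the cutoff value can push the realized share above the target, so the result is approximate.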

Fig 1. The top of the Rxivist.org homepage. The default results settings for the Rxivist.org homepage, including the search box and top results based on Twitter metrics (A) on 1 March 2019. Below the text search field (B) are 4 drop-down menus (C) that provide the other available parameters: which metric to use in the ranking process, whether to limit results to a particular category, the timeframe to which the metrics should be limited, and how many results to return at one time. https://doi.org/10.1371/journal.pbio.3000269.g001

Twitter metrics provide a strong signal for capturing the online day-to-day discussions about preprints, and daily readers will find that the Twitter search results generate a more dynamic list of recommendations. Monthly download data sourced from the bioRxiv website provide a longer-term sorting method that has lower resolution but may have a more direct connection to the actual readership of a particular preprint. Because downloads are only specified in monthly intervals and each preprint's metrics are updated in the Rxivist index about once every 2 weeks, users can choose from a smaller set of timeframes: downloads since the beginning of the previous month, year-to-date totals, or all-time totals. Category-level filtering is still available for these lists, so a user could ask, e.g., for the most downloaded bioinformatics preprints of the current year (https://rxivist.org/?metric=downloads&category=bioinformatics&timeframe=ytd). Several other pages segment the data in a different way: There is a page listing the most downloaded preprints of 2018 (https://rxivist.org/top/2018), which lists 25 preprints posted in that year and orders them based on downloads through December 2018. Similar pages are available for papers dating back to 2013 (https://rxivist.org/top/2013). In addition, a summary page (https://rxivist.org/stats) visualizes overall metrics for the bioRxiv collection, including monthly totals for submissions and downloads (Fig 2).
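The two search URLs quoted in this section share a small set of query parameters, so they can be assembled programmatically. The sketch below is a hypothetical helper, not part of Rxivist; it relies only on the parameter names (`metric`, `category`, `timeframe`) and values that appear in those URLs, and makes no claim about which other values the site accepts.

```python
from urllib.parse import urlencode

BASE_URL = "https://rxivist.org/"


def search_url(metric=None, category=None, timeframe=None):
    """Build a homepage search URL, omitting parameters left unset."""
    params = {"metric": metric, "category": category, "timeframe": timeframe}
    # Drop unset parameters so the site's defaults apply to them.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}?{query}" if query else BASE_URL


# Most discussed microbiology preprints of the last week.
twitter_url = search_url(category="microbiology", timeframe="week")

# Most downloaded bioinformatics preprints of the current year.
downloads_url = search_url(
    metric="downloads", category="bioinformatics", timeframe="ytd"
)
```

Both calls reproduce the example URLs given in the text; calling `search_url()` with no arguments returns the bare homepage address, which corresponds to the default view.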

Fig 2. The summary metrics page. The summary metrics page (https://rxivist.org/stats) includes a chart of submissions per month (A), plus a similar chart broken down by category (B) and a table showing the categories that have received the most preprints in the current month (C). Users can highlight individual categories in the chart using buttons (D). The final graph (E) plots total monthly downloads. https://doi.org/10.1371/journal.pbio.3000269.g002

Detailed profiles

In addition to generating sorted lists of preprints, the data scraped from bioRxiv is also used to create profile pages for each preprint and author that has been indexed. Each preprint has a profile page specifying its title, abstract, digital object identifier (DOI), and 2 plots of longitudinal download data: one visualizes downloads over time and the other shows how that preprint's total download count compares to all others (Fig 3). Whereas the histogram compares the paper to all other preprints on bioRxiv, each profile also includes download rankings in multiple timeframes, including all-time rankings both site-wide and within the category to which it was first posted. In-category rankings are probably the most informative of these comparisons, because download counts vary widely between categories [8].
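The difference between site-wide and in-category rankings can be shown with a short sketch. The record layout (`id`, `category`, `downloads`), the identifiers, and the download counts below are invented for illustration and are not the Rxivist schema.

```python
from collections import defaultdict


def rank_preprints(preprints):
    """Return {id: (site_rank, category_rank)} by all-time downloads.

    Ties keep their input order because Python's sort is stable;
    that tie-breaking rule is an assumption of this sketch.
    """
    ordered = sorted(preprints, key=lambda p: p["downloads"], reverse=True)
    seen_in_category = defaultdict(int)
    ranks = {}
    for site_rank, paper in enumerate(ordered, start=1):
        # The n-th paper encountered in a category has in-category rank n.
        seen_in_category[paper["category"]] += 1
        ranks[paper["id"]] = (site_rank, seen_in_category[paper["category"]])
    return ranks


# Invented records: one high-download and one low-download category.
sample = [
    {"id": "paper-a", "category": "neuroscience", "downloads": 900},
    {"id": "paper-b", "category": "microbiology", "downloads": 120},
    {"id": "paper-c", "category": "neuroscience", "downloads": 400},
    {"id": "paper-d", "category": "microbiology", "downloads": 80},
]
ranks = rank_preprints(sample)
```

In this invented sample, `paper-b` is only third site-wide but first within its own category, which is the kind of contrast that makes in-category rankings more informative when download counts differ between categories.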

Fig 3. Preprint-level metrics visualization. A screenshot showing typical graphs of download metrics displayed on an Rxivist profile page for an individual preprint. The left plot shows a single paper's downloads (y-axis) per month (x-axis), and the right plot is a histogram (with a log scale on the x-axis) of total downloads per preprint of all preprints on bioRxiv, including an indication of which bin includes the preprint in question. This example is from the page at https://rxivist.org/papers/10.1101/210294. https://doi.org/10.1371/journal.pbio.3000269.g003

Each paper profile page also includes an embedded visualization of data from Altmetric.com, a commercial service that indexes mentions of academic works (including preprints) within an expansive collection of social media platforms. Altmetric, like Crossref, does not offer a publicly available method of browsing these results. Each preprint's profile also includes a full author list that links to individual profile pages for each author.

Author profiles are more complex and combine data from all preprints attributed to that author based on name or open researcher and contributor identifier (ORCID ID; see Methods). Each author profile includes basic information, such as their name and the institutional affiliation specified on their most recent preprint, plus download rankings that compare authors based on the cumulative downloads of all their preprints, a total that is also displayed. Each author is given a site-wide rank for all-time downloads and also receives a ranking in each category to which they've posted a preprint. There is also a histogram (similar to the one in Fig 3) that shows the distribution of total downloads per author and indicates where the author in question falls. Beneath the download information is a list of all preprints for which the individual is listed as an author, plus the individual paper rankings for each.
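As a rough illustration of the author-level metrics described above, the sketch below sums downloads per author, ranks authors by that total, and locates a total in a log-scale histogram. It assumes authors have already been disambiguated into a single key (the name or ORCID matching is covered in Methods), and the record layout, the names, and the choice of 4 bins per factor of 10 are all hypothetical.

```python
import math
from collections import Counter


def author_totals(preprints):
    """Sum the downloads of every preprint listing each author."""
    totals = Counter()
    for paper in preprints:
        for author in paper["authors"]:
            totals[author] += paper["downloads"]
    return totals


def author_ranks(totals):
    """Rank authors by cumulative downloads, highest total first."""
    ordered = totals.most_common()
    return {name: rank for rank, (name, _) in enumerate(ordered, start=1)}


def log_bin(total, bins_per_decade=4):
    """Index of the log10-scale histogram bin containing `total`."""
    if total <= 0:
        return None  # zero downloads cannot be placed on a log axis
    return math.floor(math.log10(total) * bins_per_decade)


# Invented records for three hypothetical authors.
sample = [
    {"authors": ["Author A", "Author B"], "downloads": 300},
    {"authors": ["Author A"], "downloads": 200},
    {"authors": ["Author C"], "downloads": 250},
]
totals = author_totals(sample)
ranks = author_ranks(totals)
```

Each coauthor receives full credit for a preprint's downloads in this sketch, which matches the description of cumulative downloads across all of an author's preprints but is otherwise a simplification.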