The internet enables science to be shared in real-time at a low cost to a global audience. This development has decreased the barriers to making science open, while supporting new massively collaborative models of research [1]. However, the scientific community requires tools whose workflows encourage openness [2]. Manuscripts are the cornerstone of scholarly communication, but drafting and publishing manuscripts has traditionally relied on proprietary or offline tools that do not support open scholarly writing, in which anyone is able to contribute and the contribution history is preserved and public. We introduce Manubot, a new tool and infrastructure for authoring scholarly manuscripts in the open, and report how it was instrumental for the collaborative project that led to its creation.

Based on our experience leading a recent open review [3], we discuss the advantages and challenges of open collaborative writing, a form of crowdsourcing [4]. Our review manuscript [5] was code-named the Deep Review and surveyed deep learning’s role in biology and precision medicine, a research area undergoing explosive growth. We initiated the Deep Review in August 2016 by creating a GitHub repository (https://github.com/greenelab/deep-review) to coordinate and manage contributions. GitHub is a platform designed for collaborative software development that is adaptable for collaborative writing. From the start, we made the GitHub repository public under a Creative Commons Attribution License (CC BY 4.0 at https://github.com/greenelab/deep-review/blob/master/LICENSE.md). We encouraged anyone interested to contribute by proposing changes or additions. Although we invited some specific experts to participate, most authors discovered the manuscript organically through conferences or social media, deciding to contribute without solicitation. In total, the Deep Review attracted 36 authors, who were not determined in advance, from 20 different institutions in less than two years.

The Deep Review and other studies that subsequently adopted the Manubot platform were unequivocal successes bolstered by the collaborative approach. However, inviting wide authorship brought many technical and social challenges such as how to fairly distribute credit, coordinate the scientific content, and collaboratively manage extensive reference lists. The manuscript writing process we developed using the Markdown language, the GitHub platform, and our new Manubot tool for automating manuscript generation addresses these challenges.

Manubot supports citations by adding a persistent identifier like a Digital Object Identifier (DOI) or PubMed Identifier (PMID) directly in the text so that large groups of authors do not have to coordinate reference lists. When text is changed, Manubot automatically updates the manuscript’s web page so that all authors can read and edit from the latest version. Because manuscripts are created from GitHub repositories, Manubot supports a workflow where all edits are reviewed and discussed, ensuring that the collaborative text has a cohesive style and message and that authors receive precise credit for their work. These and other features support an open collaborative writing process that is not feasible with other writing platforms.

Existing platforms work well for editing text and are widely used for scholarly writing. However, they often lack features that are important for open collaborative writing, such as versatile version control and multiple permission levels. For example, Manubot is the only platform listed in Table 1 that offers the ability to address thematically related changes together and enables multiple authors to iteratively refine proposed changes.

A summary of features that differentiate Manubot from existing collaborative writing platforms. We assessed features in June 2018 using the free version of each platform and updated our assessment in April 2019 to add the features in the bottom three rows and re-evaluate Authorea and Overleaf. Some platforms offer additional features through a paid subscription or software. 1) Additional functionality, such as bibliography management and tracking changes, is available by editing the Word document stored in OneDrive with the paid Word desktop application. 2) Conversations about modifications take place on the document as comments, annotations, or unsaved chats. There is no integrated forum for discussing and editing revisions. 3) In some circumstances, Overleaf Git commits are not modular. Edits made by distinct authors may be attributed to a single author. The GitHub Sync feature attributes all edits to the project owner.

There are many existing collaborative writing platforms (Table 1) [6]. In general, platforms with “what you see is what you get” (WYSIWYG) editors, such as Microsoft Word or Google Docs, require the least technical expertise to use. On the flip side, WYSIWYG platforms can be difficult to customize and incorporate into automated computational workflows. Traditionally, LaTeX has been used for these needs, since documents are written in plain text and the system is open source and extensible. Rendering LaTeX documents requires specialized software, but webapps like Overleaf now enable collaborative authoring of LaTeX documents. Nonetheless, LaTeX-based systems are limited in that PDF (or similar) is the only fully supported output format. Alternatively, Authorea is a collaborative writing webapp whose primary output format is HTML. Authorea allows authors to write in Markdown, a limited subset of LaTeX, or their WYSIWYG HTML editor.

Although we developed Manubot with collaborative writing in mind, it can also be helpful for individuals preparing scholarly documents. Authors may choose to make their changes directly to the master branch, forgoing pull requests and reviews. This workflow retains many of Manubot’s benefits, such as transparent history, automation, and allowing outside contributors to propose changes. In cases where outside contributions are unwanted, authors can disable pull requests on GitHub. It is also possible to use Manubot on a private GitHub repository. Private manuscripts require some additional customization to disable GitHub Pages and may require a paid continuous integration plan. See the existing manuscripts for examples of the range of contribution workflows and Manubot use cases.

GitHub issues can also be used for formal peer review by independent or journal-selected reviewers. A reviewer conducting open peer review can create issues using their own GitHub account, as one reviewer did for this manuscript (https://github.com/greenelab/meta-review/issues/124). Alternatively, a reviewer can post feedback with a pseudonymous GitHub account or have a trusted third party such as a journal editor post their comments anonymously. Authors can elect to respond to reviews in the GitHub issues or a public response letter (https://github.com/greenelab/meta-review/blob/v3.0/content/response-to-reviewers.md), creating open peer review.

The total number of words added to the Deep Review by each author is plotted over time (final values in parentheses). These statistics were extracted from Git commit diffs of the manuscript’s Markdown source. This figure reveals the composition of written contributions to the manuscript at every point in its history. The Deep Review was initiated in August 2016, and the first complete manuscript was released as a preprint [10] in May 2017. While the article was under review, we continued to maintain the project and accepted new contributions. The preprint was updated in January 2018, and the article was accepted by the journal in March 2018 [5]. As of March 6, 2019, the Deep Review repository had accumulated 755 Git commits, 317 merged pull requests, 609 issues, and 819 GitHub stars. The notebook to generate this figure can be interactively launched (https://mybinder.org/v2/gh/greenelab/meta-review/binder?filepath=analyses/deep-review-contrib/02.contrib-viz.ipynb) using Binder [11], enabling users to explore alternative visualizations or analyses of the source data.

We found that this workflow was an effective compromise between fully unrestricted editing and a more heavily structured approach that limited the authors or the sections they could edit. In addition, authors are associated with their commits, which makes it easy for contributors to receive credit for their work. Fig 2 and the GitHub contributors page (https://github.com/greenelab/deep-review/graphs/contributors) summarize all edits and commits from each author, providing aggregated information that is not available on most other collaborative writing platforms. Because the Manubot writing process tracks the complete history through Git commits, it enables detailed retrospective contribution analysis. These pull request and contribution tracking examples both come from Deep Review, the largest Manubot project to date, but illustrate the general principles of transparency and collaboration that are shared by all open Manubot manuscripts.
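As a rough illustration of how per-author word counts could be derived from Git commit diffs, the sketch below counts the words on added lines of a unified diff and sums them by author. It is a minimal sketch under stated assumptions, not the analysis used for Fig 2: the function names `count_added_words` and `words_by_author` are hypothetical, and the input is assumed to be plain unified-diff text paired with an author name.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def count_added_words(diff_text: str) -> int:
    """Count whitespace-separated words on added lines of a unified diff."""
    total = 0
    for line in diff_text.splitlines():
        # Added lines start with "+"; a "+++" prefix marks the file header.
        # (An added line whose content itself begins with "++" would be
        # skipped by this simple check, which is an accepted limitation.)
        if line.startswith("+") and not line.startswith("+++"):
            total += len(line[1:].split())
    return total


def words_by_author(commits: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Sum added words per author from (author, diff_text) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    for author, diff_text in commits:
        totals[author] += count_added_words(diff_text)
    return dict(totals)


# Hypothetical diff, for illustration only.
EXAMPLE_DIFF = """\
--- a/content/02.intro.md
+++ b/content/02.intro.md
@@ -1,2 +1,3 @@
 Unchanged sentence.
-Old sentence removed.
+New sentence added here.
+Another one.
"""
```

Counting only added lines means that rewording a sentence is credited as new words, so totals of this kind measure volume of contribution rather than net manuscript length.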

The Deep Review issue (https://github.com/greenelab/deep-review/issues/575) and pull request (https://github.com/greenelab/deep-review/pull/638) on protein-protein interactions demonstrate this process in practice. A new contributor identified a relevant research topic that was missing from the review manuscript and gave examples of how the literature would be summarized, critiqued, and integrated into the review. A maintainer confirmed that this was a desirable topic and referred to related open issues. The contributor made the pull request, and two maintainers and another participant made recommendations. After four rounds of reviews and pull request edits, a maintainer merged the changes.

Any reader can contribute to a Manubot manuscript by proposing a change through a pull request. This example involves three people: a manuscript Maintainer, an existing project Contributor, and an additional Participant in the discussion. Manuscript text is shown in solid lines on the left of the timeline and discussion on GitHub is shown by squiggly lines to the right of the timeline. The Contributor opens a GitHub issue to discuss a manuscript modification. The Maintainer and the Participant provide feedback in the issue, and the Maintainer recommends creating a GitHub pull request to update the text. The Contributor creates the pull request. It is reviewed by the Maintainer and the Participant, and the Contributor updates the pull request in response. Once the pull request is approved, the Maintainer merges the changes into the official version of the manuscript.

GitHub and the underlying Git version control system [7,8] also structure the writing process. The official version of the manuscript is forked by individual contributors, creating a copy they can freely modify. A contributor then adds and revises files, grouping these changes into commits. When the changes are ready to be reviewed, the series of commits is submitted as a pull request through GitHub, which notifies other authors of the pending changes. GitHub’s review interface allows anyone to comment on the changes, globally or at specific lines, asking questions or requesting modifications [9]. Conversations during review can reference other pull requests, issues, or authors, linking the relevant people and content (Fig 1). Reviewing batches of revisions that focus on a single theme is more efficient than independently discussing isolated comments and edits and helps maintain consistent content and tone across different authors and reviewers. Once all requested modifications are made, the manuscript maintainers, a subset of authors with elevated GitHub permissions, formally approve the pull request and merge the changes into the official version. The process of writing and revising material can be orchestrated through GitHub with a web browser (as shown in S1 Video) or through a local text editor.

Manubot’s collaborative writing workflow adopts standard software development strategies that enable any contributor to edit any part of the manuscript but enforce discussion and review of all proposed changes. The GitHub platform supports organizing and editing the manuscript. Manubot projects use GitHub issues for organization, opening a new issue for each discussion topic. For example, in a review manuscript like the Deep Review, this includes each primary paper under consideration. Within a paper’s issue, contributors summarize the research, discuss it (sometimes with participation from the original authors), and assess its relevance to the review. In a primary research article, issues can instead track progress on specific figures or subsections of text being drafted. Issues serve as an open to-do list and a forum for debating the main messages of the manuscript.

Manubot features

Manubot is a system for writing scholarly manuscripts via GitHub. For each manuscript, there is a corresponding Git repository. The master branch of the repository contains all of the necessary inputs to build the manuscript. Specifically, a content directory contains one or more Markdown files that define the body of the manuscript as well as a metadata file to set information such as the title, authors, keywords, and language. Figures can be hosted in the content/images subdirectory or elsewhere and specified by URL. Repositories contain scripts and other files that define how to build and deploy the manuscript. Many of these operations are delegated to the manubot Python package or other dependencies such as Pandoc, which converts between document formats, and Travis CI, which builds the manuscript in the cloud. Manubot pieces together many existing standards and technologies to encapsulate a manuscript in a repository and automatically generate outputs.

Markdown. With Manubot, manuscripts are written as plain-text Markdown files. The Markdown standard itself provides limited yet crucial formatting syntax, including the ability to embed images and format text via bold, italics, hyperlinks, headers, inline code, codeblocks, blockquotes, and numbered or bulleted lists. In addition, Manubot relies on extensions from Pandoc Markdown to enable citations, tables, captions, and equations specified using the popular TeX math syntax. Markdown with Pandoc extensions supports most formatting options required for scholarly writing [12] but currently lacks the ability to cross-reference and automatically number figures, tables, and equations. For this functionality, Manubot includes the pandoc-xnos suite of Pandoc filters. A list of formatting options officially supported by Manubot, at the time of writing, is viewable as raw Markdown (https://github.com/manubot/rootstock/raw/091ca8d85c8ef2d7af16fcc8d2ed3ebcbc187f13/content/02.delete-me.md) and the corresponding rendered HTML (https://manubot.github.io/rootstock/v/091ca8d85c8ef2d7af16fcc8d2ed3ebcbc187f13/). By virtue of its readable syntax, Markdown is well suited for version control using Git. Markdown treats a single line break between text as a space and requires two-or-more consecutive line breaks to denote a new paragraph. For optimal tracking of Markdown files with Git, we recommend placing each sentence on its own line. This convention allows Git to display diffs on a per sentence basis, avoids unnecessary reflows associated with line wrapping, and supports easy rearrangement of sentences.
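The one-sentence-per-line convention can be applied mechanically. The following is a minimal sketch, assuming prose-only Markdown and a naive sentence boundary (terminal punctuation followed by whitespace). The function name `one_sentence_per_line` is hypothetical, and the approach would mis-split abbreviations such as "e.g." and should not be run over headers, lists, or code blocks.

```python
import re


def one_sentence_per_line(markdown_text: str) -> str:
    """Reflow prose paragraphs so that each sentence sits on its own line."""
    # Two or more consecutive line breaks separate paragraphs.
    paragraphs = re.split(r"\n{2,}", markdown_text.strip())
    reflowed = []
    for paragraph in paragraphs:
        # A single line break is equivalent to a space in Markdown.
        joined = " ".join(line.strip() for line in paragraph.splitlines())
        # Naive boundary: ".", "!" or "?" followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", joined)
        reflowed.append("\n".join(s for s in sentences if s))
    return "\n\n".join(reflowed)
```

Because single line breaks render as spaces, the reflowed source produces the same output document while giving Git one sentence per diff line.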

Citation by identifier. Manubot includes an additional layer of citation processing, currently unique to the system. All citations point to a standard identifier, for which Manubot automatically retrieves bibliographic metadata such as the title, authors, and publication date. Table 2 presents the supported identifiers and example citations before and after Manubot processing. Authors can optionally define citation tags to provide short readable alternatives to the citation identifiers. Citation metadata is exported to the Citation Style Language (CSL) JSON Data Items format, an open standard that is widely supported by reference managers [13,14]. However, sometimes external resources provide Manubot with invalid CSL Data, which can cause errors with downstream citation processors, such as pandoc-citeproc (http://hackage.haskell.org/package/pandoc-citeproc). Therefore, Manubot removes invalid fields according to the CSL Data specification (https://github.com/citation-style-language/schema). In cases where automatic retrieval of metadata fails or produces incorrect references, which is most common for URL citations, users can manually provide the correct metadata using common reference formats. Manual metadata also supports references without standard identifiers, such as print-only newspaper articles.
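To make the idea of citing by identifier concrete, the sketch below pulls citation keys out of manuscript text and resolves named tags to their underlying identifiers. It is illustrative only and is not Manubot's implementation. The `@prefix:identifier` key syntax and the regular expression are assumptions based on Pandoc-style citations, and `extract_citations` is a hypothetical name.

```python
import re
from typing import Dict, List, Optional

# Assumed key syntax: "@", a lowercase prefix, ":", then the identifier.
# The identifier ends at whitespace or at one of "]", ";", ",".
CITATION_PATTERN = re.compile(r"@([a-z]+):([^\s\];,]+)")


def extract_citations(text: str, tags: Optional[Dict[str, str]] = None) -> List[str]:
    """Return unique citations in order of first appearance, with tags resolved."""
    tags = tags or {}
    found: List[str] = []
    for prefix, identifier in CITATION_PATTERN.findall(text):
        # Drop sentence-ending punctuation that trails the identifier.
        identifier = identifier.rstrip(".")
        citation = f"{prefix}:{identifier}"
        if prefix == "tag":
            # An undefined tag raises KeyError so the mistake is not silent.
            citation = tags[identifier]
        if citation not in found:
            found.append(citation)
    return found
```

Resolving tags before deduplication means a work cited once by tag and once by DOI appears a single time in the reference list.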


Table 2. Citation types supported by Manubot. Manubot allows users to cite different types of persistent identifiers. Metadata source indicates the primary resource used to retrieve bibliographic metadata. For certain identifier types, additional metadata sources are queried should the primary fail. For example, when translation-server ISBN lookup fails, Manubot tries Wikipedia’s Citoid (https://www.mediawiki.org/wiki/Citoid) service followed by the isbnlib (https://github.com/xlcnd/isbnlib) Python package. When translation-server URL lookup fails, Manubot then tries Greycite (http://greycite.knowledgeblog.org/) [15]. Raw citations enable citing works when no supported persistent identifiers exist, but require that the user specifies the metadata. Finally, authors may optionally map a named tag to any of the supported identifier types. In this example, the tag avasthi-preprints represents the DOI identifier 10.7554/eLife.38532. API: application programming interface. https://doi.org/10.1371/journal.pcbi.1007128.t002

Manubot formats bibliographies according to a CSL style specification. Styles define how references are constructed from bibliographic metadata, controlling layout details such as the maximum number of authors to list per reference. Manubot’s default style emphasizes titles and electronic (rather than print) identifiers and applies numeric-style citations [23]. Alternatively, users can also choose from thousands of predefined styles (http://editor.citationstyles.org/searchByName/) or build their own [24]. As a result, adopting the specific bibliographic format required by a journal usually just requires specifying the style’s source URL in the Manubot configuration.
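The fallback behavior described in the Table 2 caption, where a secondary metadata source is queried when the primary one fails, follows a common pattern that can be sketched as below. This is a generic illustration, not Manubot's code. The retriever functions are hypothetical stand-ins that make no network requests, and the ISBN shown is a placeholder.

```python
from typing import Callable, Dict, List, Sequence

Retriever = Callable[[str], Dict[str, str]]


def retrieve_with_fallback(identifier: str, retrievers: Sequence[Retriever]) -> Dict[str, str]:
    """Try each metadata source in order and return the first successful result."""
    errors: List[str] = []
    for retriever in retrievers:
        try:
            return retriever(identifier)
        except Exception as error:  # any failure moves on to the next source
            errors.append(f"{retriever.__name__}: {error}")
    raise LookupError(f"all metadata sources failed for {identifier!r}: " + "; ".join(errors))


# Hypothetical stand-ins for real services, for illustration only.
def lookup_primary(isbn: str) -> Dict[str, str]:
    raise ConnectionError("service unavailable")


def lookup_secondary(isbn: str) -> Dict[str, str]:
    return {"type": "book", "ISBN": isbn}


PLACEHOLDER_ISBN = "0000000000"  # not a real ISBN
metadata = retrieve_with_fallback(PLACEHOLDER_ISBN, [lookup_primary, lookup_secondary])
```

Collecting the error from each failed source keeps the final exception informative when every lookup fails.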

Format conversion. Manubot uses Pandoc (https://pandoc.org/) to convert manuscripts from Markdown to HTML, PDF, and optionally DOCX outputs. Pandoc also supports Journal Article Tag Suite (JATS), a standard format for scholarly articles that is used by publishers, archives, and text miners [25–27]. Pandoc’s JATS support provides an avenue to integrate Manubot with the larger JATS ecosystem. In the future, journals may accept submissions in JATS. For now, Manubot’s DOCX output is usually sufficient for journal submissions that require an editable source document. Otherwise, authors generally use the PDF output for preprint and initial journal submissions. The primary Manubot output is HTML intended to be viewed in a web browser. Accordingly, manuscripts natively support JavaScript and can thus include any web-based interactive visualization, such as those produced using Vega-Lite (https://vega.github.io/vega-lite/), Bokeh (https://bokeh.pydata.org/), or Plotly (https://plot.ly/) [28,29].

Interactive features and appearance. Manubot comes with several “plugins” that can be included in manuscripts exported as HTML. These plugins add special interactive features that enhance the user experience of viewing and reading manuscripts (Fig 3). For example, with the “tooltips” plugin enabled, when the user hovers over a link to a reference or figure, a preview of that item pops up above the link, along with controls to navigate between other mentions of that item elsewhere in the document. The build process can also accommodate different “themes”, which change the general aesthetics and appearance of the exported document (e.g. from a contemporary sans-serif style to a more traditional serif style). The architecture of the plugins and themes is designed to provide authors with enough flexibility to suit their particular needs and preferences.


Fig 3. Examples of the various Manubot plugins, illustrating their functionality and usefulness. Screenshots were taken from existing manuscripts made with Manubot: Sci-Hub Coverage Study (https://greenelab.github.io/scihub-manuscript/v/fd7acb7ed0108c920da56f84819ce13f02f68aa8/) and TPOT-FSS (https://trang1618.github.io/tpot-fss-ms/), available under the CC BY 4.0 License. Clarifying markups are overlaid in purple. https://doi.org/10.1371/journal.pcbi.1007128.g003

The Manubot “front-end” (layout, look, controls, behavior, etc.) was developed in line with current best practices and user expectations of the modern web. The plugins use standard technology built in to most major web browsers, allowing them to be relatively lightweight, modular, and easy to configure.

Continuous publication. Manubot performs continuous publication: Every update to a manuscript’s source is automatically reflected in the online outputs. The approach uses continuous integration (CI) [30–32], specifically via Travis CI, to monitor changes. When changes occur, the CI service attempts to generate an updated manuscript. If this process is error free, the CI service timestamps the manuscript and uploads the output files to the GitHub repository. Because the HTML manuscript is hosted using GitHub Pages, the CI service automatically deploys the new manuscript version when it pushes the updated outputs to GitHub. Using CI to build the manuscript automatically catches many common errors, such as misspelled citations, invalid formatting, or misconfigured software dependencies. To illustrate, the source GitHub repository for this article is https://github.com/greenelab/meta-review. When this repository changes, Travis CI rebuilds the manuscript (https://travis-ci.org/greenelab/meta-review). If successful, the output is deployed back to GitHub (to dedicated output and gh-pages branches). As a result, https://greenelab.github.io/meta-review stays up to date with the latest HTML manuscript. Furthermore, versioned URLs, such as https://greenelab.github.io/meta-review/v/4b6396bcefd1b9c7ddf39c1d3f0b3eab2dd63f31/, provide access to previous manuscript versions.
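The versioned URLs in the example above follow a visible pattern: the manuscript's base URL, then `/v/`, then the full commit hash. A small helper can construct them, as sketched below. The function name `versioned_url` is hypothetical, and the pattern is inferred from the example URLs in this article rather than from a documented specification.

```python
import re

# A full Git commit hash is 40 hexadecimal characters (SHA-1).
COMMIT_PATTERN = re.compile(r"[0-9a-f]{40}")


def versioned_url(base_url: str, commit: str) -> str:
    """Build the URL of the manuscript version built from a given commit."""
    if not COMMIT_PATTERN.fullmatch(commit):
        raise ValueError(f"expected a 40-character hexadecimal commit hash, got {commit!r}")
    return f"{base_url.rstrip('/')}/v/{commit}/"
```

Rejecting abbreviated hashes keeps each constructed link tied to exactly one commit.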

Timestamping. The idea of the “priority of discovery” is important to science, and Vale and Hyman discuss the importance of both disclosure and validation [33]. In their framework, disclosure occurs when a scientific output is released to the world. However, for a manuscript that is shared as it is written, being able to establish priority could be challenging. Manubot supports OpenTimestamps (https://opentimestamps.org/) to timestamp the HTML and PDF outputs on the Bitcoin blockchain. This procedure allows one to retrospectively prove that a manuscript version existed prior to its blockchain-verifiable timestamp [17,34–37]. Timestamps protect against attempts to rewrite a manuscript’s history and ensure accurate histories, potentially alleviating certain authorship or priority disputes. Because all Bitcoin transactions compete for limited space on the blockchain, the fees required to send a single transaction can be high. OpenTimestamps minimizes fees by encoding many timestamps into a single Bitcoin transaction, enabling the service to be free of charge [38]. Since transactions can take up to a few days to be made, Manubot initially stores incomplete timestamps and upgrades them in future continuous deployment builds. We find that this asynchronous design with timestamps precise to the day is suitable for the purposes of scientific writing.
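The way many timestamps can share a single transaction is easiest to see with a Merkle tree: each file is hashed, the hashes are combined pairwise until one root remains, and only that root needs to be recorded. The sketch below illustrates the general technique with SHA-256. It is not the OpenTimestamps proof format, whose tree construction and commitment operations differ in detail, and the function names are hypothetical.

```python
import hashlib
from typing import List, Sequence


def sha256(data: bytes) -> bytes:
    """Return the SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).digest()


def merkle_root(leaves: Sequence[bytes]) -> bytes:
    """Combine leaf digests pairwise until a single root digest remains."""
    if not leaves:
        raise ValueError("at least one leaf digest is required")
    level: List[bytes] = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Illustrative choice: pair an unmatched final node with itself.
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


# Digests of three hypothetical manuscript outputs.
digests = [sha256(b"manuscript.html"), sha256(b"manuscript.pdf"), sha256(b"other.pdf")]
root = merkle_root(digests)
```

Any single digest can later be shown to belong to the root by supplying the sibling hashes along its path, so the cost of one transaction is shared by every document in the tree.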

Reproducible manuscripts. Manubot and its dependencies are free of charge and largely open source. It does rely on gratis services from two proprietary platforms: GitHub and Travis CI. Fortunately, lock-in to these services is minimal, and several substitutes already exist. Manubot is a substantial step towards end-to-end document reproducibility, where every figure or piece of data in a manuscript can be traced back to its origin [39], and it is well suited for preserving provenance. For example, figures can be specified using versioned URLs that refer to the code that created them. In addition, manuscripts can be templated, so that numerical values or tables are inserted directly from the repository that created them. The Fig 2 caption provides examples of templates. Phrases such as “755 Git commits” are written as {{total_commits}} Git commits so that the commit count can be automatically updated.
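The templating idea can be illustrated with a few lines of placeholder substitution. This is a minimal sketch using only the standard library. It does not reflect how Manubot itself implements templating, which this article does not describe, and the function name `render_template` is hypothetical.

```python
import re
from typing import Mapping

# Matches placeholders such as {{total_commits}} or {{ total_commits }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(text: str, variables: Mapping[str, object]) -> str:
    """Replace each {{name}} placeholder with the matching variable's value."""

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in variables:
            # Fail loudly instead of leaving a stale or empty value in the text.
            raise KeyError(f"no value provided for template variable {name!r}")
        return str(variables[name])

    return PLACEHOLDER.sub(substitute, text)


caption = render_template(
    "{{total_commits}} Git commits, {{merged_prs}} merged pull requests",
    {"total_commits": 755, "merged_prs": 317},
)
```

Raising an error for an undefined variable makes the build fail, so a continuous integration run would catch the gap before publication.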

Getting started. An example repository at https://github.com/manubot/rootstock, referred to as Rootstock, demonstrates Manubot’s features and serves as a template for users to write their own manuscripts with Manubot. The current setup process includes cloning the Rootstock repository, rebranding it to the user’s manuscript, and configuring continuous integration. The setup process is complex but must only be performed once per manuscript. Incorporating new Manubot features into an existing manuscript is also possible by pulling the latest commits from Rootstock, which sometimes involves resolving Git conflicts. Contributing to a manuscript is less technical and can be performed entirely through GitHub’s web interface, as discussed in the contribution workflow section and demonstrated in S1 Video. Interested readers can practice editing a demo manuscript at https://github.com/manubot/try-manubot. At the 2019 Pacific Symposium on Biocomputing, we led a working group where 17 conference participants contributed to a different demo manuscript (https://git.dhimmel.com/psb-manuscript/). Based on this experience, we believe most computational scholars have the expertise to contribute to a Manubot manuscript. Proficiency with Manubot requires familiarity with Markdown, Git, GitHub, and continuous integration. While these tools do present a barrier to entry, they are also highly applicable outside of Manubot and increasingly part of the standard curriculum for computational scholars. For example, Markdown is used for documenting Jupyter and R Markdown notebooks.

Existing manuscripts. Since its creation to facilitate the Deep Review, Manubot has been used to write a variety of scholarly documents. The Sci-Hub Coverage Study (https://github.com/greenelab/scihub-manuscript), performed openly on GitHub from its inception, investigated Sci-Hub’s repository of pirated articles [40]. Sci-Hub reviewed (https://github.com/greenelab/scihub-manuscript/issues/17) the initial preprint from this study in a series of tweets, pointing out a major error in one of the analyses. Within hours, the authors used Markdown’s strikethrough formatting in Manubot to cross out the errant sentences (commit at https://github.com/greenelab/scihub-manuscript/commit/8fcd0cd665f6fb5f39bed7e26b940aa27d4770ba, versioned manuscript at https://greenelab.github.io/scihub-manuscript/v/8fcd0cd665f6fb5f39bed7e26b940aa27d4770ba/), thereby alerting readers to the mistake and preventing further propagation of misinformation. One month later, a larger set of revisions (https://github.com/greenelab/scihub-manuscript/pull/19) explained the error in more detail and was included in a second version of the preprint. As such, continuous publication via Manubot helped the authors address the error without delay, while retaining a public version history of the process. This Sci-Hub Coverage Study preprint was the most viewed (http://web.archive.org/web/20171221221858/http://www.prepubmed.org/top_preprints/) 2017 PeerJ Preprint, while the Deep Review was the most viewed 2017 bioRxiv preprint [41]. Hence, in Manubot’s first year, two of the most popular preprints were written using its collaborative, open, and review-driven authoring process.
Additional research studies are being authored using Manubot, spanning the fields of regulatory genomics (https://vsmalladi.github.io/tfsee-manuscript/ and https://simonvh.github.io/gimmemotifs-manuscript/) [42], synthetic biology (https://zach-hensel.github.io/low-noise-manuscript/) [43], climate science (https://openclimatedata.github.io/global-emissions/), visual perception (https://laurentperrinet.github.io/2019-05_illusions-visuelles/) [44], machine learning (https://trang1618.github.io/tpot-fss-ms/) [45], computational toolkits (https://jmonlong.github.io/manu-vgsv/) [46], and data visualization (https://yt-project.github.io/yt-3.0-paper/). Manubot is also being used for documents beyond traditional journal publications, such as research tips (https://benjamin-lee.github.io/deep-rules/), quality standards (https://indigo-dc.github.io/sqa-baseline/) [47], grant proposals (https://greenelab.github.io/manufund-2018/), progress reports (https://greenelab.github.io/czi-hca-report/), undergraduate research reports (https://zietzm.github.io/Vagelos2017/) [48], literature reviews (https://slochower.github.io/synthetic-motor-literature/), and lab notebooks. Finally, manuscripts written with other authoring systems have been successfully ported to Manubot, including the Bitcoin Whitepaper (https://git.dhimmel.com/bitcoin-whitepaper/) [49] and Project Rephetio (https://git.dhimmel.com/rephetio-manuscript/) manuscript [50].

Citation utilities. The manubot Python package provides easy access to Manubot’s citation-by-identifier infrastructure, whose functionality extends beyond just Manubot manuscripts. For example, the Kipoi (https://kipoi.org/) model zoo for genomics [51] uses Manubot’s Python interface to retrieve model authors from persistent identifiers. In addition, the manubot cite command line utility takes a list of citations and returns either a rendered bibliography or CSL Data Items (i.e. JSON-formatted reference metadata). For example, the following command outputs a Markdown reference list for the two specified articles according to the bibliographic style of PeerJ:

    manubot cite --render --format=markdown \
      --csl=https://github.com/citation-style-language/styles/raw/master/peerj.csl \
      pmid:29618526 doi:10.1038/550143a

Pandoc brands itself as a “universal document converter”, and can convert from any of 32 input formats to any of 51 output formats as of version 2.7. Thanks to its versatility and active development since 2006, Pandoc enjoys a large userbase across many disciplines and applications. Its filter interface enables adding custom functionality with community-developed programs. We are prototyping a Manubot-based citation-by-identifier filter. This filter would allow Pandoc users to cite persistent identifiers as part of their existing Pandoc workflows, without requiring them to adopt other aspects of Manubot. It could help popularize citation-by-identifier at an influential scale.
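For readers who want to call the manubot cite utility from a script, the sketch below assembles the argument list for the command quoted earlier in this section. It only builds the command and does not run it. The `--render`, `--format`, and `--csl` flags are taken from that example and may differ in other Manubot versions, and `build_cite_command` is a hypothetical helper.

```python
import shlex
from typing import List, Optional, Sequence


def build_cite_command(
    citations: Sequence[str],
    csl_url: Optional[str] = None,
    output_format: str = "markdown",
) -> List[str]:
    """Assemble arguments for a rendered bibliography from `manubot cite`."""
    if not citations:
        raise ValueError("at least one citation is required")
    # Flags copied from the example command in the text above.
    command = ["manubot", "cite", "--render", f"--format={output_format}"]
    if csl_url:
        command.append(f"--csl={csl_url}")
    command.extend(citations)
    return command


command = build_cite_command(
    ["pmid:29618526", "doi:10.1038/550143a"],
    csl_url="https://github.com/citation-style-language/styles/raw/master/peerj.csl",
)
# A shell-safe rendering of the same command, for logging or copy-paste.
printable = " ".join(shlex.quote(part) for part in command)
```

Passing the list to `subprocess.run(command, check=True, capture_output=True, text=True)` would execute it, which requires the manubot package to be installed and network access for metadata retrieval.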