[Other posts in this series: 2,3,4.]

I had the chance to have dinner tonight with Paul Ginsparg of arXiv fame, and he graciously gave me some feedback on a very speculative idea that I’ve been kicking around: augmenting — or even replacing — the current academic article model with collaborative documents.

Even after years of mulling it over, my thoughts on this aren’t fully formed. But I thought I’d share my thinking, however incomplete, after incorporating Paul’s commentary while it is still fresh in my memory. First, let me start with some of the motivating problems as I see them:

People still reference papers from 40 years ago for key calculations (not just for historical interest or apportioning credit). These papers often have typesetting so poor that they are hard to read, and they lack machine-readable text, URL links, etc.

Getting oriented on a topic often requires reading a dozen or more scattered papers with varying notation, where the key advances (as judged with hindsight) are mixed in with material that is much less important.

More specifically, papers sometimes have a small crucial idea that is buried in tangential details having to do with that particular author’s use for the idea, even if the idea has grown way beyond the author.

Some authors could contribute the key idea, but others could contribute clarity of thought, or make connections to other fields. In general these people may not know each other, or be able to easily collaborate.

There aren’t enough good review articles. When the marginal cost of producing a textbook is near zero, the fact that no one gets proper credit for writing good textbooks isn’t so bad, simply because you only need one or two good ones and the audience is huge; random fluctuations are sufficient for capturing most of the low-hanging fruit. However, as the potential audience shrinks, it becomes more and more important to set the rewards for writing good documents in line with the benefits to the community. (Conjecture: Hyperspecialization in technical fields leads to a gradual slowdown in progress as it becomes more and more difficult to learn everything that is needed to go beyond the state of the art. Part of this is just that learning the theory underlying a field requires you to learn all of that theory’s conceptual dependencies. We have been able to mitigate this problem by having researchers specialize and by creating dedicated schools and courses, but it becomes worse past the graduate level because we don’t have enough good incentives for didactic material. Relatedly, certain advances will come from combining ideas from multiple subfields where it’s not feasible to simultaneously be an expert in all of them. We need better ways to become well versed in a field without reaching expert level.)

The gap between textbook and review articles is way too large for anyone looking far from their area of expertise.

(Added:) Papers are missing important links (citations) to related work, either because the related work came after publication or because the authors weren’t aware of it. Backwards-citation services (“cited by” lists) only fix a small part of this.

The root problem is that academics are writing the same sort of fixed documents, either alone or with a tiny number of collaborators, that they have been writing for centuries. (The arXiv has let people easily release new versions of papers, which is a clear improvement, but this is a minor effect.) The natural solution is to make academic articles universally collaborative. Experience has shown that allowing many authors to contribute to the same document can let them produce something none were capable of individually. Needless to say, this would be a monumental shift. (It doesn’t necessarily require much top-down coordination, though. The ultimate goal could be to introduce a new norm in physics that everyone publishes their paper in a form that is free to be modified, just like the early ’90s brought a new norm for releasing work on the arXiv, which previously might have seemed very unlikely.) Let’s list some existing examples we might copy.

Wikipedia: This is the stupendously successful experiment that no one in their right mind would have predicted would work. Importantly, it shows that people will sometimes contribute to a collaborative document with no material reward whatsoever. However, Wikipedia’s popular accessibility gives it vastly more readers per author than academic documents could expect.

Scholarpedia: Like Wikipedia, but intended for academics. It uses a page-ownership model like the defunct Wikipedia cousins Nupedia and Citizendium: appointed experts own pages in their area of expertise, and other users can submit edits, but these must be approved by the owner. It requires more top-down administration than Wikipedia and is currently somewhat sparse, but it does have some very nice articles by top experts.

Knowen: A new upstart whose lofty aim is to be the Wikipedia of scientific knowledge. It claims to cover course notes, introductory texts, and new research all in the same framework (example). At the moment, it is just a skeleton. (Added: Knowen’s creator Ivar Martin has an illuminating comment below.)

Polymath: A “group blog” that has achieved some modest success obtaining novel mathematical results through an online discussion-and-writing format.

The StackExchange family of websites: The best question-and-answer forums on the web. They have demonstrated that properly constructed online norms and a structured discussion format can have a huge positive impact compared to a free-for-all.

GitHub: The gold standard in collaborative editing of software, and an important tool for open source. Very detailed tracking of contributions by authors (example), which is useful for setting up incentives.

A big contingent question is whether academic physics should stick with TeX or move to a different document-processing standard. (Personally, I think the ideal place to aim for is a new intermediate language somewhere between TeX and Markdown in complexity, but that’s a post for another day. For now, we could just really use a decent TeX editor.) After all, some standards are going to be vastly easier than others for tracking changes, merging different versions, etc. In this post I’m just going to put that issue aside. (You can imagine we’re sticking with TeX if you like.)

By looking at the examples above, we can identify at least three distinct models that might be adopted:

Ownership. Papers are posted to the arXiv just like now, with some fixed list of authors. Changes can be suggested (à la GitHub, Scholarpedia, or StackExchange) that the original authors can approve or reject. One could consider an even more modest version of this, where contributors could only submit comments (possibly consisting only of citations) that could be either approved or rejected by the authors, with the option of reply. The current process for in-journal discussion takes several months between replies, at the least. Although having dozens of small back-and-forth replies is tiresome and inefficient for the reader (as in a free-for-all blog post), it is likely that significant improvements can be made to traditional journals by allowing somewhat shorter and faster replies electronically. Here, we can learn a lot from the StackExchange websites. (Ideally, comments on papers could appear alongside the original document at the user’s discretion, perhaps with a voting system to determine which comments are worthwhile for casual readers. Or on each page there could be an adjustable slider that would filter comments made by the authors, commenters approved by the authors, any arXiv user, or the public at large.)

Forking. Papers are posted to the arXiv just like now, but anyone can make changes and fork the new version as a separate paper without affecting the original posting.

Open-wiki. Papers are constructed from scratch like on Wikipedia, with free-for-all authorship. The earliest version of an article need not stand on its own, but might (say) consist only of an outline.
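To make the slider idea concrete, here is a minimal sketch in Python of filtering comments by a trust tier and a vote threshold. The `Comment` fields, tier names, and sample data are all hypothetical, not a description of any existing system:

```python
from dataclasses import dataclass

# Hypothetical trust tiers, ordered from most to least restrictive.
TIERS = ["author", "author-approved", "arxiv-user", "public"]

@dataclass
class Comment:
    text: str
    tier: str   # one of TIERS
    votes: int  # net up-votes from readers

def visible_comments(comments, slider_tier, min_votes=0):
    """Return comments at or above the slider's trust tier,
    keeping only those with enough votes."""
    allowed = set(TIERS[: TIERS.index(slider_tier) + 1])
    return [c for c in comments
            if c.tier in allowed and c.votes >= min_votes]

comments = [
    Comment("Typo in Eq. 3", "public", 5),
    Comment("See also arXiv:xxxx.yyyy", "arxiv-user", 2),
    Comment("Author clarification", "author", 0),
]

# Slider set to "arxiv-user": the public comment is hidden.
print([c.text for c in visible_comments(comments, "arxiv-user")])
```

A casual reader could leave the slider near "author", while someone digging into a controversy could open it up to the public tier and raise `min_votes` to see only well-supported remarks.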

In principle, all of these models might co-exist with each other and with traditional papers, although I imagine some would lose out eventually. The ownership method is probably the smallest change and the one that academics would have least difficulty adjusting to.

The above discussion mostly concentrates on the various technical aspects, but these can all very likely be handled with time and money. Much more challenging are the substantial social obstacles that Paul and I spent a while discussing.

Author control. Authors often have enough difficulty getting along with their immediate collaborators, so there will doubtless be some resistance to having their work edited by strangers. This will be entangled with how attribution is done, and with whether an ownership, forking, or open-wiki model is adopted.

Attribution. Each model will have different possible methods for attributing content to different authors. Are all people who contributed listed as authors, or do the original authors appear in a distinguished position? Are individual changes indexed? At what level of granularity? With the right diffing software, it’s already possible to track arbitrarily small changes to TeX files from many users in a reasonable way.
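As a rough illustration of how such line-level attribution could work, here is a minimal sketch using Python’s standard `difflib`. The TeX snippets and user names are hypothetical, and real systems would need to handle moved and merged lines far more carefully:

```python
import difflib

def attribute_changes(old_lines, new_lines, editor, attribution):
    """Credit `editor` with each line added relative to the old version.
    `attribution` maps line text -> the user who last wrote it."""
    for line in difflib.ndiff(old_lines, new_lines):
        if line.startswith("+ "):          # line added in the new version
            attribution[line[2:]] = editor
    return attribution

# Original TeX source, credited to the first author.
v1 = [r"\section{Introduction}", r"Quantum mechanics is linear."]
attribution = {line: "alice" for line in v1}

# A second user adds a citation to one sentence.
v2 = [r"\section{Introduction}",
      r"Quantum mechanics is linear \cite{dirac}."]
attribute_changes(v1, v2, "bob", attribution)

print(attribution[r"Quantum mechanics is linear \cite{dirac}."])  # bob
```

Untouched lines keep their original attribution, so a granular per-line history accumulates automatically as edits arrive.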

Evaluation. How will others evaluate an author’s contribution when it can now be arbitrarily minor? Are hiring committees going to check which individual lines you added to a document? Many similar issues have not stopped GitHub from assuming a bit of the role of a CV in the software industry.

Author incentives. Given the possibilities for evaluation, will authors have enough incentive to contribute meaningfully to collaboratively edited documents? (Could disincentives arise because of academic infighting that are not found in Wikipedia?)

Network effects. We can expect that it will be difficult for any collaborative editing forum to get started, since a lot of the potential benefits have strong network effects. (For example, no one is going to value your contributions on such a forum during your job hunt if they haven’t heard of it yet.) This could be mitigated by driving the frictions to editing as low as possible, so that the forum captures the natural human urge to improve something even without reward that Wikipedia seems to tap into. One might also leverage the existing clout of the arXiv by convincing its advisory board to endorse or integrate a collaborative forum into the site.

There’s also the issue of licensing. Collaborative editing requires that academics release their work under more permissive licenses than they currently do. Paul was kind enough to look up how many of the papers submitted to the arXiv were released under each of the four available licenses (absolute numbers and percentages):

The minimal arXiv license, which does not allow for collaborative editing of the original document, clearly dominates. Part of this is because it is the default option and there’s currently not much reason to change it. But part of it is that people are wary of giving up the rights to their hard work, and will default to maintaining as much control as possible unless they have good reason to do otherwise. So this is another non-trivial barrier. (There are annoying copyright interactions with any journal an article might potentially be published in. An increase in collaborative licensing would accelerate the (good) move away from closed-access journals, but that also means you can expect serious pushback from existing publishers.)

As a closing side note, I was pleased to hear from Paul that he expects the arXiv will eventually include a space on each article’s abstract page where the author can provide links to related videos, papers, course notes, data, etc. This should make it vastly easier for people to link up video abstracts and video lectures to their work, which I am strongly in favor of.

Edit: Tempered certainty about custom links on abstract pages.

Edit 2: Added examples of new citations as useful paper modifications.

Edit 3: Paperbricks is another idea, with video here. Here is jsweojtj’s description:

There is an awful lot of redundancy and wasted effort that goes into most papers, from introductions that need to be rewritten every time (when linking to a solid introduction would be both better and less time-consuming). Each piece of a full paper (intro, data, analysis, …) could be peer-reviewed and published individually. A full paper could then be built from these paper-bricks. Anyway, I recommend reading the paper as it’s well written and clear.

More motivation suggested by pickle27:

A paper could rely on a critical reference to build upon and the referenced paper could be disproven down the line but this is not immediately obvious from the paper that used it.

Currently it doesn’t seem like any merit is given to researchers who are very good at reviewing papers. Compare this to software, where a good code review is celebrated. Editing and cleaning up the state of science should be valued when scientists are looking for work, so I think that something along the lines of a GitHub CV for scientists would be valuable.

Edit 4: See Hessam Mehr’s comment below for more possible advantages of collaborative documents.

Edit 5: Many commenters have helpfully reminded me of these existing collaborative editing tools that we could look to as examples: Authorea, ShareLaTeX, Overleaf (formerly WriteLaTeX), and Fidus Writer. See this Nature News article for discussion. However, none of these allow for universal collaboration or a plausible path to credible and permanent attribution, so even if they succeeded they mostly wouldn’t answer the question of what the central repository (or other arXiv successor) would look like. And that’s the key part.

Likewise, see SciRate, ThinkLab, and PubPeer for discussion/annotation of articles, but without distillation toward a refined document. And see Force11 for some interesting but non-committal discussion about how future scientific works should be produced (h/t hyperion2010).

Edit 6: coliveira makes a good point that the initial presentation and discussion of results is not suitable for a wiki, since it will always be necessary to give the author some stable platform upon which to present some of their tentative work, and likewise for the tentative discussion by critics. So some form of ownership will be necessary for some formats. However, this platform could often be much more modest than a full journal article for many of the incremental results currently found in journals.

Edit 7: The Stacks Project is a neat example of a large collaborative document with a dedicated website. (H/t tobilehman.) I can’t assess its mathematical quality, but it looks very professional at a quick glance. This is more evidence, like Authorea and its siblings, that collaborative tools are actually fairly well developed these days. So that’s probably not the sticking point.

Edit 8: Bas Spitters: “With two dozen researchers we collaboratively wrote a 600 page book [the Homotopy Type Theory textbook] in less than half a year using github”. The book and relevant blog post. Amazing. I still don’t understand how they managed to convince people to do this.

Edit 9: See CodaLab for collaborative tools for computational research and CaseText for crowdsourced forward-citation and annotation system for the law. (H/t graphific and jacech.)

Edit 10: Discrete Analysis is a new arXiv overlay journal launched by Fields medalist Timothy Gowers. (Nature News coverage. H/t Ivar Martin below.) There is also much to be learned from the Stanford Encyclopedia of Philosophy, especially the funding model. Note also that the arXiv administration board is toying with the idea of giving authors the ability to link to external material.

Edit 11: I was recently enlightened to the fact that the Stacks Project isn’t nearly as collaborative as I had thought, with the large majority of the 5000+ page document written by Aise Johan de Jong of Columbia.

Edit 12: Here’s an example of overlay commenting from Fermat’s library, which highlights a paper every week for community commenting. Fun, but not intended to be scalable.

Edit 13: joelg from HN says this:

I’m working at the MIT Media Lab on PubPub (http://www.pubpub.org), a free platform for totally open publishing designed to solve a lot of these problems: One is peer review, which, as some have already mentioned, needs to be done in an open, ongoing, and interactive forum. Making peer review transparent to both parties (and the public) makes everyone more honest.

Another is the incentive of publication itself as the ultimate goal. Instead, we need to think of documents as evolving, growing bodies of knowledge and compilations of ongoing research. Every step of the scientific process is important, yet most of it is flattened and compressed and lost, like most negative results, which are ditched in search of sexy click-bait headliner results.

Another is the role of publishers as gatekeepers and arbiters of truth. We need a medium in which anyone can curate a journal, and in which submission, review, and acceptance procedures are consistent and transparent.

Another is the nature of the medium itself. It’s 2016, and these dead, flat, static PDFs are functionally identical to the paper they replaced! Insert your favorite Bret Victor/Ted Nelson rant here: we need modern, digitally-native documents that are as rich as the information they contain.

Another is reproducibility. We should be able to see the code that transformed the raw dataset, tweak it, and publish our own fork, while automatically keeping the thread of attribution.

Edit 14: Last year Tobias Osborne started tooling around with GitHub to write a collaborative paper on “What is a quantum field state?“.

Edit 15: Some two years after this post appeared, Overleaf has consumed ShareLaTeX. Here’s a recent comparison of Overleaf and Authorea.

Edit 16: Other notable developments: (1) The online textbook “Real World Haskell” allows comments on each section to get reader feedback, a cool feature I hadn’t seen implemented before. See here for thoughts from one of the authors near the completion of the book: “We have received 7153 comments so far. That’s an average of 1.73 comments per paragraph.” The book appears to have been published and the website has now stagnated, but the code for the book and the website is still available. (H/t Philip Goyal and John Goerzen.) (2) The Distill journal is focused on pedagogy in machine learning. Each article appears on GitHub so you can make pull requests (although it’s pretty cumbersome compared to Wikipedia). See the editors’ manifesto and this article by Michael Nielsen and Shan Carter.

Edit 17: UpVote.pub, a minimalist but as-yet unsuccessful SciRate competitor.

Edit 18: The authors of “ChaosBook”, available for free online, answer the question Why a webbook?.