The rise of computational science has led to unprecedented opportunities for scientific advance. Ever more powerful computers enable the investigation of theories that seemed almost intractable a decade ago, robust hardware technologies allow data collection in the most inhospitable environments, more data are collected, and an increasingly rich set of software tools is now available with which to analyse computer-generated data.

However, there is the difficulty of reproducibility, by which we mean the reproduction of a scientific paper’s central finding, rather than exact replication of each specific numerical result down to several decimal places. We examine the problem of reproducibility (for an early attempt at solving it, see ref. 1) in the context of openly available computer programs, or code. Our view is that we have reached the point that, with some exceptions, anything less than release of actual source code is an indefensible approach for any scientific results that depend on computation, because not releasing such code raises needless, and needlessly confusing, roadblocks to reproducibility.

At present, debate rages on the need to release computer programs associated with scientific experiments2,3,4, with policies still ranging from mandatory total release to the release only of natural language descriptions, that is, written descriptions of computer program algorithms. Some journals have already changed their policies on computer program openness; Science, for example, now includes code in the list of items that should be supplied by an author5. Other journals promoting code availability include Geoscientific Model Development, which is devoted, at least in part, to model description and code publication, and Biostatistics, which has appointed an editor to assess the reproducibility of the software and data associated with an article6.

In contrast, less stringent policies are exemplified by statements such as7 “Nature does not require authors to make code available, but we do expect a description detailed enough to allow others to write their own code to do similar analysis.” Although Nature’s broader policy states that “...authors are required to make materials, data and associated protocols promptly available to readers...”, and editors and referees are fully empowered to demand and evaluate any specific code, we believe that its stated policy on code availability actively hinders reproducibility.

Much of the debate about code transparency involves the philosophy of science, error validation and research ethics8,9, but our contention is more practical: that the cause of reproducibility is best furthered by focusing on the dissection and understanding of code, a sentiment already appreciated by the growing open-source movement10. Dissection and understanding of open code would improve the chances of both direct and indirect reproducibility. Direct reproducibility refers to the recompilation and rerunning of the code on, say, a different combination of hardware and systems software, to detect the sort of numerical computation11,12 and interpretation13 problems found in programming languages, which we discuss later. Without code, direct reproducibility is impossible. Indirect reproducibility refers to independent efforts to validate something other than the entire code package, for example a subset of equations or a particular code module. Here, before time-consuming reprogramming of an entire model, researchers may simply want to check that incorrect coding of previously published equations has not invalidated a paper’s result, to extract and check detailed assumptions, or to run their own code against the original to check for statistical validity and explain any discrepancies.
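
As a concrete illustration of the numerical computation problems mentioned above, consider a minimal sketch of our own (not drawn from any published model): floating-point addition is not associative, so the same values summed in a different order, or at a different intermediate precision, can give a different answer. The values below are purely illustrative.

    import math

    # Purely illustrative data: a large positive term, a small term and a
    # large negative term, repeated many times.
    values = [1e16, 1.0, -1e16] * 1000

    # Naive left-to-right accumulation: each 1.0 is absorbed into 1e16
    # and lost to rounding.
    naive = 0.0
    for v in values:
        naive += v

    # Correctly rounded sum of exactly the same values.
    exact = math.fsum(values)

    print(naive)  # 0.0
    print(exact)  # 1000.0

A compiler that reorders the accumulation, or hardware that carries extra precision in intermediate results, can legitimately produce yet another answer from the same source; without the code itself, a reader cannot even establish which evaluation order was used.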

Any debate over the difficulties of reproducibility (which, as we will show, are non-trivial) must of course be tempered by recognizing the undeniable benefits afforded by the explosion of internet facilities and the rapid increase in raw computational speed and data-handling capability that has occurred as a result of major advances in computer technology14. Such advances have presented science with a great opportunity to address problems that would have been intractable in even the recent past. It is our view, however, that the debate over code release should be resolved as soon as possible to benefit fully from our novel technical capabilities. On their own, finer computational grids, longer and more complex computations and larger data sets—although highly attractive to scientific researchers—do not resolve underlying computational uncertainties of proven intransigence and may even exacerbate them.

Although our arguments are focused on the implications of Nature’s code statement, that statement is symptomatic of a wider problem: the scientific community places more faith in computation than is justified. As we outline below and in two case studies (Boxes 1 and 2), ambiguity in its many forms and numerical errors render natural language descriptions insufficient and, in many cases, unintentionally misleading.

Box 1: The United Kingdom Meteorological Office produces (in conjunction with the University of East Anglia’s Climatic Research Unit) the downloadable and widely used gridded temperature anomaly data sets known as HadCRUT and CRUTEM3. Yet even such a high-profile data set, developed by an organization with a good standard of software development34, contained errors that would have been identified and rectified more quickly had the underlying code been readily available. In 2009, on examining the available data sets and the description of the algorithm35, J.G.-C. identified a number of errors (the software he used to check the meteorological database is available upon request).

One set of errors was procedural, and involved incorrect computation of historical average temperatures in a number of records in New Zealand and Australia. The Meteorological Office confirmed the errors, showed that they had resulted in errors of up to 0.2 °C (either warmer or cooler) in the average temperature for Australia and New Zealand in some years before 1900, and issued an update to CRUTEM3. Two other errors occurred in the coding of the calculation of station errors (an estimate of the error in any average temperature reading). When these were corrected, station errors were slightly reduced, improving the accuracy of the data.

So, although these implementation problems did not lead to serious errors in the temperature data sets, they highlight the difficulty of translating a natural-language description (even with some formulae expressed mathematically) into code. These errors do not in any way reflect badly on the original authors: the code rewriting simply plays the part of peer review, and it is normal to find such errors. Indeed, the discovery of such errors in ‘working’ software is exceedingly common in all computing, even when the software has been in use for a considerable time. This was emphatically demonstrated in a seminal IBM study36, which found that fully a third of the software failures studied took longer than 5,000 execution years (execution time being the total time spent running a program) to occur for the first time.
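
As a purely hypothetical illustration of the ambiguity described in Box 1 (this is not the CRUTEM3 code, nor the Meteorological Office’s algorithm), the sketch below shows how even a simple instruction such as “take the station’s average temperature over the base period” admits two reasonable implementations that disagree once months are missing; the station record and figures are invented.

    import statistics

    # Invented monthly mean temperatures for one hypothetical station;
    # None marks a missing month.
    record = {
        1961: [5.0, 6.0, 9.0, 12.0, 16.0, 19.0, 21.0, 20.0, 17.0, 12.0, 8.0, 5.5],
        1962: [4.0, None, 8.0, 11.0, 15.0, 18.0, 20.0, 19.5, 16.0, 11.0, 7.0, 4.5],
        1963: [6.0, 7.0, 10.0, 13.0, 17.0, 20.0, 22.0, 21.0, 18.0, 13.0, 9.0, 6.0],
    }

    # Reading 1: pool every available monthly value and average once.
    pooled = [t for months in record.values() for t in months if t is not None]
    normal_pooled = statistics.mean(pooled)

    # Reading 2: average each year first, then average the yearly means,
    # so every year carries equal weight regardless of missing months.
    yearly_means = [
        statistics.mean([t for t in months if t is not None])
        for months in record.values()
    ]
    normal_by_year = statistics.mean(yearly_means)

    print(round(normal_pooled, 3), round(normal_by_year, 3))  # the two readings differ

Neither reading is wrong in itself; the point is that a natural-language description leaves the choice, and therefore the published numbers, underdetermined, whereas the released code settles it immediately.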