Setting journal policies takes time. When I believe I’ve settled on one, John Pham, the Editor-in-Chief of Cell, helps me think about it more clearly. A few months ago, I asked John, “Should a study be rejected simply because its n is very small?” I was expecting John to say yes, but instead, he turned my question on its head. The question he posed in return was, “Is a study ever so thought provoking that it needs to be published even if the n is very small?” John’s pointed, careful, and hopeful observations about these questions helped me recognize that there are good reasons to answer “yes” and “no” to both. Playing through them reveals a tension between being a protective custodian of science on one hand and facilitating progress on the other. Consistency, standards, and rigor are front of mind when you're a protective custodian, but flexibility, creativity, and exploration need to be supported in order to facilitate progress. Somehow, journal policy must both protect science and help it grow.

When Cell Systems launched three-and-a-half years ago, we required that the datasets reported in our papers be made freely available, but we weren’t confident that the community would support a similar policy about code. Since then, we’ve come to believe that transparently reporting code is not optional. We now require that code be made available (for more specific information, click here). Given the current state of the art, this is an appropriate standard. However, a higher standard has been defined by the NIH Data Commons. They hold the principle that data and code should be FAIR—that is, findable, accessible, interoperable, and reusable. The FAIR standards are admirable and forward-thinking. The idea of requiring that data and code be FAIR is on our horizon, but the technology and infrastructure needed to be FAIR aren't yet available widely enough to make that requirement reasonable at Cell Systems. Annually, we will consider whether it's time to require FAIRness. In the meantime, I'll discuss three reasons why I look forward to adopting the FAIR standards as soon as we're confident that all of our authors will have access to the technology and infrastructure they need to meet them. In discussing these reasons, I hope to illustrate why shifting thinking away from downloading and towards reuse is powerful.

First, focusing on reuse sharpens the problem of reproducibility in the biological sciences. Set aside, for a moment, the very real difficulties stemming from variability in samples and reagents, and their confounding effects on our ability to exactly replicate experiments. In principle, data and a particular version of code should be static, and the computational environment should be defined and stable, so analyses should be exactly replicable. When they’re not, the issues are in the human world, not the natural one. That’s a transformative distinction.

An anecdote I heard recently: a research team uploaded a large dataset, which I’ll call version 1, into a publicly available database, and then downloaded it again, yielding version 2. A basic check revealed that version 1 could not be compared sensibly to version 2. Chasing down the sources of the discrepancies was an odyssey. The team found batch effects in their own data that had previously been cryptic despite their best efforts and good work. They learned that measurement standards are applied in non-standard ways across their community and, likewise, that “standard operating procedures” aren’t put into reliable practice. Their experience demonstrated that ontologies are well-constructed, curated, and applied in some parts of the field, but not others. These problems are large, but they’re not intrinsic to biology. They’ve been solved in other parts of science and society. Concrete solutions exist; they can be learned, adapted, and applied.
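The editorial doesn’t describe the team’s actual check, but a “basic check” of this kind can be as simple as comparing cryptographic digests of the uploaded and re-downloaded files. The sketch below is purely illustrative (the function names and file layout are my own assumptions, not the team’s workflow): if the two digests differ, the round trip through the database has changed the data, and the discrepancy-chasing begins.

```python
# Illustrative sketch only: compare an uploaded dataset ("version 1") with
# its re-downloaded copy ("version 2") by hashing each file's bytes.
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(version_1: Path, version_2: Path) -> bool:
    """True only if the two files are byte-for-byte identical."""
    return sha256_of(version_1) == sha256_of(version_2)
```

A byte-level comparison is deliberately strict: it catches not only corrupted values but also silent reformatting, re-ordered columns, and lossy unit conversions, any of which can make two “copies” of a dataset incomparable.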

Another reason that focusing on the reuse of data and code is important is that it prompts biologists to make use of transformative technologies developed by computer scientists and trained software engineers. Even comparatively well-established and ground-level tools, like Jupyter notebooks and Docker, go a long way toward facilitating good practices and reproducibility. Good practices are a prerequisite for rigorously concatenating diverse biological datasets or borrowing powerful approaches from the emerging field of data science and using them responsibly. The prospect of “big data” is exciting, but it can only be as good as the scientific practices it's built on.
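One low-tech habit that tools like Jupyter and Docker encourage is recording the exact computational environment alongside an analysis, so that a reader can rebuild it later. As a minimal sketch (the function name and the choice of fields are my own, not a Cell Systems requirement), an analysis script might snapshot its interpreter and platform details using only the standard library:

```python
# Minimal sketch: record the computational environment alongside an analysis
# so the analysis can be rerun under the same conditions later.
import json
import platform
import sys


def environment_snapshot() -> dict:
    """Collect interpreter and platform details for a reproducibility record."""
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "executable": sys.executable,
    }


if __name__ == "__main__":
    # Print the record as JSON so it can be archived with the results.
    print(json.dumps(environment_snapshot(), indent=2))
```

A full solution would also pin package versions and data file digests; container images and notebook metadata automate much of this, which is exactly why such tools go a long way toward reproducibility.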

Finally, I’ll return to the thought puzzle that John Pham posed: “Is a study ever so thought-provoking that it needs to be published, even if the n is very small?” It surprises me that my answer is yes—but only if the data can be reused.

It is quite tempting to assume that studies relying on very small n’s aren’t legitimate. However, there are other ways to demonstrate that striking observations aren’t simply wishful thinking, and a rigorous framework for understanding today’s striking observations may be developed years from now. Science shouldn’t wait, but we should have the foresight to ensure that true and rigorous tests of future frameworks can be made. True tests will require interrogation—that is, reuse—of the original data. This is a third reason that ensuring reuse is important: it preserves today’s data for tomorrow’s ideas.

I look forward to annual conversations about whether it is time to require that code be FAIR or data be FAIR or both. Moreover, I’m grateful to the NIH Data Commons for providing principles that will help Cell Systems both facilitate progress and be a protective custodian of science.