By Yannick Wurm, Lecturer in Bioinformatics, Queen Mary University of London.

Biology is a data science

The dramatic plunge in DNA sequencing costs means that a single MSc or PhD student can now generate data that would have cost $15,000,000 only ten years ago. We are thus leaping from lab-notebook-scale science to research that requires extensive programming, statistics and high-performance computing.

This is exciting and empowering - in particular for small teams working on emerging model organisms that lacked genomic resources. But with great power comes great responsibility... and the risk that things could go wrong.

These risks are far greater for genome biologists than for, say, physicists or astronomers, who have strong traditions of working with large datasets. There are several reasons for this. Biologist researchers generally learn data handling skills ad hoc and have little opportunity to learn best practices. Biologist Principal Investigators - having never themselves handled huge datasets - have difficulty critically evaluating the data and approaches. New data are often messy, with no standard analysis approach; even so-called standard analysis methodologies generally remain young or approximate. Analyses that are intended to pull out interesting patterns (e.g. genome scans for positive selection, GO/gene set enrichment analyses) will also enrich for mistakes or biases in the data. Data generation protocols are immature and include hidden biases, leading to confounding factors (when the things you are comparing differ not only in the trait of interest but also in how they were prepared) or pseudoreplication (when one independent measurement is treated as multiple measurements).

Invisible mistakes can be costly

Crucially, data analysis problems can be invisible: the analysis runs, the results seem biologically meaningful and are wonderfully interpretable, but they may in fact be completely wrong.

Geoffrey Chang's story is an emblematic example. By the mid-2000s he was a young superstar professor crystallographer, having won prestigious awards and published high-profile papers providing 3D-structures of important proteins. For example:

Science (2001) Chang & Roth. Structure of MsbA from E. coli: a homolog of the multidrug resistance ATP binding cassette (ABC) transporters.

Journal of Molecular Biology (2003) Chang. Structure of MsbA from Vibrio cholera: a multidrug resistance ABC transporter homolog in a closed conformation.

Science (2005) Reyes & Chang. Structure of the ABC transporter MsbA in complex with ADP vanadate and lipopolysaccharide.

Science (2005) Pornillos et al. X-ray structure of the EmrE multidrug transporter in complex with a substrate. 310:1950-1953.

PNAS (2004) Ma & Chang. Structure of the multidrug resistance efflux transporter EmrE from Escherichia coli.

But in 2006, others independently obtained the 3D structure of an ortholog of one of those proteins. Surprisingly, the orthologous structure was essentially a mirror image of Geoffrey Chang's result. After rigorously double-checking his scripts, Geoffrey Chang realized that "an in-house data reduction program introduced a change in sign [...]". In other words, a simple +/- error led to plausible and highly publishable - but dramatically flawed - results. He retracted all five papers.

This was devastating for Geoffrey Chang, for his career, for the people working with him, for the hundreds of scientists who based follow-up analyses and experiments on the flawed 3D structures, and for the taxpayers or foundations funding the research. A small but costly mistake.

Approaches to limit the risks

A +/- sign mistake seems like it should be easily detectable. But how do you ensure that experiments yield correct results even when they require complicated data generation and complex analysis pipelines with interdependencies and sophisticated data structures?

We can take inspiration from software developers in internet startups: similarly to academic researchers, they form small teams of qualified people to do great things with new technologies. Their approaches for making robust software can help us make our research robust.

An important premise is that humans make mistakes. Thus almost any analysis code will include mistakes - at least initially. This includes Unix commands, R, perl/python/ruby/node scripts, etc. To increase the robustness of our analyses we must become better at detecting mistakes - and ensure that we make fewer mistakes in the first place.

More robust code

There are many approaches to creating more robust code. Every additional chunk of code can contain additional mistakes. The less code you write, the fewer mistakes you can make. To that end, it is good practice to reuse established code - your own, or that of others, libraries, etc.

Every subset, function and method of every piece of code should be tested on test data (including edge cases) to ensure that results are as expected (see unit and integration testing). It's advisable to write the test datasets and tests even before writing the analysis code. With continuous integration, tests are automatically rerun (almost) instantly whenever a change is made anywhere in the analysis. This helps detect errors rapidly, before the full analysis is run.
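To make this concrete, here is a minimal sketch of what testing on edge cases can look like. The function gc_content and its rules (ignore ambiguous bases, refuse empty input) are hypothetical choices made for illustration, not part of any particular library; the point is that the tests pin down the behaviour on awkward inputs before any real data are touched.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G/C among the unambiguous bases (A, C, G, T).

    Hypothetical helper for illustration: ambiguous characters such as N
    are ignored, and input without any unambiguous base is an error.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return gc / (gc + at)


def test_gc_content():
    # Ordinary cases.
    assert gc_content("GGCC") == 1.0
    assert gc_content("ATAT") == 0.0
    # Edge case: lowercase input must give the same answer.
    assert gc_content("acgt") == 0.5
    # Edge case: ambiguous bases must not dilute the estimate.
    assert gc_content("ACGTNNNN") == 0.5
    # Edge cases: empty or fully ambiguous input must fail loudly
    # rather than silently return 0 or NaN.
    for bad_input in ("", "NNNN"):
        try:
            gc_content(bad_input)
        except ValueError:
            pass
        else:
            raise AssertionError(f"expected ValueError for {bad_input!r}")


test_gc_content()
```

A test runner such as pytest would discover and run test_gc_content automatically; calling it directly, as done here, is enough for a small script. A continuous integration service would simply rerun the same tests on every change.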

Style guides define formatting and variable naming conventions (e.g. for ruby or R). This makes it easier for you to understand what you did when you have to return to an analysis years later (e.g. for paper revisions or a new collaboration), and for others to reuse and potentially improve your code. Tools can be used to automatically test whether your code is in line with the style guide (e.g. RLint, Rubocop, PyLint).

Rigorously tracking data and software versions, and sticking to them, reduces risks of unseen incompatibilities or inconsistencies. A standardised project structure can help.
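One lightweight way to track versions is to write a small manifest next to each analysis. The sketch below uses only the Python standard library; the function names and the manifest layout are assumptions made for this example, not an established standard. It records a checksum for each input file together with the interpreter and package versions in use.

```python
import hashlib
import json
import platform
from importlib import metadata


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def package_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def build_manifest(data_paths, package_names):
    """Collect data checksums and software versions into one dictionary."""
    return {
        "python": platform.python_version(),
        "packages": {name: package_version(name) for name in package_names},
        "data": {str(path): sha256_of_file(path) for path in data_paths},
    }


def write_manifest(manifest, out_path):
    """Save the manifest as JSON so it can be committed with the analysis."""
    with open(out_path, "w") as handle:
        json.dump(manifest, handle, indent=2, sort_keys=True)
```

Rerunning build_manifest later and comparing it with the stored file shows whether the input data or the software have changed since the original analysis.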

Code reviews are very important. Having someone else look over your code, either by showing it to them in person or by making it open source, will help you learn how to improve your code structure and detect mistakes. Ultimately, a review will ensure that the code will be reusable by yourself and by others. There are specialists who have years of experience in preventing and detecting mistakes in code or analyses. We should hire them. Having people independently reproduce analyses using independent laboratory and computational techniques on independently obtained samples might be the best validation overall.

This advice overlaps at least in part with what has been written elsewhere and with my coursework material. In my lab, we do our best to follow best practices both for the bioinformatics tools we develop and for our research on social evolution.

Additionally, the essentials of experimental design are long established: ensuring sufficient power, avoiding confounding factors and pseudoreplication (see above and elsewhere), and using appropriate statistics. Particular caution should be used with new technologies as they include sources of bias that may not be immediately obvious (e.g. Illumina lane, extraction order...).
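Some confounding can be spotted from the sample sheet alone, before any sequencing is done. As a rough sketch, the code below cross-tabulates a factor of interest against a batch variable and lists the batches that contain only one level of the factor. The column names ("condition", "lane") and the sample records are made up for illustration, and this simple check only catches complete confounding; partial imbalance needs proper statistical treatment.

```python
from collections import defaultdict


def crosstab(samples, factor, batch):
    """Count samples per (batch, factor level) from a list of dicts."""
    counts = defaultdict(lambda: defaultdict(int))
    for sample in samples:
        counts[sample[batch]][sample[factor]] += 1
    return {b: dict(levels) for b, levels in counts.items()}


def single_level_batches(samples, factor, batch):
    """Return the batches in which only one level of the factor occurs.

    If every batch is listed, batch effects cannot be separated from the
    effect of the factor.
    """
    table = crosstab(samples, factor, batch)
    return sorted(b for b, levels in table.items() if len(levels) == 1)


# Hypothetical design in which each condition was sequenced on its own lane.
confounded = [
    {"id": "s1", "condition": "control", "lane": "lane1"},
    {"id": "s2", "condition": "control", "lane": "lane1"},
    {"id": "s3", "condition": "treated", "lane": "lane2"},
    {"id": "s4", "condition": "treated", "lane": "lane2"},
]

# Hypothetical design in which both conditions are split across both lanes.
balanced = [
    {"id": "s1", "condition": "control", "lane": "lane1"},
    {"id": "s2", "condition": "control", "lane": "lane2"},
    {"id": "s3", "condition": "treated", "lane": "lane1"},
    {"id": "s4", "condition": "treated", "lane": "lane2"},
]

print(single_level_batches(confounded, "condition", "lane"))  # ['lane1', 'lane2']
print(single_level_batches(balanced, "condition", "lane"))    # []
```

In the first design, any difference between conditions could equally be a difference between lanes; the second design allows the two to be told apart.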

There is hope

There is no way around it: analysing large datasets is difficult.

When genomics projects involved tens of millions of dollars, much of this went to teams of dedicated data scientists, statisticians and bioinformaticians who could ensure data quality and analysis rigor. As sequencing has become cheaper, the challenges and costs have shifted even further towards data analysis. For large-scale human resequencing projects this is well understood. The challenges are even greater for organisms with few genomic resources. Surprisingly, many Principal Investigators, researchers and funders who focus on such organisms assume that individual researchers with little formal training will be able to perform all the necessary analyses. This is worrying, and suggests that those new to large datasets underestimate how easily mistakes with major negative consequences occur and go undetected. We may have to see additional publication retractions before awareness of the risks fully takes hold.

Thankfully, multiple initiatives are improving the visibility of the data challenges we face, in biology in particular (e.g. 1, 2, 3, 4, 5, 6). Awareness of the risks - and of the relative ease of implementing practices that will improve research robustness - needs to grow among funders, researchers, Principal Investigators, journal editors and reviewers. It will ultimately help more people to do better, more trustworthy science that will not need to be retracted.

Acknowledgements

This blog post came together thanks to the Institute's Collaborations workshop, Bosco K Ho's post on Geoffrey Chang, discussions in my lab and interactions with colleagues at the social insect genomics conference and the NESCent Genome Curation group. Yannick Wurm is funded by the Biotechnology and Biological Sciences Research Council [BB/K004204/1] and the Natural Environment Research Council [NE/L00626X/1, EOS Cloud], and is a fellow of the Software Sustainability Institute.