“We got so good at producing data,” says Nicole Wheeler, a data scientist at the Sanger Institute who looks at the genomes of pathogenic bacteria, “that we ended up with too much of it.” McVean agrees. “In Moore’s Law, the computing power you have doubles every 18 months,” he says. “The growth of biomedical data capture – through sequencing genomes, but also through medical imaging or digital pathology – is much faster than that. We’re super-Moore’s-Law-ing in biomedical data.”

In the early years of this century, it became impossible for biologists to check their data by hand. That meant they had to recruit, or become, data scientists.

“We reached a bottleneck a few years ago,” says Anne Corcoran. “We had lots of data, but we didn’t know what to do with it. So algorithms had to be invented on the fly, to deal with the data and maximise it,” she continues. “When you’re looking at single genes, or a few, you can do it manually, but when you’re looking at the expression of 20,000 genes, you can’t even do the statistics by yourself.”

Biologists – many of whom grew up, as Corcoran did, working at benches with glassware, not at desks with laptops – have had to learn to use these algorithms. “I think senior scientists are often intimidated by it,” she says, “and more reliant on their junior colleagues than they probably should be, or would like to admit that they are.”

She’s evolved a “working knowledge” of how these algorithms function, but admits that “it’s a slightly vulnerable period, where the people at the top don’t have the skills to check the work of the people beneath them”.

Wolf Reik, one of Corcoran’s colleagues at the Babraham Institute, who runs a research team looking at epigenetics, agrees. Older scientists have a completely different mindset, he says. “It’s quite funny – my staff in lab meetings think in terms of what the genome as a whole does. But I think about single genes and generalise from them – that’s how I learned to think.”

It’s important for people in his position, he says, to understand junior scientists’ work, and “most importantly develop an intuition about how to use the tools… because ultimately I put my name to the work”.

The younger scientists, on the other hand, have grown up with data. Some of them come from a quantitative background – Gerstung did a physics undergraduate degree – although that’s true of some group leaders as well, such as McVean. But others who came through a more biological route have ended up talking in terms of code. “I did biology as an undergrad, that’s my domain knowledge,” says Na Cai, a postdoctoral researcher at the Sanger Institute who studies how genotypes relate to various human traits.

“Now I’m doing statistical analysis every day. It’s been like learning another language, or several,” she says. “I had to switch my brain from thinking in terms of biochemical pathways and flowcharts to a more structured kind of thinking in terms of code.”

The senior scientists she works with have all been “quite good at keeping up with the latest developments,” she says. “They might not be able to write the code, but they understand what the analysis does.”

Wheeler, a colleague of Cai’s, also came through the biology route and ended up coding. “I don’t have a traditional software-engineering background,” she says. “I learned to code on the side, during my PhD. [My coding] isn’t the most efficient or glamorous, but it’s about seeing what you have to do computationally and making it happen.”

In response to these needs, undergraduate degrees have been changing over the last few years. Newcastle University, for instance, now has a bioinformatics module in its biology undergraduate course. Reading’s final-year research projects involve computational biology, although its earlier optional computing modules have a low take-up, so students end up learning the skills at the last minute in their final year. Imperial College London, which already has bioinformatics courses, is planning to add programming for first- and second-years. “I think there’s a recognition that biology involves more data than we used to have,” says Wheeler, “so people need to have the skills to process it.”

But the change is slow, and sometimes opposed by students, not all of whom got into biology to code. “I’d say some undergrad courses are catching up,” says Corcoran. “But in general they have not, as exemplified by the proliferation of post-degree Master’s courses teaching these skills.”

The change is necessary, though. Even the most wet-lab-oriented scientists interviewed said they spend less than 50 per cent of their time doing experiments. Some said it was as little as 10 per cent or, in Cai’s case, none at all, since she became a full-time bioinformatician.

The shift towards data-driven work, says Wheeler, can be seen as a move from science that tests hypotheses to science that generates them. One scientist, who preferred not to put their name to the concern, worried that the shift had reduced the creativity in science, but according to Wheeler that’s not the case. “It’s moved the creativity around,” she says. “In some ways there’s more room for creativity. You can really try out some crazy ideas at relatively low cost.”

This has other advantages. “You can become attached to hypotheses,” says Matt Bawn, a bioinformatician at the Earlham Institute, a computational biology research centre in Norfolk, UK. “It’s better to be a disinterested observer with no preconceptions, to look at the blank canvas and let the picture emerge.”

But the greatest benefit is that data-driven studies are throwing up fascinating new findings all the time, in complex areas that were previously impossible to study.