The Darwin Tree of Life (DToL) project is part of a global initiative to sequence all complex life on Earth, a mission known as the Earth BioGenome Project. The DToL project will sequence the genomes of all 60,000 eukaryotic organisms in the British Isles to better understand how DNA translates to the diversity of life.



The project, led by the Wellcome Sanger Institute, is a collaborative effort that will bring together a variety of institutions, funding bodies, universities, museums and horticultural organizations. Once available, the data gathered from the project will be made open for researchers across the globe to access and utilize in their own research.



Technology Networks recently spoke with Fergal Martin, Vertebrate Annotation Coordinator at EMBL’s European Bioinformatics Institute (EMBL-EBI) to learn more about the development of the project, its aims and the challenges that lie in sequencing 60,000 organisms.



Molly Campbell (MC): What is the "sixth great extinction"?



Fergal Martin (FM): This is an ongoing pattern of extinction across a huge range of species that can be linked to human activity. Things like large-scale deforestation, damage to coral reefs, increased levels of pollution, the effect of humans on climate change and the world in general are accelerating the rate at which other species become extinct. Other mass extinctions have generally been linked to profound geological events, whereas this can be linked primarily to the direct and indirect actions of a single species, us.



MC: Please can you tell us about the development of the Darwin Tree of Life project? Who is involved?



FM: The DToL project is led by the Wellcome Sanger Institute. Our role at the EMBL-EBI is twofold. Firstly, we want to make the genome sequences that result from the project freely available through our database, the European Nucleotide Archive (ENA). The ENA will hold a permanent record of the data so that anyone will be able to find and analyze these genomes from the moment they are completed to many years from now.



Our second major contribution to the project is through Ensembl, our online public resource for people who wish to analyze whole genomes. Ensembl provides a collection of analyses on the raw genome sequences (termed "annotations"), along with methods for visualizing the data through our genome browser and programmatic access to the data through our application programming interface.
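
As an illustration of what programmatic access can look like, here is a minimal sketch, not taken from the interview. It assumes the public Ensembl REST service at rest.ensembl.org and its lookup-by-symbol endpoint; the function names `lookup_url` and `lookup_gene` are hypothetical, and the current REST documentation should be checked before relying on the exact path.

```python
import json
import urllib.parse
import urllib.request

# Assumed address of the public Ensembl REST server.
ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_url(species, symbol):
    """Build the URL for a gene lookup by species name and gene symbol."""
    path = "/lookup/symbol/{}/{}".format(
        urllib.parse.quote(species), urllib.parse.quote(symbol)
    )
    return ENSEMBL_REST + path + "?content-type=application/json"


def lookup_gene(species, symbol):
    """Fetch the gene record as a dict (requires network access)."""
    with urllib.request.urlopen(lookup_url(species, symbol), timeout=30) as resp:
        return json.load(resp)
```

Calling `lookup_gene("homo_sapiens", "BRCA2")` would, under these assumptions, return a dictionary describing the gene, such as its identifier and genomic coordinates.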



As part of the annotation process, Ensembl computes crucial information such as where the genes are located, what their structures are and how the genome sequences of the different species compare to one another. These analyses help shortcut downstream science for researchers as we can run analyses in the space of a week that would take smaller research groups months or even years to complete. That way we can empower the research community to quickly start asking complex scientific questions based on the data.



In addition to the EMBL-EBI and the Wellcome Sanger Institute, other partners on the project include leading research organizations and universities (the Earlham Institute, Marine Biological Association, Plymouth, University of Cambridge, University of Edinburgh, University of Exeter, University of Oxford), national collections (Natural History Museum, Royal Botanic Gardens, Kew, Royal Botanic Gardens, Edinburgh), outreach and engagement organizations (Connecting Science) and funding agencies (Wellcome, BBSRC). It really is an impressive collaborative effort!



MC: What are the key aims of the project?



FM: The goal of the project is to reconstruct the genomes of all 60,000 eukaryotic species in Britain and Ireland to make these data freely available to anyone with an interest, from the general public, to citizen scientists, to evolutionary biologists. This is a key part of a broader global effort to sequence the genomes of all life on the planet and will form an unparalleled resource for science.



To ensure that the data are made publicly available and rapidly annotated for genomic features, both the ENA and Ensembl are re-engineering their underlying processes to be as efficient and scalable as possible. To help us reach the overall goal of the project, we aim to create a smooth and efficient data flow over the next couple of years where the data producers can instantly pass the raw data to the ENA and Ensembl. From there, the genome data will be annotated and released to the community in as short a timeframe as possible.



MC: Which technologies have enabled this project to become a reality?



FM: Many factors have come together to make the project possible, including dramatic improvements in genome sequencing technologies, large reductions in the cost of sequencing, new algorithms that are significantly more efficient and the availability of effectively unlimited compute via the cloud.



At EMBL-EBI we have been working on improving our infrastructure in anticipation of projects of the scale of DToL for several years. A good example of this is the Ensembl gene annotation pipeline, which takes the DNA sequence of a species and then calculates the location and structure of the genes hidden within it. Not that long ago, finding the genes was an intensively manual process. It would take someone working full time for three to six months to find the genes in a single species. Now it’s possible for a single person to spend five minutes configuring the annotation pipeline for 10 species and get the results returned to them a couple of weeks later.



To achieve such a fundamental shift in throughput we had to rebuild the entirety of the pipeline. Each component was analyzed in terms of how useful it was, how much it could be parallelized, how to improve its error tolerance and how best to deploy the associated work onto a compute cluster.
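
The ideas of parallelization and error tolerance can be illustrated with a minimal sketch. This is not Ensembl's pipeline code: every name here (`run_with_retries`, `annotate_all`, `toy_annotate`) is hypothetical, threads on one machine stand in for a compute cluster, and the "annotation" step is a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_retries(task, species, max_attempts=3):
    """Run task(species), retrying on failure; return (species, result, error)."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return species, task(species), None
        except Exception as exc:  # a real pipeline would catch narrower errors
            last_error = exc
    return species, None, last_error


def annotate_all(task, species_list, workers=4):
    """Apply task to every species in parallel; one failure does not stop the rest."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = pool.map(lambda s: run_with_retries(task, s), species_list)
        for species, result, error in outcomes:
            if error is None:
                results[species] = result
            else:
                failures[species] = str(error)
    return results, failures


def toy_annotate(species):
    """Hypothetical stand-in for a gene annotation step."""
    if species == "broken_assembly":
        raise ValueError("cannot annotate " + species)
    return len(species)  # placeholder for a real result such as a gene count
```

Running `annotate_all(toy_annotate, ["otter", "oak", "broken_assembly"])` returns results for the two species that succeed and records the third as a failure, so a single problematic genome does not block the batch.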



If we wanted to fire on all cylinders at this point, we actually have more capacity for annotating the genes in these genomes than the rate at which they’re currently being produced. That being said, we also recognize there is plenty of room for improvement and just like other parts of the DToL chain, we will need to keep evolving and keep optimizing and automating to reach the end goals of the project.



MC: The project will collect, identify, extract and sequence DNA and RNA from approximately 60,000 species within Britain and Ireland. What challenges will you encounter in this process?



FM: There are many challenges that will arise over the course of the project. These include everything from how to sample, extract and track the DNA, to how to efficiently analyze the data, to how to visualize and present the results back to the public.



For EMBL-EBI, our biggest challenge comes from the data analysis side of things. As these genomes are produced, how do we ensure that we are annotating the genomic features in a way that is as accurate and efficient as possible? That in itself is a challenge. It’s easy to do a bad job quickly, but that is not very helpful to the research community if the results are wrong and need to be recalculated. Equally, producing a perfect result is also not useful if it takes us a year to complete per species. So, we are always looking at how to best balance speed versus accuracy.



Another challenge is keeping track of all the developments in what is a rapidly changing field. In addition to updating and improving our own software and pipeline we also invest time into analyzing third party solutions to see if they are suitable for integration into our processes.



A final major aspect of what we do is working out how to optimize our data analysis code for different species. The underlying DNA of different species can vary in surprising ways. For example, salamanders can have over 10 times as much DNA as humans, bird genomes have very little repetitive DNA while mammals have lots of repeats and wheat has many copies of its chromosomes compared to the two copies seen in humans. All these differences, in terms of the underlying data, could potentially break our pipelines or make them run much less efficiently. To counter this, we are always trying to better understand the underlying biology in order to make our software and pipelines more robust.



MC: The data from the project will be made openly available for reuse in biological research, conservation, biotechnology and beyond. What applications do you hope the data will have in these spaces?



FM: I think before dipping into the realms of potential applications it’s important to appreciate just how large a gap in our knowledge will be filled by a project like this. To date, there are approximately ten thousand eukaryotic genomes that have been digitally reconstructed and deposited in the public archives. These vary massively in quality, with many of the older genomes being effectively unusable for any sort of detailed scientific analysis. If we were to only consider high-quality existing genomes, we’re definitely talking about generating at least an order of magnitude more high-quality genomes than have been created over the past 20 or so years. That alone will fundamentally change how we understand multiple fields in the biological sciences.



In terms of potential applications, there are many we know about and probably many more that we won’t realize until the project is well underway. From an ecological standpoint, we will be able to sequence and analyze all species in Britain and Ireland. As a result, we will have an unparalleled window into all ecosystems across both.



A great example of this is Wytham Woods. This has been maintained and studied by the University of Oxford since 1942 and is home to over 500 species of plants and 800 species of moths and butterflies (among many other things). There is already a vast ecological record for Wytham Woods and, as a result of DToL, we will be able to pair this record with a complete eukaryotic genomic record for the entire ecosystem. Something like this simply has never been done before. We’ll be able to really get an insight into the dynamics of an ecosystem on a molecular level. This could help us understand any genomic mechanisms that are linked to whether a species is flourishing or floundering and ultimately lead to decisions that help improve conservation practices both on the level of individual species and the level of the ecosystem itself.



Two other fields that will greatly benefit from these data are comparative genomics and evolutionary biology. The more species we have high quality genome sequences for, the more power we have in defining the key differences between these species. If we see that a species has some novel ability that we don’t understand, being able to compare it to many other species can really help isolate the parts of the genome that give rise to that novelty. Similarly, if we want to reconstruct the evolutionary history of genes or genomes, having data for as many species as possible allows us to better understand how things have evolved, what they looked like in extinct ancestral species and even how they might evolve in future.



The more insight we gain in this regard the better we will be able to understand how subtle differences lead to biological outcomes. This will be highly valuable to industries such as pharmaceuticals and biotechnology. One key question that often arises is how applicable experiments on a non-human model organism are when translated to humans. The more we can understand the pattern of differences between a model organism and ourselves, the better we can model how these differences would affect any results from the experiments. For things like livestock and crops, we will potentially be able to gain insights into genes affecting yields and resistances that would help ensure future food security. It is likely that breakthroughs in agricultural genomics will save as many if not more lives than genomic medicine in the future.



As a closing note on the applications, when the Human Genome Project (HGP) was launched, it would be fair to say that nobody was sure exactly what the ultimate overall result would be in terms of translating what was a very large and expensive scientific endeavor into something with tangible real-world applications. Now we can say that the HGP essentially transformed our understanding of human health. I have no doubt that the DToL project, along with other global sequencing efforts, will have an even more profound transformative effect on our understanding of life on Earth. At EMBL-EBI we want to help ensure that these data are rapidly processed and presented to the research community so that we can see real-world applications appear as quickly as possible.



Fergal Martin, Vertebrate Annotation Coordinator at the EMBL-EBI, was speaking with Molly Campbell, Science Writer, Technology Networks.