Understanding the origin and evolution of the eukaryotic cell and the full diversity of eukaryotes is relevant to many biological disciplines. However, our current understanding of eukaryotic genomes is extremely biased, leading to a skewed view of eukaryotic biology. We argue that a phylogeny-driven initiative to cover the full eukaryotic diversity is needed to overcome this bias. We encourage the community: (i) to sequence a representative of the neglected groups available at public culture collections, (ii) to increase our culturing efforts, and (iii) to embrace single cell genomics to access organisms refractory to propagation in culture. We hope that the community will welcome this proposal, explore the approaches suggested, and join efforts to sequence the full diversity of eukaryotes.

Genome sequencing is a powerful tool that helps us to understand the complexity of eukaryotes and their evolutionary history. However, there is a significant bias in eukaryotic genomics that impoverishes our understanding of the diversity of eukaryotes, and leads to skewed views of what eukaryotes even are, as well as their role in the environment. This bias is simple and widely recognized: most genomics focuses on multicellular eukaryotes and their parasites. The problem is not exclusive to eukaryotes. The launching of the so-called ‘Genomic Encyclopedia of Bacteria and Archaea’ [] has begun to reverse a similar bias within prokaryotes, but there is currently no equivalent for eukaryotes. Targeted efforts have recently been initiated to increase the breadth of our genomic knowledge for several specific eukaryotic groups, but again these tend to focus on animals [], plants [], fungi [], their parasites [], or opisthokont relatives of animals and fungi []. Unfortunately, a phylogeny-driven initiative to sequence eukaryotic genomes specifically to cover the breadth of their diversity is lacking. The tools already exist to overcome these biases and fill in the eukaryotic tree, and we therefore hope that researchers will be inspired to explore these tools and embrace the prospect of working towards a community-driven initiative to sequence the full diversity of eukaryotes.

Eukaryotes are the most complex of the three domains of life. The origin of eukaryotic cells and their complexity remains one of the longest-debated questions in biology, famously referred to by Roger Stanier as the ‘greatest single evolutionary discontinuity’ in life []. Thus, understanding how this complex cell originated and how it evolved into the diversity of forms we see today is relevant to all biological disciplines including cell biology, evolutionary biology, ecology, genetics, and biomedical research. Progress in this area relies heavily on both genome data from extant organisms and on an understanding of their phylogenetic relationships.

The ‘multicellular bias’ is the most serious bias, but it is not the only one. The eukaryotic groups with the most species deposited in culture collections and/or targeted by genome projects are also biased towards those containing mainly phototrophic species or those that are parasitic and/or economically important (Figure 2). For example, both Archaeplastida and Stramenopila have more cultured species than other eukaryotes as a result of a long phycological tradition and well-provisioned phycological culture collections [], and also because phototrophs are easier to maintain in culture than heterotrophs. In both cases this translates to a comparatively large number of genome projects: several genomic studies target photosynthetic stramenopiles [] and, owing to their economic relevance in agriculture, the peronosporomycetes []. In addition, the apicomplexans within the Alveolata are relatively well studied at the genomic level because they include important human and animal parasites [] such as Plasmodium and Toxoplasma. If we look at the number of sequenced strains rather than species, these biases increase further (Figure 3): a significant proportion of the retrieved cultures and genomes correspond to different strains of the same dominant species. We therefore have a pool of species that have been redundantly cultured and sequenced.

^a Some strains are not described at the species level and have been grouped by genus; they may therefore represent more than a single species.

Figure 3. Eukaryotic diversity distribution among the analyzed databases. (A) The 25 species^a with the most strains represented in the analyzed culture collections. (B) The 25 species^a with the most ongoing genome projects. (C) The 25 most abundant SAG OTU97 in the analyzed dataset. Abbreviations: MAST, marine stramenopile; OTU97, operational taxonomic unit (>97% sequence identity); SAG, single amplified genome.

Figure 2. Relative representation of eukaryotic supergroup diversity in different databases (excluding metazoans, fungi, and land plants). (A) Percentage of described species per eukaryotic supergroup according to the CBOL ProWG. (B) Percentage of 18S rDNA OTU97 per eukaryotic supergroup in GenBank. (C) Percentage of environmental 18S rDNA OTU97 per eukaryotic supergroup. (D) Percentage of species with a cultured strain in any of the analyzed culture collections. Culture data are from five large protist culture collections (n = 3084), including the American Type Culture Collection and the Culture Collection of Algae and Protozoa. (E) Relative numbers of species with a genome project completed or in progress according to GOLD, per eukaryotic group. Data from panels A–C are from []. Abbreviation: OTU97, operational taxonomic unit (>97% sequence identity).

It is not surprising that the first and main bias in the study of eukaryotes arises from our anthropocentric view of life. More than 96% of the described eukaryotic species are either Metazoa (animals), Fungi, or Embryophyta (land plants) [] (Figure 1A), which we call the ‘big three’ of multicellular organisms (even though the Fungi also include unicellular members such as the yeasts). However, these lineages represent only 62% of the 18S rDNA (see Glossary) sequences in GenBank (Figure 1B), which is of course a biased sample, and only 23% of all operational taxonomic units (OTUs) in environmental surveys (Figure 1C). This bias is not new; research has historically focused on these three paradigmatic eukaryotic kingdoms, which are indeed important, but are also simply more conspicuous and familiar to us. In genomics this bias is amplified considerably: 85% of the genome projects completed or in progress, as listed in the Genomes OnLine Database (GOLD) [], belong to the ‘big three’ (Figure 1D). Moreover, even within these groups there are biases. For example, many diverse invertebrate groups suffer from a lack of genomic data as keenly as do microbial groups. This bodes poorly if we aim to understand and appreciate the complete eukaryotic tree of life. If we do not change this trend, we risk neglecting the majority of eukaryotic diversity in future genomic or metagenomic-based ecological and evolutionary studies, which would leave us with a far from realistic picture.

Figure 1. Relative representation of metazoans, fungi, and land plants versus all the other eukaryotes in different databases. (A) Relative numbers of described species according to the CBOL ProWG (n = 2 001 573). (B) Relative numbers of 18S rDNA OTU97 in GenBank (n = 22 475). (C) Relative number of environmental 18S rDNA OTU97 in GenBank (n = 1165). (D) Relative number of species with a genome project completed or in progress according to GOLD, per eukaryotic group (n = 1758). Data in panels A–C are from []. Abbreviation: OTU97, operational taxonomic unit (>97% sequence identity).

Although we lack an incontrovertible, detailed phylogenetic tree of the eukaryotes, a consensus tree is emerging thanks to molecular phylogenies []. The five monophyletic supergroups of eukaryotes are summarized in Box 1 . The distribution of cultured and sequenced species over the tree provides a broad overview of our current knowledge of eukaryotic diversity ( Figure 4 ). However, a quarter of the represented lineages lack even a single culture in any of the analyzed culture collections and, notably, 51% of them lack a genome. The most important gaps are within the Rhizaria, the Amoebozoa, and the Stramenopila, where many lineages are still underrepresented. However, many other lineages that lack any representative genome sequence are also found in the relatively well-described Opisthokonta and Excavata groups. This map is likely to be incomplete because several genome projects may not be reflected in the GOLD database, and because many cultures are not deposited in culture collections, but the overall trends probably afford an accurate representation of the biases we currently face.

Figure 4. The tree of eukaryotes, showing the distribution of current effort on culturing, genomics, and environmental single amplified genome (SAG) genomics for the main protistan lineages. Schematic tree of eukaryotes representing the major lineages. Colored branches represent the seven main eukaryotic supergroups, whereas grey branches are phylogenetically contentious taxa. The sizes of the dots indicate the proportion of species/OTU97 in each database. Culture data are from the analyzed publicly available protist culture collections (n = 3084). Genome data were extracted from the Genomes OnLine Database (GOLD) (n = 258). SAGs correspond to those retrieved during the Tara Oceans cruise (n = 158) (M.E.S., unpublished data). Taxonomic annotation of all datasets is based on []. Abbreviation: OTU97, operational taxonomic unit (>97% sequence identity).

Rhizaria: this is a diverse group of mostly heterotrophic unicellular eukaryotes including both amoeboid and flagellate forms []. Two iconic protist groups, Haeckel's Radiolaria and the Foraminifera, are members of the Rhizaria. Foraminifera have been very useful in paleoclimatology and paleoceanography because their external shells can be detected in the fossil record.

Alveolata: a widespread group of unicellular eukaryotes that have adopted diverse life strategies such as predation, photoautotrophy, and intracellular parasitism []. They include some environmentally relevant groups such as the Syndiniales, the Dinoflagellata, and the ciliates (Ciliophora), as well as the Apicomplexa, a group that contains notorious parasites such as Plasmodium sp. (the agent of malaria), Toxoplasma sp. (the agent of toxoplasmosis), and Cryptosporidium sp.

Stramenopila: also known as heterokonts, the stramenopiles include a wide range of ubiquitous phototrophic and heterotrophic organisms []. Most are unicellular flagellates but there are also some multicellular organisms, such as the giant kelps. Other relevant members of the Stramenopila are the diatoms (algae contained within a silica cell wall), the chrysophytes (abundant in freshwater environments), the MAST (marine stramenopile) groups (the most abundant microbial predators of the ocean), and plant parasites such as the Peronosporomycetes.

SAR (Stramenopila, Alveolata, and Rhizaria): three groups that have historically been studied separately. Phylogenetic analyses, however, have shown that these three groups share a common ancestor, forming a supergroup known as SAR []. This eukaryotic assemblage comprises the highest diversity within the protists.

Opisthokonta: the opisthokonts include two of the best-studied kingdoms of life: the Metazoa (animals) and the Fungi. Recent phylogenetic and phylogenomic analyses have shown that the Opisthokonta also include several unicellular lineages []. These include the Choanoflagellata (the closest unicellular relatives of the animals) and the Ichthyosporea (which include several fish parasites that have a negative impact on aquaculture).

Excavata: the group Excavata was proposed on the basis of shared morphological characters [], and was later confirmed through phylogenomic analyses []. Most members of this group are heterotrophic organisms, among them some well-known human parasites such as Trichomonas vaginalis (the agent of trichomoniasis) and Giardia lamblia (the agent of giardiasis), as well as animal parasites such as Leishmania sp. (the agent of leishmaniasis), Trypanosoma brucei, and Trypanosoma cruzi (the agents of sleeping sickness and Chagas disease, respectively).

Archaeplastida: also known as ‘the green lineage’ or Viridiplantae, this group comprises the green algae and the land plants. The Archaeplastida is one of the major groups of oxygenic photosynthetic eukaryotes []. Green algae are diverse and ubiquitous in aquatic habitats. The land plants are probably the dominant primary producers in terrestrial ecosystems. Both green algae and land plants have historically played a central role in the global ecosystem.

Amoebozoa: this group consists of amoeboid organisms, most of them possessing a relatively simple life cycle and limited morphological features, as well as a few flagellated organisms []. They are common free-living protists inhabiting marine, freshwater, and terrestrial environments. Some well-known amoebozoans include the causative agent of amoebiasis (Entamoeba histolytica) and Dictyostelium sp., a model organism used in the study of the origin of multicellularity.

Thanks to molecular phylogenetics, ultrastructural analyses, and the efforts of many researchers, our understanding of the tree of eukaryotes has advanced significantly in recent years. According to the most recent consensus taxonomy [], the eukaryotes can be divided into five monophyletic supergroups. Here we introduce these supergroups, detailing some specific features of each.

Given the potential of SAGs to further improve our understanding of eukaryotic diversity, an important question is whether high-quality genome data can be acquired from SAGs []. Currently, outcomes vary widely owing to the bias introduced by the whole-genome amplification procedure. The completeness of the retrieved genome ranges from less than 10% to a complete genome, and depends on the intrinsic properties of the cell studied as well as on the amplification method []. At present, culture certainly provides a more reliable way to obtain a high-quality genome, and a species in culture also gives researchers a direct window onto the biology of the organism and onto post-genomic research. Autecological experiments, ultrastructural analyses, and even functional experiments can all be performed in culture, thereby providing a deeper context for the genome and the organism. However, in light of the lack of data we currently face, and the unlikelihood that a significant increase in resources for cultivation will soon appear, we argue strongly that genomic sequencing of SAGs is an important complement to culture-based research in furthering our understanding of eukaryotic diversity.
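One common way to put a number on genome completeness is to ask what fraction of a set of conserved marker genes is recovered in an assembly. The minimal sketch below illustrates that idea only; the marker names, the example hits, and the `estimate_completeness` helper are hypothetical, and this is not necessarily how completeness was measured in the studies cited above.

```python
from typing import Iterable, Set


def estimate_completeness(expected_markers: Set[str],
                          found_markers: Iterable[str]) -> float:
    """Return the percentage of expected marker genes recovered in an assembly."""
    if not expected_markers:
        raise ValueError("expected_markers must not be empty")
    # Duplicated hits and genes outside the marker set do not count.
    recovered = expected_markers & set(found_markers)
    return 100.0 * len(recovered) / len(expected_markers)


# Hypothetical marker set (20 assumed conserved genes) and hypothetical
# annotations from one SAG assembly; neither comes from real data.
EXPECTED = {"marker_%02d" % i for i in range(1, 21)}
sag_hits = ["marker_01", "marker_07", "marker_07", "unrelated_gene"]

completeness = estimate_completeness(EXPECTED, sag_hits)
print(f"{completeness:.1f}% complete")
```

A marker-based proxy of this kind says nothing about assembly contiguity or contamination, so it would complement, not replace, the richer context that a culture provides.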

A complementary option to increase the breadth of eukaryotic genomics is to use single cell genomics (SCG) []. Although the technology is still developing, this is probably the best way we have today to retrieve genomic information from abundant, ecologically relevant microbial eukaryotes that are refractory to culture. For example, the single amplified genomes (SAGs) obtained from different global oceanic sites during the Tara Oceans cruise (M.E.S., unpublished data) fill reasonably well the culture and genomic gaps that affect some of the most abundant groups in the oceans (Figure 4). In particular, a significant fraction of the SAGs correspond to uncultured organisms such as the marine stramenopiles MAST-4 and MAST-7 [], chrysophyte groups H and G [], and the Syndiniales []. Importantly, sequence tagging shows that only 10% of the SAGs are present in any culture collection, and only 2.5% have an ongoing genome project (based on cultured taxa). It is worth noting that the SAGs available so far represent only marine microeukaryotes. Thus, although the analyzed SAGs certainly overcome part of the bias, they do not cover the full diversity of eukaryotes.

Although there may not be bad choices when selecting organisms for genome sequencing, there are certainly better choices if we aim to understand eukaryotic diversity. We argue that at least some of the effort should be specifically directed towards filling the gaps in the eukaryotic tree of life, focusing on those lineages that occupy key phylogenetic positions. How can that be done? One option is to sequence more cultured organisms. In fact, 95% of protist species in culture are not yet targeted for a genome project (Figure S1 in the supplementary data online). Thus, by obtaining the genomes of some available cultured lineages that have not yet been sequenced, we could easily fill some of the important gaps in the tree, including some heterotrophic Stramenopila, Amoebozoa, and Rhizaria. However, restricting the selection to species that are available in culture itself introduces a strong bias, and most lineages remain without any cultured representative []. Publicly accessible protist collections [such as the American Type Culture Collection (ATCC) and the Culture Collection of Algae and Protozoa (CCAP); summarized in Box 2] are considerably smaller than their bacterial or fungal counterparts. One reason is that, in contrast to the situation for bacteria [], there is no requirement for the systematic deposit of newly described taxa. Notably, and unfortunately, half of the species with genome projects completed or in progress are not deposited in any of the five analyzed publicly accessible culture collections. To avoid more ‘lost cultures’ in the future, the community should establish and adopt standard procedures, similar to those used in bacteriology, for releasing cultures to protist collections. The whole community will benefit from this in the short and long term. In addition, there is an inherent technical bias in culturing, as well as a bias in culturing efforts. For example, phototrophic representatives of Stramenopila and Alveolata tend to have more cultures available than their heterotrophic counterparts (Figure 4). Indeed, 70.6% of the most common protist strains present in culture collections are phototrophic organisms (Figure 3). There is therefore a need both to increase the culturing effort across a wider variety of environments and to develop novel and alternative culture techniques to retrieve refractory organisms [], both of which take time, energy, and funding. Importantly, culture collections will need to be supported so that they can take on the challenge of maintaining more cultures and widen their scope to include more difficult organisms that tend to be excluded from existing collections, in particular heterotrophs.

Sammlung von Algenkulturen (SAG): the SAG is a non-profit organization maintained by the University of Göttingen ( http://www.epsag.uni-goettingen.de ). The collection primarily contains microscopic algae and cyanobacteria from freshwater or terrestrial habitats, but there are also some marine algae. With more than 2400 strains, the SAG is among the three largest culture collections of algae in the world. Prof. Pringsheim, who also laid the foundations of the CCAP, founded the SAG: it was initiated in 1953 when he returned to Göttingen after his time as a refugee scientist in England. Since then the Pringsheim algal collection has grown and evolved into the service collection we know today.

Roscoff Culture Collection (RCC): this collection ( http://www.roscoff-culture-collection.org ) is located at the Station Biologique de Roscoff and is closely linked to the Oceanic Plankton group of that institution. It maintains more than 3000 strains of marine phytoplankton, especially picoplankton and picoeukaryotes from various oceanic regions. Most of the strains are available for distribution, whereas others are in the process of being described.

National Center for Marine Algae and Microbiota (NCMA): this integrated collection of marine algae, protozoa, bacteria, archaea, and viruses was named a National Center and Facility by the US Congress in 1992. The NCMA ( http://ncma.bigelow.org ) originated from private culture collections established by Dr Luigi Provasoli at Yale University and Dr Robert R.L. Guillard at Woods Hole Oceanographic Institution. When it was established in the 1980s it was known as the Culture Collection of Marine Phytoplankton (CCMP) and provided the community with algal cultures of scientific interest or for aquaculture.

Culture Collection of Algae and Protozoa (CCAP): a culture collection funded by the UK Natural Environment Research Council (NERC) that contains algae and protozoa from both freshwater and marine environments. The foundations of CCAP ( http://www.ccap.ac.uk ) were laid by Prof. Ernst Georg Pringsheim and his collaborators with the cultures they established at the Botanical Institute of the German University of Prague in the 1920s. Pringsheim moved to England, where the collection was expanded and then taken over by Cambridge University in 1947. In 1970 these cultures formed the basis of the Culture Centre of Algae and Protozoa, which later became the modern CCAP.

American Type Culture Collection (ATCC): a private, non-profit biological resource center established in 1925 with the aim of creating a central collection to supply microorganisms to scientists all over the world ( http://www.atcc.org ). ATCC collections include a great variety of biological materials such as cell lines, molecular genomics tools, microorganisms, and bioproducts. The microorganism collection includes more than 18 000 strains of bacteria, 3000 different types of viruses, over 49 000 yeast and fungal strains, and 2000 strains of protists.

Culture collections are cornerstones for the development of all microbiological disciplines. Cultures are key to the establishment of model organisms and, therefore, to a better understanding of their biology. Below we describe some of the major protistan collections.

Make the tree thrive: a call to action

Box 3. The rectification of names and our understanding of eukaryotic biology

A better understanding of eukaryotic diversity has a deep impact on several biological disciplines, such as medicine, agriculture, evolution, and ecology. A large body of research backs up this statement. Below we mention a few examples that illustrate the power of having a better understanding of the diversity, biology, and evolution of eukaryotes.

Medicine has greatly benefited from evolutionary studies in eukaryotes. Studies on the genome and biology of close relatives of parasites have provided unique insights into analogous molecular mechanisms involved in the clinical effects of parasites. A proper taxonomic assignment of pathogenic organisms has also been key to fighting them. A good example is Pneumocystis, an opportunistic pathogen affecting immunocompromised patients, predominantly those infected with HIV. Pneumocystis was considered for years to be a protozoan of unclear taxonomic assignment. It was not until 1988, when molecular data allowed researchers to properly assign Pneumocystis to the fungi, that adequate treatments based on antifungal agents could be used []. The opposite situation occurred with the fungus-like Phytophthora, the causative agent of potato blight. New molecular data showed that Phytophthora are peronosporomycetes (stramenopiles) within the order Peronosporales, and not fungi as previously thought, thus explaining why fungicides were ineffective []. Further knowledge of its genome provided insights not only into its evolution but also into the potential reasons for the speed with which it develops resistant forms [].

It is, however, in evolutionary studies that the impact of a broad taxon sampling of eukaryotes is most apparent. Indeed, looking back in time, it is clear that the absence of key taxa in evolutionary analyses led to hypotheses that are now known to be in error. To elucidate which genomic or morphological features have been conserved, which were ancestral to eukaryotes, and which are novel, one needs to perform comparative analyses that include key taxa from each major eukaryotic lineage. For example, instances of lineage-specific gene loss in Choanoflagellatea and Fungi, and the absence of representative taxa from non-parasitic Excavata and Rhizaria, confounded attempts to reconstruct accurately the gene content of the last unicellular ancestor of metazoans [] and the last eukaryotic common ancestor [], respectively.

Ecology is also influenced by a better understanding of eukaryote biology. The global ecological cycles are deeply influenced by several groups of eukaryotes, most of them unicellular. We have a good understanding of phototrophic eukaryotes that, together with the Cyanobacteria, drive most of the carbon cycle and the oxygen production on Earth. Nevertheless, our understanding of heterotrophic protists remains insufficient. For example, both MASTs and the Syndiniales are extremely abundant in the oceans [] and are therefore surely influential in global processes. However, we cannot understand their role if we lack information on their metabolic pathways or biology, something we can only obtain from genomic data.

Genome sequences have cast invaluable light on the classification of organisms, notably in many cases where particular species were misclassified (Box 3). However, the available genome sequences of eukaryotes do not inform us only about the biology of the particular organism. They also make significant contributions to our understanding of eukaryotic biology in general, and to large-scale evolutionary and ecological processes. Nevertheless, for this potential to be completely fulfilled we must sample broadly, and there are currently important gaps in the diversity of eukaryotic genome sequences that undermine our efforts to capitalize on this potential. Understanding the whole of eukaryotic diversity will doubtless contribute to our understanding of specific biological questions, including some of our more pernicious problems in medicine, agriculture, evolution, and ecology.

We propose that filling in the eukaryotic tree at the genomic level on the basis of phylogenetic diversity should be a priority for the community. We also argue that this can be achieved by a combination of three complementary approaches. First, at least one genome should be sequenced from each underrepresented lineage for which cultures are available. This is a straightforward problem, requiring phycologists, protistologists, culture collection curators, and genomic sequencing centers to coordinate efforts and expertise to choose the best target taxa and sequencing strategies. Second, efforts to culture diverse organisms should be supported by sampling additional areas of the planet, by developing novel techniques to include more recalcitrant species (especially heterotrophs), and by rewarding this difficult but essential task, especially among younger researchers, before they conclude en masse that such crucial work is a professional dead-end. Such efforts are time-consuming and have a built-in failure rate that makes them risky, and policy changes will therefore be helpful so that funding agencies, universities, and research centers recognize the value of such work independently of the publication outcome. Finally, microbial ecologists and genomic centers should embrace the use of SCG and continue to improve the technology, which we believe will be the key to filling in missing parts of the tree in the short term. To coordinate all these efforts, funding agencies should also support the development of community resources such as publicly accessible culture collections and the maintenance of key taxa that are difficult to keep.
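The three complementary approaches can be read as a simple triage over lineages. The sketch below is one hypothetical way to express that logic; the `Lineage` record, its fields, the category labels, and the placeholder lineage names are illustrative assumptions and do not describe an existing tool or dataset.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Lineage:
    name: str
    has_genome: bool    # genome project completed or in progress
    has_culture: bool   # strain deposited in a public culture collection
    has_sag: bool       # at least one single amplified genome available


def suggest_approach(lineage: Lineage) -> str:
    """Map a lineage to the approach that would fill its gap (assumed priority order)."""
    if lineage.has_genome:
        return "already covered"
    if lineage.has_culture:
        return "sequence a cultured strain"
    if lineage.has_sag:
        return "sequence a SAG"
    return "new culturing or single-cell sampling effort"


def triage(lineages: List[Lineage]) -> Dict[str, List[str]]:
    """Group lineage names by suggested approach."""
    plan: Dict[str, List[str]] = {}
    for lineage in lineages:
        plan.setdefault(suggest_approach(lineage), []).append(lineage.name)
    return plan


# Placeholder lineages with made-up statuses, for illustration only.
example = [
    Lineage("lineage_A", has_genome=True, has_culture=True, has_sag=False),
    Lineage("lineage_B", has_genome=False, has_culture=True, has_sag=False),
    Lineage("lineage_C", has_genome=False, has_culture=False, has_sag=True),
    Lineage("lineage_D", has_genome=False, has_culture=False, has_sag=False),
]

for approach, names in triage(example).items():
    print(f"{approach}: {', '.join(names)}")
```

Cultures are checked before SAGs here because a culture currently gives a more reliable genome; a real prioritization would also weigh phylogenetic position, which this sketch leaves out.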

We believe strongly that the time is ripe to reverse the genome sequencing bias in the tree of eukaryotes. We now have in our hands all the elements needed to change this skewed view and further our understanding of eukaryotic biology and evolution. All that is still needed is the will and a joint, coordinated initiative. We therefore hope that the eukaryotic community will welcome this proposal to build a representative and diverse ‘Genomic Encyclopedia of Eukaryotes’ and collaborate to make this happen.