First published Thu Oct 20, 2016

About 30 years ago researchers and other stakeholders started setting up the first genomics initiative, the Human Genome Project (HGP) (see the link to All About the Human Genome Project (HGP) in the Other Internet Resources section below). What was conceived as an audacious plan in the 1980s turned into an official multi-centre, international program in 1990 and was brought to a conclusion in 2003.

More than a decade later genomics is still going strong (and remains big business): the Obama administration announced in January 2015 that it intends to sequence one million human genomes (see Precision Medicine Initiative in the Other Internet Resources section below; see also Reardon 2015). Craig Venter, the commercially minded nemesis of the publicly-funded HGP, is also in the mix again, this time involved in a privately-funded collaboration that aims to sequence two million genomes over the course of the next ten years (Ledford 2016). Equally important, we see not only the same players clash again but also the same promises being made, with talk of “groundbreaking health benefits” and “new medical breakthroughs” appearing once again in press releases and other announcements (see for instance Collins & Varmus 2015 or NIH 2015).

But many things are also different now. For instance, China has emerged as a major player in the genomics field, with the BGI (formerly the Beijing Genomics Institute) announcing as early as 2011 the aim to sequence one million genomes. Moreover, DNA sequencing is no longer the only goal of these large-scale initiatives: the new genomics is of course still a genome-based effort, but it is a transformed enterprise that also collects data about proteins, DNA methylation[1] patterns, or the physiology and environment of the people studied; DNA sequence data now forms only part of a much larger picture in the push for what is called ‘precision’ or ‘personalised’ medicine. Developments such as these have led many to refer to the present as a ‘postgenomic’ age (Richardson & Stevens 2015). The goal of this entry is to look at this constantly developing space of genomic and postgenomic research and outline some of the central philosophical issues it raises.

Section 1 will introduce and discuss several key terms, such as ‘genome’ or ‘genomics’. Section 2 will then turn to the question of what it means to read and interpret the genome. What did the sequencing and mapping of the human genome entail, and what philosophical issues arose in the context of the Human Genome Project? How did sequencing evolve into a much larger ‘postgenomic’ enterprise, and what issues did this transformation bring about? To answer the last question, Section 3 will consider two different projects, perhaps newly emerging fields: the HapMap project and metagenomics. In the supplementary document The ENCODE Project and the ENCODE Controversy, we will look at the ENCODE project and the controversy that surrounded it. These three cases will highlight key issues that come up again and again in the context of genomics and postgenomics.

1. Terminology and Definitions

1.1 Gene—Genome—Genomics

The term ‘genomics’ derives from the term ‘genome’, which itself derives (in part) from the term ‘gene’. The meaning(s) of—and the relationships between—these different terms is by no means simple.

The term ‘gene’ was introduced in 1909 by the Danish biologist Wilhelm Johannsen, who used it to refer to the (then uncharacterised) elements that specify the inherited characteristics of an organism (see gene and molecular genetics entries for an overview of the complex history of the term ‘gene’).

The term ‘genome’ was introduced in 1920 by the German botanist Hans Winkler (1877–1945) in his publication “Verbreitung und Ursache der Parthenogenesis im Pflanzen- und Tierreiche” (Prevalence and Cause of Parthenogenesis in the Plant and Animal Kingdom). Winkler defined the term as follows:

Ich schlage vor, für den haploiden Chromosomensatz, der im Verein mit dem zugehörigen Protoplasma die materielle Grundlage der systematischen Einheit darstellt, den Ausdruck: das Genom zu verwenden […]. (Winkler 1920: 165)

I propose to use the expression ‘genome’ for the haploid set of chromosomes that, in conjunction with the associated protoplasm, represents the material foundation of the systematic unit [often translated as ‘species’]. (Translation by S.G.)

The etymology of the term is not clear, but most authors and encyclopaedia entries assume that it is a combination of the German words ‘Gen’ and ‘Chromosom’, leading to the composite ‘Genom’. In general, the origin and the different meanings of the -ome suffix are not entirely clear and there are now several accounts that try to bring some structure and/or meaning to the ever-flourishing -omes terminology in contemporary life sciences (see, e.g., Lederberg & McCray 2001; Fields & Johnston 2002; Yadav 2007; Eisen 2012; Baker 2013; for interesting/entertaining lists, see -omes and -omics in the Other Internet Resources section below).

The term ‘genomics’, finally, was invented in 1986 at a meeting of several scientists who were brainstorming (in a bar) to come up with a name for a new journal that Frank Ruddle (Yale University) and Victor McKusick (Johns Hopkins University) were setting up. The aim of this journal was to publish data on the sequencing, mapping and comparison of genomes. To capture these different activities, and in analogy to the well-established discipline of genetics, Thomas Roderick (Jackson Laboratory) proposed the term ‘genomics’ (Kuska 1998). Unbeknownst to the people involved, this was a significant moment in the history of the life sciences, as it is here that the -omics suffix appears for the first time.

1.2 What is a Genome?

Looking at the history and the etymology of a term does not, of course, necessarily tell us a lot about how it is used in the context of current science. So what is a genome in today’s life sciences? Is it the (haploid) set of chromosomes we find in the nucleus of a eukaryotic cell, in line with the original definition by Winkler? Or is it the totality of genes we find in an organism or the totality of DNA present in a cell? And if so, which DNA? Most definitions that are currently in circulation are an intricate mix of different ways of approaching the issue. This can be illustrated by looking at the definitions given in several key online resources (for more definitions of the term ‘genome’ see Table 1 in Keller 2011).

The term is defined on the genome.gov website glossary:

The genome is the entire set of genetic instructions found in a cell. In humans, the genome consists of 23 pairs of chromosomes, found in the nucleus, as well as a small chromosome found in the cells’ mitochondria. Each set of 23 chromosomes contains approximately 3.1 billion bases of DNA sequence. (Talking Glossary: genome, in the Other Internet Resources)

And this is how the U.S. National Library of Medicine defines it:

A genome is an organism’s complete set of DNA, including all of its genes. Each genome contains all of the information needed to build and maintain that organism. In humans, a copy of the entire genome—more than 3 billion DNA base pairs—is contained in all cells that have a nucleus. (NIH 2016)

Similarly the education portal of the journal Nature:

A genome is the complete set of genetic information in an organism. It provides all of the information the organism requires to function. In living organisms, the genome is stored in long molecules of DNA called chromosomes. (Scitable: genome, in the Other Internet Resources)

All of these definitions refer both to information and to instructions for the development and/or functioning of an organism. In the first two, the genome is also identified with a material entity, in the first case the chromosomes, in the second a sequence of base pairs. Nature allows only that the information is “stored in” the chromosomes.

The combination of these two aspects is highly problematic. The definition from the U.S. National Library of Medicine implies that “all of the information needed to build and maintain that organism” is contained in the DNA,[2] which is certainly false: many environmental factors, not to mention factors in the maternal cytoplasm, are required for the first task, and even more obviously (food, light, etc.) for the second. Moreover when, as is almost always the case, an organism requires symbiotic partners for its proper functioning, such a definition will imply that the DNA of these symbionts is part of the genome of the first organism, a result that few would welcome.[3] The Nature definition commits the same error in its second sentence. The genome.gov definition, finally, identifies the chromosomes both with a set of instructions and with a material entity, thereby conflating a material object with an abstract one.

The problem is not hard to see. Attempting to combine aspects of the material base of the genome and its informational content, as all these definitions do, inevitably assumes some simple relation between the two; but in fact the relationship is extremely complex. Because the informational content of the genome depends in multiple ways on elements that are not, on any account, part of the genome, an account purely in terms of informational content seems a hopeless project.

One commonly held view that can be quickly dismissed is the idea that the genome is just the sum total of an organism’s genes. Even passing over the well-known problems with saying what a gene is (see Barnes & Dupré 2008; SEP entry on the gene), on any tenable account of genes there is far more to the genome than genes: only a fraction of the actual DNA contained in the chromosomes would be part of the genome, at least in the case of humans and other organisms that have a relatively large amount of non-coding DNA (Barnes & Dupré 2008: 76).[4] Even if ‘gene’ is interpreted in the widest possible sense, including any section of the genome that has some identifiable function, no one denies that a significant amount of DNA is not functional. On this view the rest of the DNA would not form part of the genome, an outcome that contradicts all definitions of the genome of which we are aware, and makes nonsense of such familiar concepts as ‘whole-genome sequencing’, which refers to the analysis of all the DNA found in the chromosomes.
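A back-of-envelope calculation makes vivid just how little of the genome would remain on the genes-only view. The sketch below uses the genome size and gene count mentioned elsewhere in this entry; the average coding length per gene is our own rough illustrative assumption.

```python
# Back-of-envelope sketch: how much of the human genome is protein-coding?
# Round figures; avg_coding_bp is an assumed illustrative value.
genome_bp = 3.2e9          # approximate human genome size in base pairs
n_genes = 20_000           # roughly the current protein-coding gene estimate
avg_coding_bp = 1_500      # assumed average coding sequence per gene

coding_bp = n_genes * avg_coding_bp
coding_fraction = coding_bp / genome_bp

print(f"coding DNA: {coding_bp / 1e6:.0f} Mb, "
      f"about {coding_fraction:.1%} of the genome")
# -> coding DNA: 30 Mb, about 0.9% of the genome
```

Even generous variations on these assumed figures leave the coding portion at only a few percent of the total DNA, which is why a genes-only definition excludes almost everything that ‘whole-genome sequencing’ actually sequences.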

There are, we suggest, two initially tenable approaches to the problem.[5] The first, and one that is often implicitly or explicitly assumed to be correct, is to define the genome as the sequence of nucleotides. This may or may not contain extranuclear DNA, as in mitochondria or chloroplasts; the genome.gov definition explicitly includes the former. This last question figured largely in debates over the moral permissibility of so-called mitochondrial transplants (a designation that speaks volumes, incidentally, about the almost magical importance attached to DNA as opposed to the remaining contents of the cell), but it is not one of great philosophical significance. The alternative approach is to understand the genome strictly as a material object, presumably, in most cases, the nuclear chromosomes.

The problem with the first approach is that it is largely motivated by the assumption that the nucleotide sequence is what contains all the important information in the genome. But it has become increasingly clear that this is not the case, especially as a result of the growing understanding of epigenetics. Epigenetics is the study of material modifications of the genome that affect which parts of the genome sequence are or are not transcribed into RNA, the first stage of the process by which the genome influences the containing organism. The two best-studied classes of epigenetic modification are methylation, the attachment of a methyl group (-CH₃) to the nucleotide cytosine, and various chemical modifications of the histone proteins, the proteins that form the core structure of the chromosomes and around which the DNA double helix is wrapped (Bickmore & van Steensel 2013; Cutter & Hayes 2015). The nucleotide sequence, then, provides the (extremely large) set of possible transcripts that the genome can produce, but the epigenetic state of the genome determines which transcripts are actually produced (Jones 2012). Both features of the genome (qua material object) must therefore be specified if we want to understand the biologically relevant behavior of the whole system.

So if the motivation for defining the genome in terms of sequence is to capture its informational content, the definition fails to serve its goal. Indeed, the definition that will come closest to this goal is that which identifies the genome as the material object, the set of chromosomes (this interpretation of the genome is defended in detail in Barnes & Dupré 2008). An implication of this definition that is often taken to be counterintuitive by biologists is that the genome will on this account encompass not only DNA, but the histone proteins that are material parts of the chromosomes. But of course the point of the preceding discussion is that the variable chemical states of the histones are, in fact, essential bases for some of the information inherent in the genome.

The phenomenon of methylation makes a similar point in a slightly different way. The nucleotides that comprise the familiar sequence are cytosine, thymine, adenine and guanine. When a methyl group attaches to a cytosine molecule, the resultant nucleotide is not, strictly speaking, cytosine, but 5-methylcytosine. So unless one takes the letter ‘C’ in the standard representation of a sequence to mean, rather counterintuitively, “cytosine or 5-methylcytosine”, that representation is only partially accurate with respect to the feature of the genome it purports to represent. More importantly, it fails to capture crucial functional aspects of the genome.
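The representational point can be made concrete with a small sketch. The encoding below is a toy convention of our own (not a standard sequence format): 5-methylcytosine is written ‘m’, plain cytosine ‘C’, so that collapsing back to the four-letter alphabet visibly discards the methylation information.

```python
# Toy convention (not a standard format): 'm' marks 5-methylcytosine,
# 'C' plain cytosine. Collapsing 'm' to 'C' recovers the conventional
# four-letter representation -- and throws away the epigenetic state.
def to_four_letter(seq: str) -> str:
    """Collapse methylated cytosines into plain 'C'."""
    return seq.replace("m", "C")

def methylated_positions(seq: str) -> list[int]:
    """Indices at which cytosine carries a methyl group."""
    return [i for i, base in enumerate(seq) if base == "m"]

seq = "ATmGCTmGA"                  # two 5-methylcytosines, positions 2 and 6
print(to_four_letter(seq))         # -> ATCGCTCGA
print(methylated_positions(seq))   # -> [2, 6]
```

The point of the sketch is that the four-letter output is a lossy projection: two chemically and functionally distinct genomes can collapse to the same ‘C’-containing string.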

A final telling point is that it has recently become clear that there are functions of the genome, as material object, that go well beyond even the broadest interpretation of the genetic (Bustin & Misteli 2016). It appears that the genome plays an essential role in a range of cellular processes. First, its physical arrangement into domains of varying sizes plays a central role in the coordination of gene expression. But much further from the genetic, it is also a large physical object whose mechanical forces are involved in various cellular processes and in cellular homeostasis, and whose chromatin fiber provides a scaffolding for both proteins and membranes (Bustin & Misteli 2016). Unless we are to introduce a new word to refer to this biologically vital entity, only a material conception of the genome can capture the full range of its activities.

One might be tempted to object to the argument above concerning methylation that, whereas methylation is a somewhat transitory state, the underlying four-letter sequence is extremely durable, lasting across many generations. Richard Dawkins (1976) famously emphasized the importance of this durability in evolution, even going so far as to describe genes as “immortal”. So perhaps there is a good reason for understanding ‘C’ as referring to a disjunction.

This is not the place to address the quasi-theological view of gene immortality. However, this does point to a fundamental issue about the nature of genes. Even if genes, somehow, were unchanging immortal substances, the genome is nothing of the sort. It is an extremely dynamic entity, constantly changing its properties in generally adaptive response to its environment. Moreover, even the constancy of its nucleotide sequence is something maintained only by the continuous application of various editing and repair mechanisms. Indeed, far from being an eternal substance, we suggest it is much better seen as a process, a highly complex set of dynamic activities crucial in maintaining the structural and functional stability not only of the organism but also, through its role in reproduction, of the lineage. Importantly, these relations are bi-directional and, specifically, the organism is also crucial to maintaining the necessary aspects of stability of the genome.[6]

2. Reading the Genome

The first genome to be sequenced was that of a virus, namely bacteriophage ΦX174, sequenced by Frederick Sanger in 1977 (Sanger et al. 1977). Up to about 1985, work on several other viruses was initiated in different laboratories across the world and even the sequencing of model organisms such as the bacterium Escherichia coli or the roundworm Caenorhabditis elegans was being tackled.[7]

Of all the different sequencing efforts at the time the Human Genome Project (HGP) of course stands out. Not only is the human genome relatively large (roughly 3.2 billion base pairs (bp)) and of key interest to us as human beings, but the HGP itself was envisioned as a diverse large-scale research project with various strands and aims. Producing the sequence was the goal that received the most attention in the wider media, but many would agree that other findings and practices developed within the HGP were of equal or even greater importance.

In what follows we will treat the HGP as a pivot around which genomics developed as a field of research and as a set of techniques. For ease of exposition we will talk here of a pre-HGP and a post-HGP phase. Obviously, this is a simplification; there is not just one single trajectory along which the story of genomics runs and there is not one clear break between a pre- and a post-genome era (Richardson & Stevens 2015). Nevertheless, as a way of structuring the discussion this distinction will be a helpful tool.

2.1 The Run-up to the HGP

A decade after Sanger and, independently, Maxam and Gilbert published their DNA sequencing methods in the mid-1970s, the first concrete talk of a human genome project started to appear in writing (Dulbecco 1986) and at different workshops (Sinsheimer 1989; Palca 1986). The Human Genome Project (HGP) itself became a reality in 1990 when it was officially launched as a US federal program (see 1990 in a brief history and timeline [NHGRI] in the Other Internet Resources section below).

In the run-up to the HGP high expectations (some would say “hype”) developed, which inevitably also brought critics of the project onto the scene (Koshland 1989; Luria et al. 1989). As so often, the issue of funding had a key role to play. When the HGP was initiated there were no ‘big science’ projects being pursued in the life sciences; the HGP was therefore a true first for biology. But pushing through such a large project, one that absorbed a significant proportion of the funding allocated to the biological sciences, met with considerable resistance from other scientists.

There were three key criticisms. 1) Some claimed that the HGP was a waste of money because much useless (read: junk) DNA would be sequenced; the focus should instead be on the functional parts of the genome, i.e., the genes or regulatory elements, which could be studied using simpler and less expensive methods (Brenner 1990; Weinberg 1991; Rechsteiner 1991; Lewontin 1992; Rosenberg 1994). 2) Others claimed that the HGP was a waste of money because it was merely a descriptive and not a hypothesis-driven project. This issue became much more prominent ten years after the project was finished, when it became clear that big data science was here to stay (see, e.g., Weinberg 2010).[8]

3) And last but not least there was the critique that the HGP was fundamentally misguided in assuming that sequence knowledge alone would allow us to understand how our body works and how it develops disease, and that this understanding would eventually lead to cures for many diseases (Lewontin 1992; Tauber & Sarkar 1992; Kitcher 1994). This more general critique of a narrowly sequence-focused approach to biomedical issues also comes up 20 years later in discussions about the use of common genetic variants to learn more about common diseases and traits (see Section 3.1.2).

It is difficult to evaluate criticisms of the last kind. There is no doubt that enthusiasm for the HGP and many other successor projects in genomics has often been grounded in simplistic assumptions about the power of DNA and its pre-eminent role in biological systems. On the other hand it is arguable that many unanticipated benefits have derived from genomics quite independently of such assumptions. For instance the ability to make very precise comparisons of genome sequences has led to major advances in unraveling the details of evolutionary history, not to mention its application to technologies such as forensic DNA testing. Moreover, it can be argued with Waters (2007b) that what makes genomes so central to biological research is not the erroneous belief that they are the ultimate causes of everything, but rather the unique possibilities they present for precise intervention in organisms or cells.

2.2 The Results and Impact of the HGP

The main output of the HGP is usually seen as ‘the’ human genome sequence. The draft human genome sequence (about 90% complete) was announced in June 2000, followed in 2001 by the publication of the draft sequences produced by the HGP (International Human Genome Sequencing Consortium 2001) and the privately funded initiative (Venter et al. 2001). The complete (or almost complete, at about 99%) sequence of the human genome was released in 2003, which also marked the official ending of the HGP (International Human Genome Sequencing Consortium 2004).

But the view that the sequence of ‘the’ human genome was the key output is wrong in several ways. First of all, there is in general no such thing as ‘the’ human genome, as each individual (except monozygotic twins) carries their own set of small and large variations in their genome, and even monozygotic twins accumulate many differences in their genomes during their lifetimes. The sequence that was produced in the HGP is therefore nothing more than one particular example, meaning it can only serve as a reference genome. Importantly, the reference sequences that both the HGP and Venter’s project delivered did not correspond to the genome of a single person, as the DNA used to produce them was derived from several individuals.[9] The genomes that came out of the two sequencing efforts were therefore composite reference sequences. But the HGP also produced much more than just a DNA sequence. Here we will highlight three outcomes or aspects of the HGP that are of particular importance, also for the period that followed the completion of the project.

One key feature of the HGP was that it involved the sequencing of a range of different model organisms, an aspect of the HGP that was often overlooked in discussions of the project in the philosophical literature and elsewhere (Ankeny 2001; for a searchable list of sequenced genomes see genome information by organism in the Other Internet Resources section below). The HGP provided not only a first reference genome of Homo sapiens but also the first bacterial genome (Haemophilus influenzae, Fleischmann et al. 1995), the first eukaryotic genome (Saccharomyces cerevisiae, Goffeau et al. 1996), and the genomes of key model organisms (Escherichia coli, Blattner et al. 1997; Caenorhabditis elegans, C. elegans Sequencing Consortium 1998; Arabidopsis thaliana, Arabidopsis Genome Initiative 2000; Drosophila melanogaster, Adams et al. 2000, Myers et al. 2000).[10]

A further crucial output was the acceleration in technology development the HGP brought about. It is safe to say that without the HGP (and subsequent initiatives such as the Advanced Sequencing Technology Awards created in 2004 by the National Human Genome Research Institute (NHGRI) (NIH 2004)) there wouldn’t have been such a rapid development in next-generation sequencing (NGS) approaches and the cost of whole genome sequencing would not have dropped as quickly as it has (see Mardis 2011 for a review of the development of NGS). And these improvements in the sequencing technology had further consequences, for example allowing scientists to sample DNA in different ways and from different sources, as new sequencing methods could process more DNA material more quickly and work with less starting material. This, finally, made possible whole new sub-disciplines, such as metagenomics (see Section 3.2).

A final noteworthy output of the HGP is what scientists learned about the structure of the genome. Beginning with the HGP, and building on further studies, researchers have gained a much more detailed picture of the fine structure, the dynamics and the functioning of the human genome. Not only were there many fewer genes present than expected, but there was also much more repetitive DNA, including transposable elements (it is estimated that about 45% of human DNA consists of transposable elements or their inactive remnants). These findings relate to a more general and older discussion about genome size and complexity, to which we turn next.

2.3 Genome Size, the C-value Paradox and Junk DNA

It has been known since the 1950s that genome size varies greatly between different organisms (Mirsky & Ris 1951; see also Gregory 2001), but from the very beginning it was also clear that this diversity has some surprising features. One of these features is the absence of correlation between the complexity of an organism and the size of its genome.

2.3.1 The C-value Paradox

Assuming an informational account of the genome one would expect that the more complex an organism is, the more DNA its genome should contain (this is in fact what many biologists assumed at least until about the 1960s). How to define and assess the complexity of an organism is a tricky issue, but intuitively it seems reasonable to assume that a single-celled amoeba is less complex than an onion, which in turn is less complex than a large metazoan such as a human being, both in terms of the complexity of the workings and the structure of the organism. The expectation was that the DNA content of human cells should be much larger than that of onions or amoebae. As it turns out, however, both the onion and the amoeba have much larger genomes than human beings. The onion, for instance, has a genome of about 16 billion base pairs, meaning it is about five times the size of the human genome (Gregory 2007). The same lack of correlation between genome size and complexity can be found in many other instances (for an overview of different genome sizes see the animal genome size database in the Other Internet Resources section below).

It was also found early on that very similar species in the same genus show large variation in genome size, despite having similar phenotypes and karyotypes (i.e., number and shape of chromosomes in a genome). Within the family of buttercups, for instance, DNA content varied up to 80-fold (Rothfels et al. 1966). Also, Holm-Hansen (1969) showed that species of unicellular algae display a 2000-fold difference in DNA content despite all being of similar developmental complexity. It was findings such as these that gave real urgency to addressing this discrepancy, which was now labelled the C-value paradox (Thomas 1971). The term ‘C-value’ refers to the constant (‘C’) amount (‘value’) of haploid DNA per nucleus and is measured in picograms of DNA per nucleus. The C-value is thus a measure of the amount of DNA each genome contains (we can see here Winkler’s original definition of the genome at work).
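For readers who want to relate C-values (reported in picograms) to sequence lengths (reported in base pairs), a widely used approximation takes 1 pg of DNA to correspond to roughly 0.978 × 10⁹ bp. The sketch below applies that standard conversion factor; treat the resulting figures as approximate.

```python
# C-values are reported in picograms; a widely used approximation
# (derived from the mass of a nucleotide pair) is 1 pg ~ 0.978e9 bp.
BP_PER_PG = 0.978e9

def pg_to_bp(c_value_pg: float) -> float:
    """Convert a C-value in picograms to base pairs (approximate)."""
    return c_value_pg * BP_PER_PG

print(f"{pg_to_bp(1.0) / 1e6:.0f} Mb per pg")      # -> 978 Mb per pg

# The human haploid genome (~3.2 Gb) corresponds to roughly:
human_pg = 3.2e9 / BP_PER_PG
print(f"{human_pg:.2f} pg")                        # -> 3.27 pg
```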

2.3.2 Junk DNA

These discussions of genome sizes were closely related to concerns about gene numbers. And this consideration of genome size vs. gene numbers is what originally gave rise to the concept of ‘junk DNA’ (Ohno 1972).[11] The reasoning behind this concept was the following: if one assumes a) that more complex organisms will have more DNA than less complex organisms and b) that gene numbers increase in proportion with genome size, then the genome of the more complex organism should have more genes than the less complex one.[12] Human cells, for instance, contain about 750 times more DNA than E. coli; since E. coli has about 5,000 genes, humans should then turn out to have in the range of 3.7 million genes. This is clearly not the case: even in the 1970s it was generally supposed that the human genome might contain no more than 150,000 genes (Crollius et al. 2000). This discrepancy led to the conclusion that the vast majority of the DNA in our genome cannot be genes and is therefore what Ohno referred to as ‘junk’.[13]
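The (flawed) extrapolation behind this reasoning can be made explicit, using the round figures from the text:

```python
# If gene number scaled with genome size (assumption b in the text),
# the human gene count would follow from E. coli's:
ecoli_genes = 5_000        # approximate E. coli gene count
dna_ratio = 750            # human cells hold roughly 750x more DNA

predicted_human_genes = ecoli_genes * dna_ratio
print(predicted_human_genes)   # -> 3750000, i.e., ~3.7 million genes
```

The gap between this prediction and the actual count (now estimated at around 20,000 genes) is what forced the conclusion that most human DNA is not genes.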

The problem that the junk DNA discussion brings up has also been referred to as the ‘G-value paradox’ (‘G’ stands for ‘gene’), which directly concerns the discrepancy between the number of genes in an organism and its complexity (Hahn & Wray 2002). This paradox has been reinforced by the findings of the HGP. As Gregory (2005) and other commentators have pointed out, the finding that the human genome contains many fewer genes than expected was one of the most surprising outcomes of the HGP. Initial estimates from before the project were in the range of 50,000 to 150,000. These were reduced to about 30,000–35,000 after the publication of the first sequence draft in 2001 and have now been further revised to the order of 20,000 (Gregory 2001).

Some researchers assumed that the C-value paradox was fully resolved by the recognition that there is non-coding DNA in genomes (Gregory 2001). Larger genome size in ‘simpler’ organisms merely means that they have large quantities of non-coding DNA. But as Gregory points out, the fact that the majority of DNA in our genomes is non-coding might make the C-value discrepancies less of a paradox, but it gives rise to a whole range of further puzzles (Where does this extra DNA come from? What is its function? Etc.), which is why he proposes to talk of the C-value as an enigma rather than a paradox (Gregory 2001). The C-value enigma consists of many different and layered problems and these require a pluralistic approach to answering them, or so Gregory claims.

The publication of the draft genome sequence in 2001 and the conclusion of the HGP in 2003 did not give researchers all the tools and insights they needed to tackle these long-standing problems. But after the HGP, building on the initial sequencing effort, researchers could start to go beyond the mere sequence and gain a deeper understanding of the workings of the genome. This put them in a position to tackle issues such as the significance of junk DNA and the C-value paradox more directly (or at least from a different angle). The post-HGP phase is also characterized by an intense debate about the best way of doing research: the question of whether biological research should best be done on a small or a large scale has come up again and again in the post-HGP era, especially with the rise of other post-HGP large scale projects. The next section will address two projects/research fields that symbolize the various efforts and aspirations that were characteristic of the post-HGP era and which will help to illuminate some of the philosophical issues these developments raised.

3. Beyond Sequencing

The post-HGP phase is marked by a flourishing of different projects, closely connected in their origins to the HGP, but going beyond it in many different ways. This section discusses two such post-HGP projects, namely the International HapMap project and a new field of research called ‘metagenomics’. These examples indicate some important directions in which the postgenomic era is heading and identify some, though certainly not all, of the key characteristics and issues that mark this new period.

3.1 The International HapMap Project

The International HapMap project was a multi-centre project launched in 2002 that came to an initial conclusion in 2005 (NIH 2002).[14] The acronym ‘HapMap’ stands for ‘haplotype map’ and (indirectly) refers to the main goal of the project, namely to map the common genetic variation in the human genome.

3.1.1 The HapMap Project and Genomic Variation

It is a well-known fact that everyone’s genome is different. There are, however, several ways in which the genomes of individuals can vary from each other, ranging from the deletion, insertion or rearrangement of longer stretches of DNA to differences in single nucleotides at specific locations on a chromosome. The latter form of variation was the focus of the HapMap project. If we align the DNA sequences of two individuals, they will be identical for hundreds of nucleotides at a stretch; the DNA of two human beings typically displays about 99.9% sequence identity (Li & Sadler 1991; Wang et al. 1998; Cargill et al. 1999). But the remaining 0.1% means that, on average, any two individuals will differ in a single nucleotide approximately every 1000 nucleotides.

Each variant form of the DNA found at a specific genomic locus is referred to as an ‘allele’. If there are two different versions of a specific gene that can be found in a population at a specific locus on a chromosome, then there are two different alleles of that gene present in that population.[15] If one of these single nucleotide alleles is found in more than 1% of a specific population it is treated as a ‘common’ variant and researchers speak of a ‘polymorphism’ or, more precisely, a ‘single nucleotide polymorphism’ (abbreviated ‘SNP’; pronounced ‘snip’). If a variant is found in less than 1% of the population researchers simply call it a ‘mutation’ (or a ‘point mutation’).[16] On average about 3 million SNPs are found in each individual and there is a pool of more than 10 million SNPs present in the human population as a whole (HapMap 2005).
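
The two frequency figures above can be made concrete in a short sketch. This is purely illustrative (the function names and the 3.2-billion-base genome length are assumptions made for the example, not part of the HapMap project itself): it applies the 1% convention separating a SNP from a point mutation, and reproduces the back-of-envelope arithmetic by which 99.9% sequence identity yields roughly one single-nucleotide difference per 1000 nucleotides.

```python
def classify_variant(minor_allele_frequency):
    """Label a single-nucleotide variant by its frequency in a population.

    The 1% cut-off is the convention described in the text: above it the
    variant counts as a polymorphism (SNP), below it as a point mutation.
    """
    if minor_allele_frequency > 0.01:
        return "SNP (common polymorphism)"
    return "point mutation (rare variant)"

def expected_differences(sequence_identity, genome_length):
    """Expected number of single-nucleotide differences between two genomes."""
    return round((1 - sequence_identity) * genome_length)

print(classify_variant(0.15))   # common variant, so a SNP
print(classify_variant(0.002))  # rare variant, so a point mutation
# 99.9% identity over an (assumed) 3.2-billion-base genome:
print(expected_differences(0.999, 3_200_000_000))  # 3200000
```

The last line matches the figure cited in the text: roughly 3 million single-nucleotide differences between any two individuals.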

Many of these alleles are (or have an increased likelihood of being) inherited together, meaning that they do not easily become separated through recombination events during meiosis.[17] This leads to the non-random association of different alleles at two or more loci, a phenomenon that has been dubbed ‘linkage disequilibrium’ or ‘LD’. The concept of LD is key to the HapMap project, as the fact that some SNPs stay associated (whereas the clusters themselves might get separated from each other over time by recombination events) explains the haplotype structure of the genome (Daly et al. 2001). The term ‘haplotype’ simply refers to a particular cluster of alleles (in this case SNPs) that a) are on the same chromosome and b) are commonly inherited as a unit. The aim of the HapMap project was to characterize human SNPs, their frequency in different populations and the correlations between them (HapMap 2003). The first haplotype map was published in 2005, reporting on data from 269 samples derived from four different populations (HapMap 2005). Five years later, a follow-up was published, now reporting on data from 1184 individuals sampled from 11 different populations (HapMap 2010).
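
The notion of linkage disequilibrium can be given a simple quantitative form. The sketch below computes the standard LD measures D and r² for two biallelic loci from haplotype counts; the counts are invented for illustration, and the function is a minimal textbook version rather than the machinery actually used by the HapMap project.

```python
def linkage_disequilibrium(haplotype_counts):
    """Compute D and r^2 for two biallelic loci.

    haplotype_counts maps the four haplotypes 'AB', 'Ab', 'aB', 'ab'
    to their observed counts in a sample of chromosomes.
    """
    n = sum(haplotype_counts.values())
    p_AB = haplotype_counts["AB"] / n
    p_A = (haplotype_counts["AB"] + haplotype_counts["Ab"]) / n  # frequency of allele A
    p_B = (haplotype_counts["AB"] + haplotype_counts["aB"]) / n  # frequency of allele B
    D = p_AB - p_A * p_B  # D = 0 means the alleles associate at random
    r2 = D ** 2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return D, r2

# Alleles A and B almost always travel together, so LD is strong:
D, r2 = linkage_disequilibrium({"AB": 45, "Ab": 5, "aB": 5, "ab": 45})
print(f"D = {D:.3f}, r^2 = {r2:.2f}")  # D = 0.200, r^2 = 0.64
```

An r² close to 1 is what makes one SNP in a haplotype a usable proxy (‘tag SNP’) for the others.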

The realization that the structure of genetic variation in the genome can be understood in terms of haplotypes was important for at least two reasons. First it opened the door for a relatively easy and efficient analysis of (single nucleotide) genetic variation in populations: the clustering of SNPs meant that in principle only one or a few of the SNPs in each cluster (so-called ‘tag SNPs’) would have to be tested to verify the presence of the cluster of variants as a whole. This made the analysis of genetic variation at the level of whole genomes from a large number of subjects feasible at a time when whole-genome sequencing was still too expensive for such a task (HapMap 2003). The development of a haplotype map was therefore a crucial step to enable what are now called ‘genome-wide association studies’ (GWAS) (see Section 3.1.2).

Second, as the distribution of haplotypes varies between different populations, the HapMap project had a strong focus on sampling DNA from different populations. This is an important aspect of this type of research as it brought, unwittingly perhaps, the issue of race and the question of its biological basis right back into genomics. This point will be revisited in Section 3.1.4.

3.1.2 The HapMap, GWAS and the Idea of Personalized Medicine

A key point driving the HapMap project was the fact that SNPs can be used to uncover connections between an individual’s DNA sequence and specific conditions or traits. At face value a SNP is simply a distinguishing mark in the genome of a person. Such marks allow researchers to screen groups of a population with different phenotypes, for instance those with a condition (e.g., high blood pressure) and those without. By comparing the frequency of specific SNPs or haplotypes in the two groups, researchers can use statistical analysis to gain insight into the association between a particular SNP or haplotype and a trait (Cardon & Bell 2001). As mentioned above, this analysis can be focused on tag SNPs that are treated as proxies for a whole cluster of SNPs (if the cluster has a high LD).
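
The statistical core of such a case-control comparison can be sketched with a simple chi-square test on a 2×2 table of allele carrier counts. All counts below are invented, and real genome-wide studies use more elaborate models and corrections for multiple testing; this is only meant to show the shape of the reasoning.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 contingency table [[a, b], [c, d]].

    a, b: allele carriers / non-carriers among people with the condition
    c, d: allele carriers / non-carriers among controls
    """
    n = a + b + c + d
    # Shortcut formula for a 2x2 table:
    # n * (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# 60 of 100 cases carry the allele, versus 40 of 100 controls:
stat = chi_square_2x2(60, 40, 40, 60)
print(f"chi-square = {stat:.2f}")  # compare against the 1-df 5% critical value, 3.84
```

A statistic above the critical value suggests a non-random association between the allele and the condition, which is exactly the kind of signal a GWAS looks for across hundreds of thousands of SNPs at once.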

Once a haplotype has been associated with a particular condition, other people can be screened for the presence of that haplotype and thereby gain some understanding of the risk groups they belong to. Although the test will not tell carriers of disease-linked SNPs whether they will develop the condition or not, it can nevertheless give them some information about their chances. Furthermore, even though the tag SNP itself might not be the genetic variation that causes or contributes to the variation in phenotype, it might be linked to so-called ‘causal SNPs’. Learning about SNPs associated with a condition or trait can therefore give researchers clues as to which genes or regulatory DNA regions might be causally involved in the development of that condition. Findings from association studies can thus in some cases contribute to the analysis of the condition itself.

The HapMap initially only looked for common variants (SNPs by definition include only common variants). This was in line with the so-called common disease/common variant (CD/CV) hypothesis formulated by Lander (1996), Cargill et al. (1999), and Chakravarti (1999).[18] This hypothesis postulates, roughly, that common conditions are linked to genetic variations that are common in a population.

This link between common variants and common diseases also explains why the HapMap project could be promoted from the very beginning as the ‘next big thing’ after the sequence of the human genome had been determined: it was with the haplotype map that genomics would really start to have an impact on biomedical research and ultimately on our understanding of disease.[19]

3.1.3 The HapMap and its Critics

But the HapMap project was not without its critics; indeed the biologist David Botstein called it a “magnificent failure” (cited in Hall 2010).[20] Some commentators, for instance, were worried that the project was nothing more than a make-work project filling a gap that the finished HGP left behind, and therefore a waste of precious funds (Couzin 2002). But more often, criticism of the HapMap project was part of wider debates about the way post-HGP research should be conducted. The HapMap project can therefore provide a useful window on some of the key tendencies and disputes that marked (or marred) the post-HGP era.

One such indirect criticism of the HapMap derives from the apparent failure of GWAS to lead researchers to clearer information about the links between our genetic makeup and the different conditions to which our bodies can succumb. In the eyes of these critics the CD/CV hypothesis was the key problem, as the common variants simply do not explain much of the heritability of common diseases. This observation gave rise to the concept of ‘missing heritability’ (Eichler et al. 2010).

The general focus on common variants in genomics was criticized by other authors, who claimed that the focus of geneticists should rather be on rare variants (McClellan & King 2010). These rare variants, they claim, are where the missing heritability will be found. The problem with rare variants is that they cannot be picked up in GWAS that use SNP databases, as SNPs are by definition common variants. Finding rare variants is also a technical challenge, as researchers have to analyse the genomic data of a very large number of individuals to do so reliably. This hunt for rare variants is a major reason behind the current push for the sequencing of millions (rather than a few hundred or thousand) of genomes. As discussed earlier, such large-scale approaches have become feasible in recent years due to the reduced cost and increased speed of next-generation DNA sequencing.

The current shift to whole-genome sequencing will also help to address another critique of the GWAS/SNP/HapMap approach, namely its strict focus on single base pair changes in the genome. Other changes in the genome, such as variations in the numbers of copies of repeated elements or rearrangements, deletions or insertions of larger chunks of genomic DNA, might in many cases be what is at the core of a disorder, necessitating (again) a shift in focus away from point mutations and single genes to the genome as a whole (Lupski 1998, 2009).

As one of the first follow-ups to the original HGP, the HapMap project was a topic that often came up in discussions of the legacy of the HGP. Such discussions became especially prominent at the tenth anniversary of the publication of the draft genome sequence. In general, there was an overwhelming sense of disappointment at what had come out of the HGP, at least in the medical context. Given the grand promises that were made both around the start of the project in the 1980s and then again in the year 2000 at the presentation at the White House,[21] it is not surprising that people were unimpressed by what had been delivered by 2010/2011. Interestingly, it was not only the usual suspects, such as Lewontin (2011), but also key proponents of the HGP itself who were critical and pointed out the minimal medical advances that had been achieved in the first post-HGP decade (Collins 2010; Venter 2010).

However, one thing that all critics, including the above-mentioned, agreed on was that even though its effect on medical practice had been negligible, the HGP had transformed biological research (see for instance Wade 2010; Varmus 2010; Hall 2010; Butler 2010; Green et al. 2011). One area in which genomic research had fundamentally changed both concepts and practices was the understanding of what a gene is and how gene expression works and is regulated (Keller 2000; Moss 2003; Dupré 2005; Griffiths & Stotz 2006; Stotz et al. 2006; Check 2010). With great foresight, Evelyn Fox Keller pointed out as early as 2000 that the HGP was interesting not so much because of the raw sequence it produced, but more because of the transformations it brought about in our expectations when it comes to ‘genes’ and DNA (Keller 2000).

3.1.4 The HapMap, Genomics and Race

As mentioned above, HapMap’s use of samples from different populations brought the concept of race into discussions of the project. Studies that looked into the genetic variation between population groups (of which the HapMap was a key representative) are among several recent developments (Duster 2015) that reignited discussion about a) the biological reality of race and b) the question whether racial classifications should be used in biomedical research at all. Several authors have picked up the relation between the HapMap project and a renewed concern with race (see, e.g., Ossorio 2005; Duster 2005; Hamilton 2008). The question that dominates these discussions is whether racial classifications reflect a ‘biological reality’.

Race has of course been an important topic in epidemiology and clinical research for a long time (Witzig 1996; Stolley 1999), but it has been widely perceived as a socially constructed category that has no biological basis.[22] And many researchers imagined that as the HGP demonstrated how highly similar any two human beings are to each other at the DNA level, any idea of race as a serious biological concept would be disposed of once and for all (see, e.g., Gilbert 1992; Venter 2000). But the concept of biological race was if anything rejuvenated rather than laid to rest by the developments in genomics (Kaufman & Cooper 2001; Foster & Sharp 2002; Hamilton 2008; Roberts 2011). This is exemplified by the fact that more and more scientists have claimed in recent years that there is a biological basis to our traditional notions of race, basing their claims on elaborate statistical analyses of data on genetic variation derived from a large number of human DNA samples. These developments led, for many, to what Troy Duster has called a ‘post-genomic surprise’ (Duster 2015).

An important point here is that linking genomics and race does not mean that researchers search for, or even that there are, any ‘genes for race’, even if we consider the many different ways in which this term can be interpreted (Dupré 2008). The discussion about the possible genetic basis for race is now more subtle, as it is not simply concerned with the presence or absence of specific genes or DNA elements and hence some sort of biological essence of races, but rather with the variation in the frequencies of alleles in the population of interest (Gannett 2001, 2004). The question is therefore not whether DNA element X is absent or present in one population or the other, but rather which variant of X is present at what frequency in a population (in the context of the HapMap researchers will talk of SNP frequencies).

Data from population genetics shows that the global distribution of allele frequencies in the human population is not discontinuous (Jorde & Wooding 2004; Feldman & Lewontin 2008) but clinal, meaning that human DNA sequences vary in a gradual manner over geographic space (Livingstone 1962; Serre & Pääbo 2004; Barbujani & Colonna 2010). Moreover, both genetic and phenotypic traits display what is called ‘nonconcordant’ clinal variation, meaning that different traits do not necessarily co-vary with each other; the pattern of how trait A varies across geographic space might be very different from the pattern displayed by trait B (Livingstone 1962; Goodman 2000; Jorde & Wooding 2004).

But despite these widely accepted findings, it is in the discussion of these distributions that the idea of a biological basis for our traditional understanding of race classifications has re-emerged. Based on the analysis of large sets of genetic variants in samples derived from various locations around the globe, a number of researchers have made the claim that human genetic variation displays geographical clustering (see, e.g., Rosenberg et al. 2002; Edwards 2003; Burchard et al. 2003; Bamshad et al. 2003; Leroi 2005; Tang et al. 2005). Importantly, these findings often also gave rise to, or were interpreted to support, the claim that this geographical distribution matches our traditional racial classifications.

Such findings also led a number of authors to claim that race still has a valid place in biomedical research: since these classifications are supposed to describe groups that are internally genetically similar, but genetically different from other groups, they can serve as useful proxies in estimating, for instance, a group member’s average risk of developing a particular condition (see, e.g., Xie et al. 2001; Wood 2001; Risch et al. 2002; Rosenberg et al. 2002; Shiao et al. 2012). Some authors are more cautious and claim that race should only serve as a loose and temporary proxy (Foster & Sharp 2002; Jorde & Wooding 2004) that should be abandoned as soon as we know the actual genetic variations that are linked to a particular condition or trait (Jorde & Wooding 2004; Leroi 2005; Dupré 2008). Such critics may note that the most that these genetic studies show is that there is a correlation between a person’s genetic variants and their geographical origin, if only because variants originate in a specific place; and there is a loose relation between the socially constructed concept of race and geographic origin. But given the tenuous connection that this generates between perceived or self-identified racial categories and genetic constitution, race is a poor substitute for any actually salient genetic information that may eventually be related to disease.

But there is also a significant group of researchers who are not convinced by these analyses and who don’t think that there is any biological basis to the race concept (see, e.g., Schwartz 2001; Duster 2005, 2006; Krieger 2000; Ossorio 2005). All of these authors criticise the above studies and the geographic clusters of genetic variation they identify, mainly because of flaws in the way samples are collected (see, e.g., Duster 2015) and how the data is ultimately analysed. The latter criticism has mainly focused on the program ‘Structure’ that is used by a majority of the studies mentioned above to churn out clusters of genetic variation (Bolnick 2008; Kalinowski 2011; Fujimura et al. 2014). A telling criticism is that while Structure can be made to report that there are five main geographical clusters that show distinctive allele frequencies and which roughly match traditional notions of race (African, Asian, European, etc.), the program can equally be set up to report any arbitrarily selected number of genetically different groups, as the user has to specify the number of clusters they are looking for before the Structure program is applied to any actual dataset.
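
This criticism can be illustrated with any clustering method that requires the user to fix the number of clusters in advance: the algorithm will dutifully return that many groups whether or not the data has that many natural divisions. The toy example below uses k-means on made-up one-dimensional allele-frequency profiles; note that this is k-means, not the Bayesian model that Structure actually implements, so it illustrates the general point rather than Structure itself.

```python
def kmeans_1d(points, k, iterations=20):
    """Partition 1-D points into exactly k clusters (toy k-means)."""
    centers = points[:k]  # naive initialisation: first k points as centers
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest current center.
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Recompute centers; keep the old center if a cluster went empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

# Invented data with three visually obvious groups:
data = [0.10, 0.12, 0.11, 0.50, 0.52, 0.51, 0.90, 0.91, 0.89]
# Ask for 3 clusters and you get 3; ask for 5 and you get 5 (some of which
# split the natural groups) -- the number is an input, not a discovery:
print(len(kmeans_1d(data, 3)))  # 3
print(len(kmeans_1d(data, 5)))  # 5
```

The output simply mirrors the user's choice of k, which is the shape of the worry raised about Structure-based claims of five race-like clusters.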

Two interesting aspects of these discussions are that they a) usually only deal with one way of analyzing the biological reality of race classifications (as genetic) and b) adhere to a sharp distinction between race as biological reality or as social construct. Regarding a), several philosophers of biology have come up with alternative ways of thinking about a biological basis for race (for instance race as clades (Andreasen 1998), inbred lines (Kitcher 1999), or ecotypes (Pigliucci & Kaplan 2003)). This expansion of concepts brought with it the question of classificatory monism vs. pluralism, i.e., the question whether there is one privileged way of classifying race that somehow captures the ‘true nature’ of races (natural kinds) or whether there are several ways of doing so, depending on theoretical or practical interests/context (Gannett 2010). As Gannett argues, however, this focus on the monism/pluralism debate and on natural kinds comes at a cost, as it can mean that questions of practical significance are systematically ignored (2010). Regarding b), Gannett points out that drawing a sharp distinction between race as social construct or biological reality has not only been proven meaningless by recent work in population genetics but can also mean that the much messier reality of human history and diversity on this planet (and the complex interactions between scientific and social concepts of race) is being overlooked, leading to an impoverished analysis of the problems at hand (Gannett 2010).

3.2 Metagenomics

Metagenomics (also referred to as ‘environmental’ or ‘community’ genomics) is a research field that aims to analyse the collective genomes of microbial communities. These communities are usually extracted from environmental samples, ranging from soil to water or even air samples. A major advantage of metagenomics is that it does not rely on techniques for culturing microbes. This is important because only an estimated 1%–5% of all microbes can be cultured at all (Amann et al. 1995), an issue that has been referred to as the ‘great plate count anomaly’ (Staley & Konopka 1985).[23]

The term ‘metagenomics’ was first coined in 1998 (Handelsman et al. 1998). The prefix ‘meta’ in ‘metagenomics’ can be read in at least three different ways (O’Malley 2013): 1) As referring to the fact that metagenomics transcends culturing limitations. 2) As emphasising the aggregate-level approach to biology that characterises metagenomics (looking beyond single entities (cells or genomes)). And 3) as referring to the goal of creating an overarching understanding of the genomic diversity of the microbial realm.

The methodology of metagenomics can be described as a four-step process, consisting of: 1) the collection of environmental samples, 2) the isolation of microbial DNA from these samples, 3a) the direct analysis of the DNA or 3b) the creation of a genomic DNA library by fragmentation and insertion of the sampled DNA into suitable vectors (for instance plasmids that can be propagated in laboratory bacterial strains). These genomic libraries can then be used to 4a) sequence the sampled genomic DNA or 4b) perform a functional screen of it. As the distinction between steps 4a) and 4b) already implies, metagenomics can be divided into a sequence- and a function-based approach (Gabor 2007; Sleator et al. 2008). In the former the collected DNA is sequenced so that potential genes present in the sample can be identified and, if feasible, the genomes of all the microbes that were present in the sample can be reconstituted.

The sequence-based approach is feasible due to the vastly reduced costs of sequencing and the increased computing power available. The goal of the approach is to get an idea of the diversity and distribution of microbes present in the sample and to also get an insight into their functioning (for instance by identifying metabolism-related enzymes that can give clues about the metabolic pathways active in the different microbes). This can give insights into the workings of the microbial ecosystem present in the sampled environment more generally.

In the functional approach the fragments of DNA that are stored in the library are used in what is often called a ‘functional screen’. To perform such a screen the researchers introduce the library plasmids into specific bacterial strains which then read and express any protein-coding sequence that might be present on the fragments, thereby producing the protein(s) the fragment codes for.[24] The key to a functional screen is to create conditions in which only those bacteria that express a protein with the function of interest can be singled out (for instance by making sure that only those cells survive). Once the cells are singled out the library plasmid they contain can be recovered and sequenced allowing the researcher to identify the protein(s) encoded by that fragment. Functional metagenomics is often used to identify novel microbial proteins that can be used in biotechnological and pharmaceutical contexts and it is not surprising that metagenomics was and still is of great interest to the biotechnological sector (Streit & Schmitz 2004; Lorenz & Eck 2005; Culligan et al. 2014; Ekkers et al. 2012).

One of the first actual (sequence-based) metagenomics projects was performed (yet again) by one of the pioneers of genomics, Craig Venter. The goal of Venter and his team was to sample microbes from the surface of the nutrient-poor Sargasso Sea (Venter et al. 2004). This particular environment was chosen for this pilot study because it was expected to have a microbial community with relatively low diversity. This assumption turned out to be wrong and the project identified more than a million putative protein-coding sequences derived from at least 1800 different genomic species extracted from the sea water.

Another early metagenomics study consisted of the analysis of an acidophilic biofilm with low microbial diversity from an acid mine drainage site in California (Tyson et al. 2004). The analysed biofilm survives in one of the most extreme environments, with a very low pH (i.e., high acidity), relatively high temperature and high concentrations of metals. Importantly, this specific biofilm truly displays low complexity as it is composed of only three bacterial and two archaeal species. This simplicity greatly aided the analysis effort and allowed the researchers an almost complete recovery of two of the genomes and a partial recovery of the other three.

There have been many other metagenomics studies conducted since and there is little point in listing them here, as the list is growing by the month. One aspect of the ongoing research that is important to point out, however, is that the projects are becoming increasingly ambitious. The trend now is not just to have an integrated view of the genomes but to combine metagenomics with other techniques such as metabolomics (the assay of small molecules present in a system), metatranscriptomics (the analysis of all RNA transcripts of a community of microbes) and viromics (the analysis of all the viral genomes present in the system of interest) (see Turnbaugh & Gordon 2008; Bikel et al. 2015). In a sense the field is moving towards a highly integrated meta-metagenomics approach (Dupré & O’Malley (2007) talk of “metaorganismal metagenomics”). This is also in line with the general trend towards big-data and discovery-based approaches in the life sciences (Ankeny & Leonelli 2015; Dolinski & Troyanskaya 2015; Leonelli 2014, 2016).

The rise of metagenomics is also linked to other changes in biological sciences more generally, especially the rise of systems biology starting around the year 2000 (which is itself closely linked to the development of genomics since the 1990s). O’Malley and Dupré (2005) point out that there is an important distinction to be made when looking at fields like systems biology, because there is not only a change in epistemology but also one in ontology. They therefore distinguish between pragmatic and systems-theoretic biologists. For the former, the idea of a ‘system’ is merely an epistemic tool. For the latter, the system becomes the new fundamental ontological unit. Doolittle and Zhaxybayeva (2010) claim that the same can be seen in metagenomics where there is a drive to see the community or the ecosystem as the new fundamental unit, and not the single species (see also Dupré & O’Malley 2007).

Moving away from a focus on single organisms or monogenomic species allows us to make better sense of many recent findings in microbiology (in which metagenomics has played a key role). Central to all of this are mobile DNA elements that can travel horizontally, meaning between different members of a community (including between different kinds of organisms). Obtaining such mobile DNA elements can have a crucial effect on the survival and reproduction capacity of the recipient cell. Mobile DNA can therefore be a key element in the evolutionary processes as it becomes a ‘communal resource’ (McFall-Ngai et al. 2013). Acquired antibiotic resistance is only one of many benefits cells are known to obtain through acquired DNA elements.

It is then the composition of functional elements that the community as a whole contains which is preserved over evolutionary time. And the community could be seen as an assembly of biochemical activities and not of distinct microbial lineages (see for instance Turnbaugh et al. 2009 and also Burke et al. 2011). The metagenome then becomes a ‘genome of communities’ and not a ‘community of genomes’ (Doolittle & Zhaxybayeva 2010). All of this also feeds into the more general, and currently very active, discussion about the problem of individuality in biology (Clarke 2010; Bouchard & Huneman 2013; Ereshefsky & Pedroso 2013; Guay & Pradeu 2015; SEP entry on the biological notion of individual).

Apart from these issues in biological ontology, there are also epistemological issues raised by metagenomics, namely the discrepancy between our ability to sequence DNA and our ability to interpret it. These discussions about the challenges of DNA sequence interpretation are not just a problem for (meta)genomics and other -omics approaches, but also for biomedicine more generally and its push towards a truly personalised medicine. A key issue for this push is the discrepancy between the (ever-decreasing) costs of obtaining a personal genome sequence (Bennett et al. 2005; Mardis 2006; Check 2014a,b) and the high costs of making sure the data can be appropriately interpreted (Mardis 2006; Sboner et al. 2011; Phillips et al. 2015). This problem is related to the so-called ‘bioinformatic bottleneck’: it is the handling and the interpretation of the large amounts of sequence data, rather than their generation, that provides the main obstacle to progress (Green et al. 2011; Desai et al. 2012; Scholz et al. 2012; Marx 2013). In the era of next-generation sequencing, the sequencing step itself is no longer the rate-limiting step.

4. Outlook

Genomics is now an integral part of all of the life sciences. Not that every life scientist is now a genomicist—there are still researchers who focus on the biochemistry, development, or the molecular networks of human cells and other organisms. But the DNA sequences of the human genome and the numerous model organisms that came out of the HGP enter every laboratory, if not on a daily basis then at least at some stage of every research project. The same applies to the maps of genetic variation that were discussed in Section 3 and to the (somewhat controversial) data on functional DNA elements that the ENCODE project generated (see the supplementary document The ENCODE Project and the ENCODE Controversy).

And it is not just the quantity of data and the many new “-omes” that researchers now work with that have transformed the science. As we have pointed out in several places, insights into the genome and its functioning have, over the course of the last few decades, transformed researchers’ understanding of the entities and processes they are working with. Part of this was also a transformation in our understanding of what it means to do ‘good’ science. What the HGP and its various offshoots have achieved, therefore, is to change the life sciences at the epistemological, the ontological and also the methodological level.

As so often, an interesting and even pressing question is where all of this is going. Predicting the future might not be possible, but there are trends that can be identified and which can be expected to follow a similar trajectory in the near future. One such trend is the drive for big data. ‘Big’ here refers not only to the quantity but also to the different types of data collected. A derivative of this big-data drive is the goal to integrate all of the diverse data and mould it into models that can further our understanding of biological systems and the prediction of their behaviour. The relatively young discipline of systems biology, which could not be discussed in detail in this entry, will certainly play a key role in this endeavour.