Abstract Ten years on from the finishing of the human reference genome sequence, it remains unclear what fraction of the human genome confers function, where this sequence resides, and how much is shared with other mammalian species. When addressing these questions, functional sequence has often been equated with pan-mammalian conserved sequence. However, functional elements that are short-lived, including those contributing to species-specific biology, will not leave a footprint of long-lasting negative selection. Here, we address these issues by identifying and characterising sequence that has been constrained with respect to insertions and deletions for pairs of eutherian genomes over a range of divergences. Within noncoding sequence, we find increasing amounts of mutually constrained sequence as species pairs become more closely related, indicating that noncoding constrained sequence turns over rapidly. We estimate that half of present-day noncoding constrained sequence has been gained or lost in approximately the last 130 million years (half-life in units of divergence time, d 1/2 = 0.25–0.31). While enriched with ENCODE biochemical annotations, much of the short-lived constrained sequences we identify are not detected by models optimized for wider pan-mammalian conservation. Constrained DNase 1 hypersensitivity sites, promoters and untranslated regions have been more evolutionarily stable than long noncoding RNA loci which have turned over especially rapidly. By contrast, protein coding sequence has been highly stable, with an estimated half-life of over a billion years (d 1/2 = 2.1–5.0). From extrapolations we estimate that 8.2% (7.1–9.2%) of the human genome is presently subject to negative selection and thus is likely to be functional, while only 2.2% has maintained constraint in both human and mouse since these species diverged. These results reveal that the evolutionary history of the human genome has been highly dynamic, particularly for its noncoding yet biologically functional fraction.

Author Summary Nearly 99% of the human genome does not encode proteins, and while there recently has been extensive biochemical annotation of the remaining noncoding fraction, it remains unclear whether or not the bulk of these DNA sequences have important functional roles. By comparing the genome sequences of different species we identify genomic regions that have evolved unexpectedly slowly, a signature of natural selection upon functional sequence. Using a high resolution evolutionary approach to find sequence showing evolutionary signatures of functionality we estimate that a total of 8.2% (7.1–9.2%) of the human genome is presently functional, more than three times as much than is functional and shared between human and mouse. This implies that there is an abundance of sequences with short lived lineage-specific functionality. As expected, most of the sequence involved in this functional “turnover” is noncoding, while protein coding sequence is stably preserved over longer evolutionary timescales. More generally, we find that the rate of functional turnover varies significantly across categories of functional noncoding elements. Our results provide a pan-mammalian and whole genome perspective on how rapidly different classes of sequence have gained and lost functionality down the human lineage.

Citation: Rands CM, Meader S, Ponting CP, Lunter G (2014) 8.2% of the Human Genome Is Constrained: Variation in Rates of Turnover across Functional Element Classes in the Human Lineage. PLoS Genet 10(7): e1004525. https://doi.org/10.1371/journal.pgen.1004525 Editor: Mikkel H. Schierup, Aarhus University, Denmark Received: December 4, 2013; Accepted: June 5, 2014; Published: July 24, 2014 Copyright: © 2014 Rands et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Funding: CMR, SM and CPP are funded by the UK Medical Research Council, and CPP and SM are also funded by the EU (Gencodys). Additional funding was provided by the ERC. GL was funded by The Wellcome Trust grant 090532/Z/09/Z. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist.

Introduction “What proportion of the human genome is functional?” remains a contentious question [1]–[3]. In great part this reflects the use of definitions of ‘function’ that differ from the traditional definition that is based on fitness and selection (see e.g. [4] for a discussion). For instance, equating functionality with annotation by at least one of the ENCODE consortium's biochemical assays [5] results in approximately 80% of the human genome being labeled as functional [1], [6]. While this approach has the advantage of being empirical, it makes the definition of functionality dependent on the choice of experiments and details such as P value cutoffs. It is also questionable whether, for instance, introns should be classified as functional based merely on their transcription [2], [4]. By contrast, evolutionary studies often equate functionality with signatures of selection. While it is undisputed that many functional regions have evolved under complex selective regimes including selective sweeps [7] or ongoing balancing selection [8], [9], and it appears likely that loci exist where recent positive selection or reduction of constraint has decoupled deep evolutionary patterns from present functional status [10], [11], it is widely accepted that purifying selection persisting over long evolutionary times is a ubiquitous mode of evolution [12], [13]. While acknowledging the caveats, this justifies the definition of functional nucleotides used here, as those that are presently subject to purifying selection. This is of course not useful as an operational definition, as selection cannot be measured instantaneously. Instead, most studies define functional sites as those subject to purifying selection between two (or more) particular species. Studies that follow this definition have estimated the proportion of functional nucleotides in the human genome, denoted as α sel [14], [15], between 3% and 15% ([3] and references therein, [16]). Since each species' lineage gains and loses functional elements over time, α sel needs to be understood in the context of divergence between species. The divergence influences the estimate of α sel in two ways. On the one hand, constrained sequence between closely related species, including lineage-specific constrained sequence, is harder to detect than more broadly conserved sequence because of a paucity of informative mutations, which reduces detection power. On the other hand, estimates of constraint between any two species will only include sequence that was present in their common ancestor and that has been constrained in the lineages leading up to both extant species' genomes, with the consequence that turnover of functional sequence leads to diminishing α sel estimates as the species divergence increases. Assuming that the first effect can be controlled for, higher estimates of sequence constraint that are obtained between more closely related species [15], [17] are thus indicative of the turnover of functional sequence [15]. Here we understand turnover to mean the loss or gain of purifying selection at a particular locus of the genome, when changes in the physical or genetic environment, or mutations at the locus itself, cause the locus to switch from being functional to being non-functional or vice versa. Two previous studies have made quantitative estimates of the overall rate of turnover ([15], [17], reviewed in [3]). The estimate by Smith et al. (2004) [17] was derived from an analysis of point mutations in alignments across a 1.8 Mb genomic region. While a high rate of turnover was inferred, the authors emphasised the preliminary nature of their work as a consequence of the limited amount of data available to them at that time. Later, Meader et al. (2010) [15] performed genome-wide analysis with a neutral indel model (see [18], here referred to as NIM1) to estimate the fraction, termed α selIndel , of human sequence that was constrained with respect to insertions or deletion mutations (indels). This study also found a high rate of turnover, and estimated using two ad hoc heuristic approaches that 6.5–10% of the human genome is functional. Extrapolations using these data subsequently suggested that 10–15% of the human genome is presently functional [3]. NIM1 is a quantitative model describing the distribution of distances between neighbouring indels (intergap segments; IGSs) in neutrally evolving sequence, which provides an excellent description of the observed frequency of medium-sized IGSs. However, across whole genome alignments longer IGSs are strikingly overrepresented compared to this expectation under neutrality, presumably as a result of the presence of functional genomic segments under purifying selection in which indel mutations are unlikely to become fixed. By quantifying this overrepresentation it is possible to estimate α selIndel , the fraction of nucleotides contained within these functional segments. The model (which also accounts for G+C content and sex chromosome-dependent mutational biases) performs well for simulated data, and accurately identifies coding regions and ancestral repeats as highly conserved and neutrally evolving, respectively [15], [18]. However, some concerns about the model's derivation and the quality of whole-genome alignments we used were subsequently brought to our attention, which motivated us to initiate this study. Here we present improved methods for the estimation of α selIndel and the inference of functional turnover, building on our previous approaches [15], [18]. We apply these improved approaches to pairwise alignments between the genomes of diverse eutherian mammals, and we estimate that 7.1–9.2% of the human genome is presently subject to purifying selection, equating to 220–286 Mb of constrained sequence. We also take advantage of the additional high-quality eutherian genome sequences that have become available since our previous study to provide improved estimates of the rate of turnover of functional sequence in these species. Improvements in biological and biochemical annotation of genomic sequence mean that we can investigate turnover rates within particular classes of functional elements, such as coding sequences, DNase 1 hypersensitivity sites (DNase HSs), transcription factor binding sites (TFBSs), enhancers, promoters, and long noncoding RNAs (lncRNAs). We find striking differences between the functional element classes; in particular constrained coding sequences are much more evolutionary stable than constrained noncoding sequences, and lncRNAs show the most rapid rate of turnover of all the noncoding element types.

Discussion The question of what fraction of the human genome sequence are mutations preferentially purged owing to their deleterious effect has remained contentious ever since the first estimate was made in 2002 [22]. At that time it was not well appreciated that the amount of human constrained sequence that is also constrained in mouse is a minority (69 Mb; this study) of all human constrained sequence, owing to the relatively rapid gain and loss of functional sequence in their two lineages since their last common ancestor. We find that NIM1-constrained sequence lacking evidence for pan-mammalian conservation is enriched for sequences with experimental evidence for biochemical activity, and we provide a detailed argument indicating that this is incompatible with the notion of technical artefacts causing the observed signature of turnover (Text S6). Extensive simulations indicating that estimates of constrained sequence are consistent across the divergence range we investigate further support this conclusion. Our estimate that 7.1–9.2% of human genomes is subject to contemporaneous selective constraint considerably exceeds previous estimates and falls short of others [3], [23]. We have shown that our method's previous estimates for specific species pairs, as well as the calculation that suggested 10–15% of the human genome is currently under negative selection were inflated [3], in large part owing to inaccuracies in whole genome alignments upon which our estimates were based. The problems associated with using whole-genome alignments could be circumvented entirely by instead using polymorphism data within a single species. However, this approach is technically highly challenging, and results have so far been controversial [16], [24], [25]; in addition this approach is not informative about functional turnover. Other published estimates [12], [18], [26] are lower because they, by design, were not sensitive to lineage-specific constrained sequence. Our current estimates have their own particular caveats. While our results show that turnover is a real and substantial effect, simulations show that NIM1 underestimates the true amount of mutually constrained sequence to an extent that shows some dependence on the divergence. While simulations and theory indicate that point estimates of constraint remain conservative, the possibility of an upward bias in the inferred rate of turnover cannot be excluded, which in turn could lead to upwardly biased extrapolations of present-day constraint. In addition, the assumptions of the turnover model, in particular that all elements within a class are subject to the same rate of turnover, clearly are only approximately valid. These potential sources of error are not reflected in our confidence estimates (Table S6). Our estimate that 7.1%–9.2% of the human genome is functional is around ten-fold lower than the quantity of sequence covered by the ENCODE defined elements [1], [5], [6]. This indicates that a large fraction of the sequence comprised by elements identified by ENCODE as having biochemical activity can be deleted without impacting on fitness. By contrast, the fraction of the human genome that is covered by coding exons, bound motifs and DNase1 footprints, all elements that are likely to contain a high fraction of nucleotides under selection, is 9%. While not all of the elements in these categories will be functional, and functional elements will exist outside of these categories, this figure is consistent with the proportion of sequence we estimate as being currently under the influence of selection. As expected, turnover has occurred least in protein coding sequence, and thus has been most concentrated on noncoding sequence (Figure 4). For example, of the 43.5 Mb of sequence annotated by the ENCODE project as being within a human TFBS peak and that we find to be constrained (19.3% of the total extent of ENCODE TFBS peaks), only a third (30.6%; 13.3 Mb) is identified by NIM1 as being constrained in both human and mouse. A slightly higher proportion (45.6%; 19.8 Mb) is constrained in human and dog, presumably reflecting these species' lower divergence. These estimates are in good agreement with previous experimental findings: for instance 23–41% of TF binding events have been found to be conserved across human, dog and mouse for four liver TFs [27], while for two additional liver TFs, 7–14% of TF binding events are shared between human and mouse, and 15–20% between human and dog [28]. The phenomenon of turnover is well supported by both anecdotal evidence [27]–[29] and by broader studies of particular classes of elements, mostly TFBSs and enhancer elements [30]–[32]. The class of functional element inferred to turnover fastest was that of lncRNAs, again consistent with observations that most human lncRNAs are primate-specific and only 19% of lncRNAs are conserved over more than 90 My [33]. What our approach cannot clarify is to what extent the observed turnover at the sequence level amounts to different sequences encoding equivalent function [29], [30], or species-specific functional change [16], [31], [34]. Several lines of evidence, both from anecdotal [29] and broader [30], [31] studies of TFBSs, indicate that a large fraction of sequence changes involving TFBSs preserve function. For example, some deeply conserved transcription factors have species-specific binding sites in the vicinity of orthologous genes [27], [28] implying that despite their sequence divergence, the different DNA binding sites confer equivalent functions (on orthologous genes) in different lineages. Comprehensive studies of human and mouse embryonic heart enhancers found these to be weakly conserved [35], [36], despite human enhancers sequences largely driving expected tissue-specific expression in mouse embryonic heart tissue [36]. Another study found that two mammalian hypothalamic enhancers have no homolog across non-mammalian vertebrates, yet are still able to drive specific expression patterns in zebrafish neurons [37]. These findings are consistent with gene expression evolution being shaped predominantly by stabilizing selection on the expression level [38], while evolution on the sequence level may involve an interplay between fixation of weakly deleterious mutations through drift, and weak positive selection on compensatory mutations [39]. However, not all TFBS turnover events are neutral or nearly neutral on the level of gene expression, and the fraction of such events that change gene expression may be substantial [31]. More generally, lineage-specific sequence is clearly a likely substrate for lineage-specific biology [16], [34], although adaptations to pre-existing functional sequence remain an alternative plausible mode for creating species-specific change [40]. Nevertheless, the sheer ubiquity of sequence turnover, and the clear potential for substantial regulatory change resulting from it, suggests that many aspects of noncoding human biology will not be fully recapitulated by orthologous sequence in eutherian model organisms, including mouse. Thus, our findings could provide a more quantitative basis for assessing the relevance of model organisms to specific questions of human biology.

Materials and Methods Sequence data We restricted our analyses to genome assemblies that have been sequenced at relatively high coverage, not using for example the 2-fold coverage assemblies of mammalian genomes [41], to minimize the impact of sequencing and assembly errors. From the UCSC Genome Informatics website (http://genome.ucsc.edu/), we acquired softmasked versions of the following genome assemblies: human (hg19), mouse (mm10, mm9, and mm8), rat (rn5), cattle (bosTau7), dog (canFam2), horse (equCab2), guinea pig (cavPor3), rabbit (oryCun2), bushbaby (otoGar3), panda (ailMel1), and rhino (cerSim1). We also acquired a Ferret genome assembly (M_putorius_furo_v1) produced by the Broad Institute. We softmasked the ferret genome assembly using RepeatMasker with carnivore repeat libraries [42]. Alignment construction and trimming When available, whole genome pairwise alignments were downloaded from the UCSC Genome Informatics website (http://genome.ucsc.edu). Otherwise, we constructed alignments following UCSC's protocol [43]. Initial alignments were constructed with LASTZ (http://www.bx.psu.edu/miller_lab/), a derivative of BLASTZ [44], and these alignments were subsequently chained and netted using tools from UCSC (Table S1 for alignment parameterisations). We trimmed each of the whole genome alignments once we found that UCSC alignments contained a minority of poorly aligning sequence (Figure S1, Table S2). Each alignment was rescored to generate a new substitution matrix using a log-odds ratio approach as described previously [45]. We did not impose symmetry on the scoring matrixes with respect to strand or species. We then used the generated substitution matrix, with gap penalties derived from the original alignments, to discard (“trim”) the maximal non-positively scoring terminal segments of the alignment blocks and any non-positively scoring inter-gap segments. Trimming removes terminal and internal alignment segments that are more likely to have arisen under a model of independent evolution than of evolution from a common ancestor. Subsequent analyses were carried out following the discarding of all trimmed sequence. We also excluded alignments that were led by sequence not mapped to chromosomes. We did not exclude non-reciprocally aligning sequence or sequence that lay within known indel hotpot locations as we found removing such sequence had relatively small effects on estimates of α selIndel (Table S3). An updated Neutral indel model 1 (NIM1) The neutral indel model of Lunter et al. (2006) [18] (NIM1) estimates the genomic fraction (α selIndel ) of sequence constrained with respect to indels between a species pair. The model examines the distribution of IGSs from a set of whole genome pairwise alignments using a regression approach over a range of medium IGS lengths to estimate the parameters of a predicted geometric distribution of IGSs in neutral sequence. α selIndel in bp is then estimated by summing up the quantity x - 2K over all the long IGSs inferred to be in excess of predictions under neutral evolution. Here where x is the length of the overrepresented IGS, and K is the estimated mean spacing between indels (“neutral overhang”). 20 equally populated G+C content bins are analysed separately to account, in part, for mutational variation that correlates with G+C content. The X chromosome is also analysed separately. A detailed description of the model is given in the original publications [15], [18]. However, two theoretical issues of the model have not been described previously. These are: (A) that thresholding biases the expected lengths of the neutral overhang and, (B) that neutral segments are depleted from the background distribution due to the presence of constrained segments, changing the expected neutral distribution of IGS lengths; resolution of the two issues is described in Text S1. Our implementation of the NIM1 differs from that of the preceding studies in the manner in which we calculate the bounds of the estimates. The previous approaches constructed the upper and lower bound estimates based on the uncertainty in the degree of clustering of functional elements. The lower bound estimate was derived assuming that functional elements are unclustered (each overrepresented IGS contributes x - 2K bp towards the α selIndel estimate), while the upper bound was derived assuming a high degree of clustering (each overrepresented IGS contributes x - K bp). In our revised approach, we construct a 95% confidence interval around the lower x - 2K bp estimate. The impact of this change on α selIndel estimates can be seen in the simulation study (Table S5). We made this conservative modification to the NIM1 for five reasons: Firstly, the previous upper bound estimate assumes an unrealistically high degree of clustering of functional elements. Secondly, only our modified estimate is always conservative under all the simulation scenarios, whereas the previous implementation of the NIM1 sometimes overestimates the true value of α selIndel (Table S5). Thirdly, altering the clustering of functional elements in the simulations actually has only a minor effect on the estimated quantities of constrained sequence (Figure S11). Fourthly, in addition to the clustering of functional elements, other parameterisations also influenced α selIndel estimates (Table S5), yet the uncertainty in the values of these parameters was not also incorporated into the NIM1 estimate. Instead, we now choose to incorporate the full extent of uncertainty into the simulations. Finally, by providing a 95% confidence interval for the α selIndel estimate of NIM1, we have an estimate that is directly comparable to the NIM2 estimates. Estimating the fraction of constraint in subsets of the genome We have described above how NIM1 is used to estimate the fraction α selIndel of constrained bases within a genome G consisting largely of neutrally evolving sequence. To estimate α selIndel within a subset S⊆G that is not dominated by neutrally evolving sequence, for instance when estimating α selIndel within coding sequence, we instead estimate α selIndel within the subsets G and G\S; the difference between the resulting estimates is the estimate of α selIndel within S. Estimating the neutral substitution rates We extracted ancestral repeat (AR) alignments from the trimmed whole genome alignments using RepeatMasker annotations to identify transposable element and repeat-derived sequence [42]. We then calculated the substitution rate for the alignments using the HKY85 model applied in the PAML package BASEML [46]. We also estimated synonymous substitution rates (dS) across protein coding regions for some species pairs. Estimates of dS for a species pair were made by calculating the median dS of all one-to-one gene orthologs in the Ensembl Compara database with dS<1. Nucleotide substitution rates in AR sequences are very similar to estimates of the synonymous substitution rate (dS) (Figure S5), hence our results appear insensitive to the choice of neutral sequence standard. Modelling turnover The time-homogeneous turnover model makes the following assumptions: for a particular class of functional elements, both the total amount of functional sequence and the rate of turnover are constant in time, and the turnover rate (weighted by the length of the elements) is identical for all elements in the class. Specifically, within a class of functional sites comprising a nucleotides, in a small time interval dt a number a b dt of sites dispense with function, while an identical number gain function. Note that to arrive at this result we make an “infinite sites” assumption, namely that the genome can be considered infinitely large compared to a; otherwise one would need to account for reversions back to functionality of neutral but previously functional material. Fitting the data to this model under the assumption of independent normally distributed errors in the observations provides estimates and error bounds on parameters a and b. Annotations Coding sequence for human (hg19), mouse (mm10), and dog (canFam2) and UTR annotations for human (hg19) were obtained from Ensembl version 72 (http://www.ensembl.org/index.html). UTR sequence that overlapped coding sequence was not considered in the UTR analyses. Human (hg19) PhastCons conserved elements were taken from the vertebrate PhastConsElements46way track downloaded from UCSC Genome Informatics (http://genome.ucsc.edu/). Human (hg19) GERP++ conserved elements were downloaded from the Sidow laboratory website (http://mendel.stanford.edu/SidowLab/downloads/gerp/). Repetitive element annotations for all species were taken from RepeatMasker [42]. Other human (hg19) annotations were taken from the ENCODE data available at UCSC Genome Informatics (http://genome.ucsc.edu/ENCODE/). Specifically, the TFBS data and DNase HS data were acquired from the ENCODE clustered merged sets (wgEncodeRegTfbsClusteredV2.bed and wgEncodeRegDnaseClusteredV2.bed respectively). Promoter and enhancer elements were extracted from the ENCODE HMM Chromatin State segmentations tracks, and merged across these samples: wgEncodeBroadHmmGm12878HMM.bed.gz, wgEncodeBroadHmmH1hescHMM.bed.gz, wgEncodeBroadHmmHepg2HMM.bed.gz, wgEncodeBroadHmmHmecHMM.bed.gz, wgEncodeBroadHmmHsmmHMM.bed.gz, wgEncodeBroadHmmHuvecHMM.bed.gz, wgEncodeBroadHmmK562HMM.bed.gz, wgEncodeBroadHmmNhekHMM.bed.gz, and wgEncodeBroadHmmNhlfHMM.bed. We display the results from analysis of the set of Hangauer et al. (2013) [21] lncRNAs in Figure 3. We also used the smaller set of ENCODE lncRNAs in Figure S9.

Acknowledgments We are grateful to Phil Green for extensive discussions and comments which contributed several key ideas. We thank Wilfried Haerty and Yang Li for useful discussions. We are grateful to the Broad Institute for providing early access to their Ferret genome assembly.

Author Contributions Analyzed the data: CMR GL. Contributed reagents/materials/analysis tools: GL CMR SM. Wrote the paper: CMR CPP GL. Contributed key ideas and discussion points: CMR CPP GL.