To study the landscape of positive and negative selection in cancer, we applied these approaches to a collection of 7,664 tumors from 29 cancer types from TCGA (Table S1). Somatic mutations were re-called with our in-house algorithms across 24 cancer types to ensure comparability across tumor types and avoid biases from germline polymorphisms.

The first critical refinement is more comprehensive models for context-dependent mutational processes (). Traditional implementations of dN/dS use simplistic mutation models that lead to systematic bias in dN/dS ratios and can cause incorrect inference of positive and negative selection ( Figure S1 )—such biases have affected previous studies in this area (). Therefore, we use a model with 192 rate parameters that accounts for all 6 types of base substitution, all 16 combinations of the bases immediately 5′ and 3′ to the mutated base, and transcribed versus non-transcribed strands of the gene ( Figure S1 A). A second refinement is the addition of other types of non-synonymous mutations beyond missense mutations, including nonsense and essential splice site mutations (), and a method for small insertions and deletions (indels). Third, extreme caution was exercised during variant calling to avoid biases emerging from germline variants, because these have a much lower dN/dS ratio than somatic mutations. Misannotation of a germline polymorphism as a somatic mutation will bias somatic dN/dS downward; excessively filtering true somatic mutations that occur at positions known to be polymorphic in the population will bias somatic dN/dS upward ( Figure S1 B). For example, we have seen that germline contamination of the public mutation catalogs from several datasets in The Cancer Genome Atlas [TCGA], such as colorectal cancer and chromophobe renal cell carcinoma, generates a false signal of negative selection ( Figure S1 C). Fourth, to detect selection at the level of individual genes reliably, and particularly for driver gene discovery, we refined dN/dS to consider the variation of the mutation rate along the human genome. A simple way to do so is estimating a separate mutation rate for every gene (), but this approach has low sensitivity with typical sample sizes. 
Instead, we developed a statistical model (dNdScv) that combines the local observed synonymous mutation rate with a regression model using covariates that predict the variable mutation rate across the genome (). This approach has the advantage of optimizing the balance between local and global data on estimating background mutation rates to provide a statistically efficient inference framework for departures from neutrality ( Figure S2 ).
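The parameter count of the substitution model described above can be checked by direct enumeration. The following is a minimal sketch, assuming the conventional pyrimidine-centred notation for the six substitution types; the function name and the label format are illustrative and are not taken from dNdScv.

```python
from itertools import product

BASES = "ACGT"
# The 6 base substitution types, written from the pyrimidine of the base pair.
SUBSTITUTIONS = [("C", "A"), ("C", "G"), ("C", "T"),
                 ("T", "A"), ("T", "C"), ("T", "G")]
STRANDS = ("transcribed", "non-transcribed")

def enumerate_rate_classes():
    """List every (trinucleotide substitution, strand) class of the model."""
    classes = []
    for (ref, alt), five, three, strand in product(SUBSTITUTIONS, BASES, BASES, STRANDS):
        # Label such as "A[C>T]G": 5' base, substitution, 3' base.
        classes.append((f"{five}[{ref}>{alt}]{three}", strand))
    return classes

classes = enumerate_rate_classes()
print(len(classes))  # 6 substitutions x 16 contexts x 2 strands = 192
```

Each class would receive its own rate parameter, so that the expected numbers of synonymous and non-synonymous mutations in a gene reflect its sequence composition and the mutational spectrum of the cohort.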

(D–G) Gamma distributions and log-likelihood surfaces of dNdScv on a number of genes and datasets. (D,F) Density functions of the Gamma distributions for substitutions and indels inferred by the negative binomial regression in dNdScv for two datasets (Lung-SCC and Pancancer). The Gamma distributions shown have a mean = 1, showing the spread around the mean observed across genes in each dataset. This reflects the extent of the variation of the mutation rate across genes that remains unexplained by sequence composition, signatures and covariates. (E,G) Log-likelihood ratio values for the number of missense mutations in three genes (PTEN, CDKN2A and MUC16) in the Lung-SCC (n = 167 samples) and Pancancer datasets (n = 7,664) under dNdSloc and dNdScv. The real observed number of missense mutations in each gene and dataset is shown as a vertical green line. The figures show how in small genes and/or small datasets, dNdScv has much narrower curves and much more significant P-values for cancer genes thanks to the Gamma constraint, while dNdScv and dNdSloc converge when the local number of synonymous mutations is sufficiently high. This adaptive behavior of dNdScv results from the joint likelihood equation.

(C) Comparison of the number of significant genes found by dNdScv (top) and the indel model (bottom) in their default configuration (unique-sites model for indels) when including and excluding MSI samples.

(B) Sensitivity of dNdScv and dNdSloc. The bar plot depicts the number of significant genes (q-value < 0.05) identified by both methods in the 29 TCGA datasets. Bars colored in a lighter shade show the number of significant genes that are present in the Cancer Gene Census version 73 (). dNdScv shows good specificity and sensitivity under all tested conditions ( STAR Methods ).

(A) QQ-plots for the different dN/dS models on a neutral dataset obtained by randomization of 107 melanoma whole-genomes from ICGC ( STAR Methods ). The dNdSunif model shows a great inflation of low P-values, leading to a large number of false positives after multiple testing correction (368 genes with q-value < 0.05), and should be generally avoided. In contrast, both dNdSloc and dNdScv behave as expected for a neutral dataset, yielding no significant hits after multiple testing correction.

Evaluation of the Relative Performance of the Three Different dN/dS Models for the Detection of Positive Selection at Gene Level, Related to Figure 2

(F) Simulations demonstrating the validity of estimating dN/dS at a cohort level, in heterogeneous cohorts of samples without patient-specific substitution models. The three scenarios simulated include extreme examples of heterogeneous mixtures of samples with variable signatures, numbers of mutations and selection. In each scenario, the correct fraction of mutations removed by negative selection across samples is shown as a blue horizontal line (right y axis). Estimated dN/dS values from five simulations of each scenario are shown as dots with CIs (left y axis).

(E) Corresponding estimates of the average number of driver coding substitutions per tumor. For the purpose of estimating the excess of mutations from dN/dS ratios, dN/dS values below 1 are set to 1. Error bars depict 95% CIs.

(D) Consistency between genome-wide dN/dS estimates using the trinucleotide and pentanucleotide substitution models across cancer types. Green dots represent genome-wide dN/dS estimates for each cancer type separately, and the orange dot depicts the pancancer estimates (using the 24 cancer types with CaVEMan mutation calls).

(C) Percentage of mutations from the public TCGA catalogs of somatic calls that overlap a common dbSNP site. Based on simulations, an overlap of 1%–3% might be expected depending on the dominant mutational signatures present in a dataset, but several public TCGA catalogs show a much higher overlap suggesting extensive germline SNP contamination. As predicted from (B), this leads to an artifactual signal of negative selection in these datasets ( STAR Methods ).

(B) Simulations of the impact on dN/dS of germline SNP contamination and SNP over-filtering in catalogs of somatic mutations. 10 neutral datasets were generated by local randomization of 607 cancer whole-genomes (). Datasets with varying degrees of germline SNP contamination were simulated by adding 5% or 10% of germline common SNPs (minor allele frequency ≥ 5%) from 1000 genomes phase 3 () to the neutral simulations. Datasets with varying levels of SNP over-filtering were simulated by removing any mutation from the neutral datasets that overlapped a polymorphic site in dbSNP build 146 (either using common sites or all sites) ().
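The direction of the contamination bias can be illustrated with simple arithmetic rather than a full simulation. The sketch below is not the simulation procedure used for this panel. It assumes that somatic and germline variants share the same neutral ratio of non-synonymous to synonymous mutations (`neutral_ratio`, set to a hypothetical value of 3), and that the germline dN/dS of 0.1 is only an illustrative figure.

```python
def nonsyn_fraction(omega, neutral_ratio):
    """Fraction of coding mutations that are non-synonymous at a given dN/dS."""
    return omega * neutral_ratio / (1.0 + omega * neutral_ratio)

def contaminated_dnds(omega_somatic, omega_germline, contamination, neutral_ratio=3.0):
    """Apparent dN/dS when a fraction of the calls are germline variants."""
    p_somatic = nonsyn_fraction(omega_somatic, neutral_ratio)
    p_germline = nonsyn_fraction(omega_germline, neutral_ratio)
    # Mix the two sources in proportion to their share of the catalog.
    p_mix = (1.0 - contamination) * p_somatic + contamination * p_germline
    # Convert the mixed fraction back into a normalized ratio.
    return (p_mix / (1.0 - p_mix)) / neutral_ratio

for fraction in (0.0, 0.05, 0.10):
    print(fraction, round(contaminated_dnds(1.0, 0.1, fraction), 3))
```

Under these assumptions, a truly neutral somatic catalog yields an apparent dN/dS below 1 as soon as germline variants are mixed in, and the deficit grows with the contaminating fraction.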

(A) Impact of simplistic mutation models on the accuracy of dN/dS in different scenarios. Each boxplot represents the dN/dS ratios estimated from 100 neutral simulations of 10,000 random coding substitutions. To exemplify the impact on dN/dS of different mutational spectra, we simulated neutral datasets using the trinucleotide spectra observed in the three different cohorts of samples (pancancer, melanoma and lung adenocarcinoma). Different panels depict dN/dS ratios for missense (ω mis ) or nonsense (ω non ) mutations.

Building on previous work (), we use dN/dS, the normalized ratio of non-synonymous to synonymous mutations, to quantify selection in cancer genomes. This relies on the assumption that the vast majority of synonymous mutations are selectively neutral and hence a good proxy to model the expected mutation density (we address the accuracy of this assumption later; see also STAR Methods ). dN/dS has a long history in the study of selection in species evolution (), but several modifications are required for somatic evolution.
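As a minimal illustration of what "normalized" means here, the ratio of observed non-synonymous to synonymous mutations is divided by the ratio expected under neutrality. In the sketch below the expected values are supplied directly as hypothetical inputs; in practice they would come from a substitution model applied to the coding sequence. The function name and arguments are illustrative.

```python
def dnds(n_nonsyn, n_syn, exp_nonsyn, exp_syn):
    """Normalized ratio of non-synonymous to synonymous mutations.

    n_nonsyn, n_syn: observed mutation counts.
    exp_nonsyn, exp_syn: counts (or relative rates) expected under neutrality.
    """
    if n_syn == 0 or exp_syn == 0 or exp_nonsyn == 0:
        raise ValueError("synonymous counts and expectations must be non-zero")
    observed_ratio = n_nonsyn / n_syn
    expected_ratio = exp_nonsyn / exp_syn
    return observed_ratio / expected_ratio

# Three non-synonymous mutations per synonymous one, observed and expected.
print(dnds(300, 100, 3.0, 1.0))  # 1.0, i.e., no evidence of selection
```

Values above 1 indicate an excess of non-synonymous mutations relative to the neutral expectation, and values below 1 a deficit.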

Detection of selection in traditional comparative genomics typically requires a measure of the expected density of selectively neutral mutations in a gene. In the context of cancer, a gene under positive selection will carry an extra complement of driver mutations in addition to neutral (passenger) mutations—it is this recurrence of mutations across cancer patients that has underpinned discoveries of cancer genes from the Philadelphia chromosome to modern genomic studies (). A gene subject to purifying selection of deleterious mutations would have fewer mutations than expected under neutrality ().

In stark contrast, cancer evolution shows a pattern in which dN/dS ratios are close to, but slightly above, 1 ( Figure 1 B). This pattern is universally shared across tumor types studied here and applies to both missense and truncating substitutions (nonsense and essential splice site mutations). This indicates that mutations under positive selective pressure are somewhat more numerous in cancers than mutations under negative selection, but the overall picture is close to neutrality. Importantly, similar values of dN/dS around or above 1 are found in somatic mutations detected in healthy tissues, including blood, skin, liver, colon, and small intestine () ( Figure 1 C). Although these data are still limited, dN/dS∼1 appears to characterize somatic evolution in normal somatic tissues as well as all cancers that we have studied so far.

Comparative genomic studies of related species typically reveal very low dN/dS ratios, reflecting that the majority of germline non-synonymous mutations are removed by negative selection over the course of evolution (). For example, comparison of orthologous genes from Escherichia coli and Salmonella enterica yields an average dN/dS∼0.06 across genes. This indicates that at least ∼94% of missense mutations have been removed by negative selection. The dN/dS ratio for nonsense mutations in common human germline polymorphisms is similarly low (dN/dS∼0.08). dN/dS ratios vary across species but a pattern of overwhelming negative selection invariably characterizes species evolution ( Figure 1 A).

As expected, depending on whether nonsense or missense mutations predominate, genes generally fall into two classes: oncogenes, with strong selection on missense mutations, or tumor suppressor genes, with stronger selection on truncating mutations ( Figure 2 B). Significant dN/dS ratios reach very high values in frequently mutated driver genes, often higher than 10 or even 100 ( Figure 2 B). This gives quantitative information about the proportion of driver mutations. For example, dN/dS = 10 for a gene means that there are ten times more non-synonymous mutations in the gene than expected under neutral accumulation of mutations, indicating that at least ∼90% of the non-synonymous mutations in the gene are genuine driver mutations ().
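The arithmetic behind this lower bound is (ω − 1)/ω, where ω is the dN/dS ratio. It can be sketched as follows. Clamping values below 1 to 1 follows the convention stated in the figure legend for the estimates of driver substitutions per tumor; the function names are illustrative.

```python
def driver_fraction(omega):
    """Lower bound on the fraction of non-synonymous mutations that are drivers."""
    omega = max(omega, 1.0)  # a ratio below 1 implies no excess of mutations
    return (omega - 1.0) / omega

def excess_mutations(n_observed, omega):
    """Number of observed non-synonymous mutations in excess of neutrality."""
    return n_observed * driver_fraction(omega)

print(driver_fraction(10.0))  # 0.9
```

With dN/dS = 100, the same calculation gives a lower bound of 99%.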

By definition, cancer genes are genes under positive selection in tumor cells. To show the ability of dN/dS to uncover cancer genes, we used dNdScv to identify genes for which dN/dS was significantly higher than 1, both across all 7,664 cancers and for each tumor type individually ( Figure 2 A). This revealed 179 cancer genes under positive selection at 5% false discovery rate. Of these, 54% are canonical cancer genes present in the Cancer Gene Census (). Using restricted hypothesis testing () on a priori known cancer genes identifies an additional 24 driver genes. Evaluation of genes not present in the Census reveals that most have been previously reported as cancer genes, have been found in other pan-cancer analyses, or have clear links to cancer biology () ( Table S2 ). Novel candidate cancer genes include ZFP36L1 and ZFP36L2, which have recently been shown to promote cellular quiescence and suppress S-phase transition during B cell development (). We find higher than expected rates of inactivating mutations in the two genes in several tumor types, suggesting that they have a tumor suppressor role. Other novel tumor suppressor genes identified here include KANSL1, a scaffold protein for histone acetylation complexes (), BMPR2, a receptor serine/threonine kinase for bone morphogenetic proteins, MAP2K7, involved in MAP-kinase signaling, and NIPBL, a member of the cohesin complex. Several of these genes were identified in a previous pan-cancer analysis ().
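The gain from restricted hypothesis testing can be seen in a small example. The sketch below assumes that q-values are obtained with the Benjamini-Hochberg procedure, which is an assumption about the multiple testing correction rather than something stated in this section. The gene names and p values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values (q-values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonic q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

def significant(pvalues_by_gene, threshold=0.05):
    genes = list(pvalues_by_gene)
    qvalues = benjamini_hochberg([pvalues_by_gene[g] for g in genes])
    return [g for g, q in zip(genes, qvalues) if q < threshold]

# Hypothetical p values for four genes tested together.
all_genes = {"GENE_A": 0.0004, "GENE_B": 0.03, "GENE_C": 0.04, "GENE_D": 0.6}
# Restricting the test to genes known a priori reduces the correction burden.
known_genes = {g: all_genes[g] for g in ("GENE_A", "GENE_B")}

print(significant(all_genes))    # ['GENE_A']
print(significant(known_genes))  # ['GENE_A', 'GENE_B']
```

Because fewer hypotheses are tested, a gene with modest evidence (GENE_B here) can pass the 5% false discovery rate threshold in the restricted test while failing it in the exome-wide test.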

(A) List of genes detected under significant positive selection (dN/dS >1) in each of the 29 cancer types. Y axes show the percentage of patients carrying a non-synonymous substitution or an indel in each gene. The color of the dot reflects the significance of each gene. RHT, restricted hypothesis testing on known cancer genes ( Table S2 ).

Negative Selection Is Largely Absent for Coding Substitutions

While some somatic mutations can confer a growth advantage, others may impair cell survival or proliferation. Clones carrying such mutations would senesce or die, with the result that the mutation would be lost from the catalog of variants seen in the eventual cancer. This negative or purifying selection will lead to dN/dS <1 in a given gene or set of genes if it occurs at appreciable rates. Negative selection on somatic mutations has been long anticipated (Beckman and Loeb, 2005; McFarland et al., 2014; Nowell, 1976) but not yet reliably documented in cancer genomes. This is due to the fact that statistical detection of lower mutation density than expected by chance requires large datasets and very careful consideration of mutation biases and germline SNP contamination.

Figure 3. Negative Selection in Cancer

(A) Distributions of dN/dS values per gene for missense mutations in non-LOH regions. The real distribution is shown in gray and the distribution observed in a neutral simulation is shown in purple.

(B) Underlying distribution of dN/dS values across genes inferred from the observed distribution.

(C) Estimated percentage of genes under different levels of positive and negative selection based on the inferred dN/dS distribution in (B).

(D) Average number of selected mutations per tumor based on the inferred distributions of dN/dS across genes, combining missense and truncating mutations from all copy number regions. Error bars depict 95% CIs.

(E) Power calculation for the statistical detection of negative selection (dN/dS <1) as a function of the extent of selection (dN/dS) and the neutrally-expected number of mutations in a gene in a cohort. Shaded areas under the curves reflect power >80%. Vertical lines indicate the range in which the middle 50% and 95% of genes are in the dataset of 7,664 tumors.

(F) Average mutation burden in genes grouped according to gene expression quintile and chromatin state.

(G) Average dN/dS values for genes grouped according to gene expression quintile, chromatin state, and essentiality.

(H) Average dN/dS values for all mutations in genes found to be haploinsufficient in the human germline, including and excluding putative driver genes. Haploinsufficient genes are defined as those having a pLI score >0.9 in the ExAC database (Lek et al., 2016).

See also Figures S1 and S3.

Figure S3. Supplementary Analyses on Negative Selection, Related to Figure 3

(A–D) dN/dS distributions inferred for different mutation types and copy number states. These distributions, obtained as described for Figure 3C, represent the percentage of genes estimated to be under a certain selection regime. The four distributions correspond to: missense (A) and truncating (B) substitutions in regions without loss of heterozygosity, and missense and truncating substitutions in haploid regions (C and D, respectively). Note that (A) is an extension of Figure 3C, with an added middle bar for genes with dN/dS very close to 1 (0.9–1.1), which can be considered to evolve largely neutrally. Only samples with CaVEMan mutation calls, excluding melanoma samples, were considered for this analysis for the reasons explained in the Methods. For each figure, all mutations with the appropriate ploidy were included in the analysis and only genes with at least one mutation (either synonymous or non-synonymous) participate in the fitting of dN/dS distributions. Hence, the percentages of genes shown in the y-axes are relative to the total number of genes with at least one mutation in regions with the ploidy considered in each figure. Error bars depict 95% CIs.

(E) Gene ontology groups deviating significantly from neutrality after removing known cancer genes. 27 gene ontology classes are found to be under significant positive selection after comprehensively removing 987 known putative cancer genes. This suggests the presence of undiscovered cancer genes in these functional groups. No gene ontology class was found to be under significant negative selection. Error bars depict 95% CIs.

To determine the potential extent of negative selection, we first studied the distribution of observed dN/dS values per gene. There is considerable spread of these observed values around the neutral peak at dN/dS = 1.0 (Figure 3A), which at face value might suggest that many genes are under positive or negative selection. However, the limited numbers of mutations per gene make individual dN/dS values noisy, and we find that the observed distribution almost exactly matches that seen in simulations under a model where all genes are neutral. To formally estimate the fraction of genes under negative selection, we infer the underlying distribution of dN/dS values from the observed data using a binomial mixture model (Figures 3B and 3C). We find that the vast majority of genes are expected to accumulate point mutations near neutrally, with dN/dS∼1. A small fraction of genes (∼2.2%; 95% confidence interval [CI] = 1.0%–3.9%) show dN/dS ≥1.5, consistent with current estimates of the numbers of cancer genes. Only a tiny fraction of genes (∼0.14%; 95% CI = 0.02%–0.51%), equating to a few tens of genes, are estimated to exhibit negative selection with dN/dS ≤0.75 (Figures 3C and S3A–S3D).
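One plausible reading of a binomial mixture model for this purpose is sketched below; it is not the implementation described in the STAR Methods. Each gene's non-synonymous count is treated as binomial given its total number of mutations, with a success probability set by the gene's dN/dS, and the weights of a fixed grid of dN/dS values are fitted by expectation-maximization. The neutral non-synonymous to synonymous ratio of 3, the three-point grid, and the simulated cohort are all hypothetical choices.

```python
import random

def nonsyn_probability(omega, neutral_ratio):
    """Probability that a coding mutation is non-synonymous at a given dN/dS."""
    return omega * neutral_ratio / (1.0 + omega * neutral_ratio)

def fit_mixture_weights(counts, omegas, neutral_ratio=3.0, iterations=300):
    """Fit mixture weights by EM; counts holds (n_nonsyn, n_total) per gene."""
    probs = [nonsyn_probability(w, neutral_ratio) for w in omegas]
    # Likelihood of each gene under each component; the binomial coefficient
    # is omitted because it cancels when responsibilities are normalized.
    lik = [[p ** k * (1.0 - p) ** (n - k) for p in probs] for k, n in counts]
    weights = [1.0 / len(omegas)] * len(omegas)
    for _ in range(iterations):
        totals = [0.0] * len(omegas)
        for row in lik:
            joint = [w * l for w, l in zip(weights, row)]
            norm = sum(joint)
            for j, value in enumerate(joint):
                totals[j] += value / norm  # E-step: responsibility of component j
        weights = [t / len(lik) for t in totals]  # M-step: average responsibility
    return weights

# Simulate 1,000 neutral genes with 30 coding mutations each.
random.seed(0)
neutral_p = nonsyn_probability(1.0, 3.0)
genes = []
for _ in range(1000):
    k = sum(random.random() < neutral_p for _ in range(30))
    genes.append((k, 30))

weights = fit_mixture_weights(genes, omegas=[0.5, 1.0, 2.0])
print([round(w, 3) for w in weights])
```

Although the per-gene ratios in such a simulation scatter widely around 1, the fitted weights should concentrate on the neutral component, which mirrors the argument that the observed spread alone is not evidence of selection.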

These distributions also enable us to obtain approximate estimates of the average number of coding substitutions lost by negative selection per tumor (Figure 3D). On average, across this diverse collection of tumors, less than one coding substitution per tumor (0.55/patient; 95% CI = 0.31–1.16) has been lost by negative selection, accounting for <1% of all coding mutations. We note the formal possibility that dN/dS = 1 can occur when the numbers of positively and negatively selected mutations in a given gene are exactly balanced. This could lead us to underestimate the extent of negative selection but only if a large number of genes showed such an exact balance, which seems unlikely.

Although negative selection in cancers might be weak globally, it remains possible that negative selection may act in very specific scenarios, genes, or gene sets. No single gene had a dN/dS significantly <1 after multiple hypothesis testing correction, even if we boost our power by performing restricted hypothesis testing on 1,734 genes identified by in vitro screens as essential (Blomen et al., 2015). To address the possibility of making a type II inference error, we evaluated our statistical power to detect negative selection at the level of individual genes in this dataset (Figure 3E). We found that there is enough power to detect negative selection at dN/dS <0.5 on missense mutations for most genes in the genome, but we have less power for detecting negative selection acting on truncating mutations (Figure 3E). Thus, the lack of significant negative selection in any gene in the current dataset reveals that negative selection would be weaker than these detection limits.
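A simplified power calculation conveys why the neutrally expected number of mutations in a gene determines whether a deficit can be detected. The sketch below assumes Poisson-distributed counts and a single one-sided test at a nominal significance level, without multiple testing correction. It is an illustration under those assumptions and not the calculation used for Figure 3E.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), with lam > 0, summed term by term."""
    if k < 0:
        return 0.0
    total = 0.0
    for i in range(k + 1):
        # Each term is computed in log space to avoid overflow for large counts.
        total += math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
    return min(total, 1.0)

def power_negative_selection(expected, omega, alpha=0.05):
    """Power to detect dN/dS = omega < 1 given the neutrally expected count."""
    # Largest observed count that would be called significantly low under neutrality.
    critical = -1
    while poisson_cdf(critical + 1, expected) <= alpha:
        critical += 1
    # Probability of observing a count that low when selection removes mutations.
    return poisson_cdf(critical, omega * expected)

for expected in (2, 20, 100):
    print(expected, round(power_negative_selection(expected, 0.5), 3))
```

With only two mutations expected under neutrality, even observing zero mutations is not significant at the 5% level, so the power is zero regardless of the strength of selection. This is consistent with the lower power reported for truncating mutations, which are expected in smaller numbers than missense mutations.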

We next examined whether specific groups of genes might be subject to negative selection, after excluding 987 putative cancer genes to avoid obscuring the signal of negative selection. Sets of genes that may be expected to be under stronger negative selection include highly expressed genes or genes in active chromatin regions. Lower mutation density has been observed in cancer genomes in highly expressed genes and open chromatin (Figure 3F) (Pleasance et al., 2010b; Schuster-Böckler and Lehner, 2012) and some have suggested that this may be a signal of negative selection (Lee et al., 2010). However, we found that dN/dS values are virtually indistinguishable from neutrality for both missense and truncating substitutions across gene expression levels and chromatin states. This confirms that the lower density of mutations observed in open chromatin and highly expressed genes is due to lower mutation rates in these regions and not negative selection. The lack of detectable negative selection even extends to nonsense mutations in essential genes (Figure 3G; top panel). Gene sets grouped by gene ontology and functional annotation similarly revealed no clear evidence of negative selection (Figure S3E).

One reason for this unexpected weakness of negative selection in cancer could be that cancer cells typically carry two (or more) copies of most genes, reducing the impact of mutations inactivating a single gene copy. We used copy number data for the samples studied here to identify those coding mutations occurring in haploid regions of the genome. Strikingly, most missense and even truncating substitutions affecting the single remaining copy of a gene seem to accumulate at a near-neutral rate, suggesting that they are largely tolerated by cancer cells (Figure 3G; bottom panel). However, for essential genes in regions of copy number 1, nonsense substitutions do exhibit significantly reduced dN/dS, with approximately one-third of such variants lost through negative selection (dN/dS = 0.66, p value = 8.4 × 10⁻⁴, Figure 3G). This result is consistent with the recent observation of weak signals of purifying selection on hemizygous genomic regions (Van den Eynden et al., 2016).

Finally, analysis of mutations in human genes that are intolerant to heterozygous loss-of-function mutations in the germline also revealed no detectable negative selection in cancer cells. This applied similarly to both missense and truncating substitutions ( Figure 3 H).

Overall, these analyses show that negative selection in cancer genomes is much weaker than anticipated. With the exception of driver mutations, nearly all coding substitutions (∼99%) appear to accumulate neutrally during cancer evolution and are tolerated by cancer cells. Several factors are likely to contribute to the weakness of negative selection in cancer and somatic evolution, some highlighted before (McFarland et al., 2013). These include, among other factors: (1) the buffering effect of having two or more copies of most genes; (2) the fact that, for any given somatic lineage, a large number of genes are likely to be dispensable (Morley, 1995); (3) the frequent hitchhiking with driver mutations, which enables weakly deleterious mutations not yet expunged to be fixed in a cancer population; (4) moderately high mutation rates per division and asexual reproduction of cancer cells, which prevent deleterious mutations from being separable from other variants in the genome and lead to their progressive accumulation (known as Muller’s ratchet); and (5) differences in population size and structure, such as stem cell niches, which are likely to exacerbate genetic drift.