Somatic mutation calling is essential for the proper diagnosis and treatment of most cancer patients. Wood et al. developed a machine learning approach called Cerebro that increased the accuracy of calling validated somatic mutations in tumor samples from cancer patients. Cerebro outperformed six other mutation detection methods by better distinguishing technical sequencing artifacts. An analysis of non–small cell lung cancer and melanoma patient samples revealed that Cerebro more accurately classified patients according to their immunotherapy response, suggesting that the authors’ mutation calling approach could favorably affect patient care.

Variability in the accuracy of somatic mutation detection may affect the discovery of alterations and the therapeutic management of cancer patients. To address this issue, we developed a somatic mutation discovery approach based on machine learning that outperformed existing methods in identifying experimentally validated tumor alterations (sensitivity of 97% versus 90 to 99%; positive predictive value of 98% versus 34 to 92%). Analysis of paired tumor-normal exome data from 1368 TCGA (The Cancer Genome Atlas) samples using this method revealed concordance for 74% of mutation calls but also identified likely false-positive and false-negative changes in TCGA data, including in clinically actionable genes. Determination of high-quality somatic mutation calls improved tumor mutation load–based predictions of clinical outcome for melanoma and lung cancer patients previously treated with immune checkpoint inhibitors. Integration of high-quality machine learning mutation detection in clinical next-generation sequencing (NGS) analyses increased the accuracy of test results compared to other clinical sequencing analyses. These analyses provide an approach for improved identification of tumor-specific mutations and have important implications for research and clinical management of cancer patients.

Here, we describe the development of a novel method for somatic mutation identification that uses machine learning strategies to optimize sensitivity and specificity for detection of true alterations. We compared the overall accuracy of this approach to existing methods for somatic mutation identification using simulated and experimentally validated whole-exome and targeted gene analyses. We evaluated the overall concordance of our method with mutation calls from TCGA exomes, examined the underlying causes of erroneous calls, including those in actionable driver genes, and assessed the effects of discordant mutation calls on tumor mutational burden (TMB) and clinical response to cancer immunotherapy. To assess the importance of high-quality mutation analysis in clinical testing, we performed head-to-head comparisons of clinical sequencing with or without our machine learning approach. Overall, these analyses highlight the importance of improved somatic mutation detection for the interpretation of large-scale genome studies and for the application of these approaches to clinical practice.

On the basis of these large-scale genomic efforts, targeted NGS approaches in clinical oncology have begun to be rapidly adopted to identify genetic alterations and make decisions regarding patient therapy and management ( 3 , 5 , 44 – 47 ). Hundreds of laboratories in the United States provide NGS cancer profiling, issuing tens of thousands of reports each year. Nearly all NGS-based analyses for clinical cancer mutation detection are Clinical Laboratory Improvement Amendments/College of American Pathologists (CLIA/CAP)–regulated laboratory-developed tests and have not been approved by the FDA for use as in vitro diagnostics for analyses across tumor types ( 48 – 51 ). The importance of analyzing matched tumor and normal sequencing from the same individual has been reported for improved identification of somatic and germline alterations in cancer patients ( 5 , 52 , 53 ), although few laboratories providing laboratory-developed tests use these approaches. Despite the development of recommendations for validation of NGS testing ( 54 ), many challenges remain in somatic mutation detection, including sensitive detection of alterations that are subclonal or in low-purity tumor samples, as well as distinguishing these from germline changes or from artifacts related to polymerase chain reaction (PCR) amplification or sequencing. Head-to-head comparisons of laboratory-developed NGS tests have reported a wide range of mutation concordance estimates, leading to concern regarding the accuracy of such tests ( 55 – 58 ).

Systematic discovery of somatic alterations in new driver genes and pathways began with large-scale sequencing analyses in various human cancers using Sanger sequencing and NGS ( 21 – 30 ). These efforts were extended by The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium, which profiled the molecular landscape of 33 common cancer types and more than 10,000 patients ( 31 , 32 ). Because of the scale and complexity of TCGA, initial analysis efforts to characterize somatic mutations within tumors were typically unique to each cancer type ( 33 – 42 ). More recently, the PanCanAtlas project has generated a unified set of consensus mutation calls (Multi-Center Mutation Calling in Multiple Cancers or “MC3”) across the entire TCGA cohort ( 43 ), although little, if any, validation of mutations has been performed for these data sets.

Cancer therapies targeting specific driver genes altered in tumors are now commonly used in the clinic ( 9 , 10 ). Examples include vemurafenib and trametinib in BRAF-mutated melanoma ( 11 ), and erlotinib and osimertinib in epidermal growth factor receptor (EGFR)–mutated non–small cell lung cancer (NSCLC) ( 12 , 13 ). The immune checkpoint inhibitor pembrolizumab recently received accelerated approval by the U.S. Food and Drug Administration (FDA) for treatment of patients with solid tumors from any tissue type where the tumor was determined to have mismatch repair deficiency (dMMR) resulting in microsatellite instability (MSI-H) and higher mutation loads ( 14 ). High tumor mutation burden, resulting from repair defects or exposure to mutagens, has been associated with durable clinical response to a variety of immune checkpoint inhibitors, including nivolumab, ipilimumab, and atezolizumab ( 15 – 19 ), and may serve as a predictive biomarker for treatment response ( 20 ).

Comprehensive molecular profiling of cancer through next-generation sequencing (NGS) approaches is increasingly used in oncology for diagnostic and therapeutic management decisions ( 1 – 8 ). Tumor-specific (somatic) mutations, including single-nucleotide alterations and small insertions or deletions, are known to affect key driver genes early during tumorigenesis. Over time, cancers accumulate additional mutations that may influence the underlying biology of the tumor cells, representing new opportunities for therapeutic intervention.

In our second comparison, we identified 46 alterations among the 15 patients ( Table 3 ). The CancerSELECT 125 panel achieved 100% sensitivity and positive predictive values for all alterations identified. The MSK-IMPACT assay, which included a manual review of all candidate alterations, identified all but one missense mutation in PMS2 that had been detected by CancerSELECT 125. This alteration was confirmed to be a bona fide change using ddPCR (fig. S8). To further demonstrate the importance of the Cerebro base caller within CancerSELECT 125, we evaluated the NGS data from our analyses using two mutation callers and demonstrated superior sensitivity and positive predictive values for Cerebro (table S13). Overall, these analyses establish that the Cerebro-trained CancerSELECT 125 had superior performance compared to other commonly used clinical NGS platforms, and highlight the importance of high-accuracy somatic mutation detection for identification of bona fide alterations in clinical NGS assays.

The first analysis resulted in a set of true somatic mutations consisting of 30 SBSs and six indels in commonly analyzed regions among the 19 patients ( Table 2 ). The CancerSELECT 125 panel using Cerebro achieved 100% sensitivity and positive predictive values for all alterations, including SBSs and indels, and outperformed Oncomine and TruSeq. The Oncomine assay resulted in a sensitivity of 97% for SBSs and 83% for indels. The two alterations detected by CancerSELECT 125 and missed by Oncomine were confirmed using ddPCR (fig. S8). The TruSeq panel was unable to detect 15 SBSs and three indels, resulting in a lower sensitivity of 50% for both types of alterations ( Table 2 and fig. S9). The CancerSELECT 125, Oncomine, and TruSeq approaches resulted in 0, 17, and 175 false positives, respectively. Seventeen false-positive SBSs reported by the Oncomine or TruSeq panels were single-nucleotide polymorphisms (SNPs) that could be identified in databases of known germline variants and had been removed by the Cerebro approach. However, the large number of false-positive results identified by the TruSeq panel suggests additional underlying technical causes, including PCR artifacts and sequencing error, resulting in a positive predictive value of 8% ( Table 2 and fig. S9). Because of the high number of TruSeq nongermline false positives, including in putative driver hotspots, we performed a detailed evaluation of these candidate mutations in the NGS data obtained using the CancerSELECT 125 and Oncomine platforms, and found no evidence of these mutations predicted by the TruSeq approach (table S12).

To evaluate the effect of somatic mutation detection methods on NGS clinical cancer sequencing tests, we compared the Cerebro method trained for PGDx CancerSELECT 125 to three approaches for mutational profiling: the Thermo Fisher Oncomine Comprehensive Assay, the Illumina TruSeq Amplicon—Cancer Panel, and the MSK-IMPACT (Memorial Sloan Kettering–Integrated Mutation Profiling of Actionable Cancer Targets) panel. The analysis was performed using two sample cohorts, including replicates of formalin-fixed paraffin-embedded (FFPE) or frozen tumor samples from 22 lung cancer patients analyzed by CancerSELECT 125, Oncomine, and TruSeq or replicates of 15 breast, lung, and colorectal cancers with matched normal samples analyzed by both CancerSELECT 125 and MSK-IMPACT. For each sample, adjacent interspersed FFPE sections were evaluated using these approaches. Samples from three patients could not be analyzed using the TruSeq Amplicon—Cancer Panel due to insufficient DNA and were excluded from comparative analyses. Putative somatic mutations for the remaining patients were used for concordance evaluation and were limited to genomic regions comprising those common to the approaches in each analysis. Mutations were considered true positives if they were detected in at least two assays or if identified in only one of the assays and independently confirmed using ddPCR.

Given the association of TMB with clinical outcome in patients treated with immune checkpoint blockade ( 8 , 17 , 19 ), we wondered whether our analyses could be used to improve the classification of patients into mutator groups with different clinical outcomes. Cerebro analyses revealed that the average SBS mutational burdens for NSCLC and melanoma groups were 187 and 501, respectively, both substantially different than the original publications (266 and 402, respectively). Previously, melanoma analyses were performed using SomaticSniper ( 60 ) and led to a lower number of detected mutations, consistent with the lower sensitivity that we observed with this method ( Fig. 2 , A and C). Using the individual TMB from our analyses, we observed an improved prediction compared to previous mutation calls for progression-free survival (NSCLC) and overall survival (melanoma) for each entire patient cohort ( Fig. 5 , C and D, and fig. S6), and improved prediction compared to the previous validation set for the melanoma cohort (fig. S7). We examined whether these analyses would lead to changes in the classification of patients from high TMB to low TMB or vice versa. We found that of the 34 NSCLC patients treated with anti–PD-1 therapy, Cerebro mutation calls resulted in 4 patients (11.8%) switching from high TMB in the original publication to low TMB in our analysis ( Fig. 5 , C and D). These four patients had an average progression-free survival of 3.25 months. In contrast, of the 64 melanoma patients treated with anti–CTLA-4, Cerebro mutation calls resulted in 9 patients (14.0%) switching from low TMB classification in the original publication to high TMB in our analysis. These nine patients had an average overall survival of 40 months. Overall, these analyses suggest that improved mutation determination may have even greater impact on clinical outcome for immune checkpoint blockade than previously anticipated.

Comparison of Cerebro mutation calls with published calls associated with NSCLC (left panels) or melanoma (right panels). ( A ) Unique/shared mutation status for all patients. ( B ) Problematic mutations unique to original publications annotated by characteristic issue. Annotation, conflicting consequence; TDP < 3, tumor distinct pairs less than 3; TBQ < 30, tumor base quality less than 30; Unaligned, no alignment of mutated reads to the mutation position; MAF < 5%, mutant allele frequency less than 5%. ( C ) Kaplan-Meier analysis of progression-free survival (left) or overall survival (right) using tumor mutation loads from original publications. ( D ) Kaplan-Meier analysis of the same samples using Cerebro mutational loads. Log-rank P value shown for each survival plot. DCB, durable clinical benefit; NDB, no durable benefit.

To evaluate recently reported associations between total somatic mutation count (that is, mutational burden) and response to immune checkpoint blockade, we next obtained paired tumor-normal exome data from two recent studies including a study of response to anti–PD-1 therapy for 34 NSCLC patients ( 19 ) and a study of response to anti–CTLA-4 in 64 melanoma patients ( 17 ). We compared mutations called in these NGS data by Cerebro to the mutations reported in the original publications, limiting our analyses to nonsynonymous SBS changes because other types of alterations were not included in the published analyses. Across the NSCLC cohort, 9049 and 6385 mutations were identified in the original study and our reanalysis, respectively (table S11). In the melanoma cohort, 25,753 and 32,092 mutations were identified by the original publication and our reanalysis, respectively. Among all mutations in the NSCLC set, 48.2% were concordant between Cerebro and the original report, whereas 61.9% of mutations in the melanoma cohort were concordant ( Fig. 5A ). We performed an in-depth characterization of mutations that were identified in the original publications but that would be considered false positives using Cerebro and found that the large majority of such calls could be attributed to systematic issues such as limited observations of the mutation in distinct read pairs, poor base quality at the mutation position, and inaccurate alignment ( Fig. 5B ).

Evaluation of mutations in 66 oncogenes and tumor suppressor genes indicated ( A ) a large number of low-confidence mutations unique to TCGA associated with various problematic features. No High-Quality Alignment, no consistent alignment found with at least one mutant base with quality higher than 30; MAF < 5%, mutant allele frequency below 5%; TBQ < 30, tumor base quality below 30. ( B ) Distribution of problematic TCGA driver gene calls by genes with approved FDA therapies (left panel) or ongoing clinical trials (right panel). ( C ) Quality characteristics of mutations uniquely called by TCGA were more problematic than other identified mutations. NMAF, normal mutant allele frequency; TMMQ, tumor mutant mapping quality; TBQ, tumor mutant base quality).

To more carefully evaluate discordant alterations in TCGA, we investigated somatic mutation calls for a subset of 66 well-characterized cancer driver genes (table S10). Of the 1368 evaluated tumors, 1257 (92%) had a mutation in this gene set, with 4037 shared somatic mutations between TCGA and Cerebro, and a total of 777 alterations that were discordant between the analyses. Further examination revealed that most of the 429 mutations called only by TCGA were associated with poor sequence quality and alignment issues that were likely to not represent bona fide alterations ( Fig. 4A ). Among driver genes associated with FDA-approved therapies or ongoing clinical trials ( Fig. 4B ), we found low-confidence TCGA mutations in 211 (16.8%) patients analyzed. We found that mutations uniquely called by TCGA were of significantly lower confidence than mutations observed by both platforms (P < 0.0001; Fig. 4C ) or those that were uniquely detected by Cerebro (P < 0.0001; Fig. 4C ), as determined by measures of sequence and alignment quality at those positions as well as the presence of the alterations in the matching normal sample. Analysis of COSMIC (Catalogue Of Somatic Mutations In Cancer) hotspot mutations in these genes identified 38 alterations that were missed by TCGA analyses, representing 4.3% of all COSMIC hotspots analyzed (table S10). The alterations that were missed included those in 15 patients with KRAS hotspots at codons 12 and 13 that had an average mutant allele fraction of 28%. In contrast, the Cerebro approach may have missed only one KRAS hotspot alteration in a tumor because of the fact that this alteration was also present in the matching normal sample of that patient. Overall, these results suggest that current TCGA data sets contain a substantial number of false-negative and false-positive somatic alterations in both passenger and driver genes among many tumor types, and the new mutation calls determined here provide an improved resource for analysis of TCGA sequence data.

Somatic mutations from the TCGA MC3 project and Cerebro were compared for concordance. ( A ) Total mutational loads between the two mutation calls shared by cancer type. Mutation loads were defined as the total number of nonsynonymous mutations per sample. LUSC, lung squamous cell carcinoma; LUAD, lung adenocarcinoma; BLCA, bladder; STAD, stomach; COAD, colorectal; SARC, sarcoma; HNSC, head and neck squamous cell; SKCM, melanoma; UCEC, uterine; LIHC, liver; KIRC, kidney; *-H, set enriched for high–mutation load samples. ( B ) Unique/shared status for somatic mutations across all samples. C, colorectal; H, head and heck squamous cell.

The total number of somatic mutations measured across the various tumor types was largely similar to previous analyses of TCGA exomes, indicating that estimated mutational loads derived from Cerebro were consistent and representative of the TMB ( Fig. 3A and fig. S4). We found a significant positive correlation between mutation loads called by Cerebro and the TCGA PanCanAtlas MC3 method ( 43 ) that uses consensus calls among seven different mutation callers (Pearson correlation coefficient = 0.93, P < 0.0001; Fig. 3A ), with 74.0% of somatic mutations shared between the two approaches. However, we found that 10.3% of calls detected by Cerebro (n = 44,439) were apparently missed by TCGA, whereas 15.7% of alterations identified by TCGA (n = 68,138) were not considered somatic alterations by Cerebro ( Fig. 3B and table S9). Individual tumors that were reanalyzed by Cerebro had mutation loads that differed by as many as 390 (95%) fewer or 729 (800%) more alterations compared to original calls. Comparison of detected alterations between Cerebro and other TCGA call sets [MC3, MuTect2 ( 59 ), and FIREHOSE original calls] showed increasing concordance with Cerebro calls from MuTect2 (least concordant; average, 60.2%) to MC3 (most concordant; average, 75.8%) (figs. S4 and S5). These observations are consistent with our analyses of individual mutation callers ( Fig. 2 ) and support the notion that the use of multiple approaches in MC3 is likely to reduce the false-positive observations resulting from individual methods but may still have additional errors compared to Cerebro.

We assessed whether the improved capabilities of Cerebro could be used to increase the accuracy of mutation calling in large-scale cancer genome sequencing efforts including TCGA, because these serve as the basis for various research efforts in human cancer. We used Cerebro to analyze paired tumor-normal exomes from 1368 patients in TCGA, focusing on tumors that would be relevant for both targeted therapies and immunotherapy. This set consisted of all available patients with non–small cell lung adenocarcinoma, non–small cell lung squamous cell carcinoma, or bladder urothelial carcinoma, as well as selected patients with higher mutational loads that had melanoma or colorectal, gastric, head and neck, hepatocellular, renal, or uterine cancer ( Table 1 ). A total of 365,539 somatic alterations were identified, with an average of 267 somatic alterations per tumor (range, 1 to 5871).

To assess the mutation detection performance of Cerebro using independently obtained experimental data, we next analyzed five matched tumor and normal specimens for which somatic mutations had been previously identified and validated through independent whole-exome sequencing (WES) ( 28 , 30 ). These previous analyses carefully evaluated the entire coding sequences of the samples through PCR amplification of 173,000 coding regions and Sanger sequencing of the amplification products ( 28 , 30 ). Any observed alteration was resequenced in the tumor and normal sample to confirm its tumor origin. Because the Sanger sequencing analyses were designed to identify only clonal or near-clonal alterations, we supplemented the Sanger-validated alterations previously observed in this set of samples (n = 314) with additional bona fide changes that had either been identified by a consensus of multiple NGS mutation callers (n = 163) or detected by up to two mutation callers and validated using droplet digital PCR (ddPCR) (n = 18; fig. S2 and table S7), a highly sensitive method for detection of alterations in a subset of DNA molecules ( 64 ). Comparison of results from all mutation callers with this reference set of alterations revealed that Cerebro had the highest overall accuracy compared to other methods ( Fig. 2C , fig. S3, and table S8).

To systematically assess the accuracy of Cerebro for mutation detection, we designed a series of validation studies using simulated and experimental cancer exomes and evaluated the performance of this approach as compared to existing software tools commonly used for somatic variant identification in research and clinical genomic analyses ( Fig. 2A and table S2) ( 59 – 63 ). We performed three studies using a set of six normal cell lines with NGS exome data. Two of those studies contained simulated mutations, with in silico somatic mutations spiked in to a normal DNA sample. In our first study, we simulated low-purity tumors by incorporating 132 coding somatic mutations [120 single-base substitutions (SBSs), and 12 insertions and deletions (indels)] with mutant allele frequencies ranging from 10 to 25% in the exome data ( Fig. 2A ). To characterize false-positive rates for the various tools, we analyzed technical replicate exome pairs of the six normal cell lines (fig. S1 and table S3). We additionally created simulated exomes with in silico spike-ins of 7000 somatic SBS and indel changes with variable mutant allele frequencies (ranging from 10 to 100%), comprising a total of 42,000 somatic changes across the six samples that could be detected through these approaches ( Fig. 2B ). This last study allowed us to examine the sensitivity of the various tools for specific mutation types and allele frequencies. In all cases, the location, type, and level of in silico alterations were different from those used in the training of the Cerebro algorithm. Overall, we observed substantial variation in sensitivity and positive predictive value among the tested variant classification programs ( Fig. 2 , A and B, and tables S4 to S6). Cerebro maintained the highest level of sensitivity and positive predictive value, whereas other methods resulted in moderate to high false-positive rates ( Fig. 2A and tables S4 to S6). False-positive calls by other methods were frequently associated with indicators such as poor tumor/normal coverage, low mutant base quality, and low mutant mapping quality (fig. S1).

We considered more than 300 features that might optimize performance for identifying true somatic variants, ultimately selecting 15 feature categories from two separate alignment programs that included alignment characteristics (mapping quality and mismatches), sequence quality information (coverage and base quality), and information related to specific alterations (allele frequency, nearby sequence complexity, and presence of alteration in matching normal specimen) (table S1). Once implemented, Cerebro used 1000 decision trees for analysis of each mutation, with each tree evaluating a unique combination of the selected information supporting a candidate variant. The resultant confidence score from the Cerebro model represented the proportion of decision trees that would classify a candidate variant as somatic.

The manner in which this training set was created is an integral part of the machine learning approach that we present here. With this training set, we provided our classifier a representative set of experimentally obtained artifactual changes and germline alterations across the exome, along with a large number of true in silico mutations comprising a wide spectrum of allele frequencies and genomic contexts. Adding in silico mutations allowed us to include training data in regions where experimentally obtained alterations would not have provided sufficient sensitivity across the exome.

We created a method for analyzing next-generation cancer sequence data, called Cerebro, that uses machine learning to identify high-confidence somatic mutations while minimizing false positives ( Fig. 1 ). Cerebro uses a specialized random forest classification model that evaluates a large set of decision trees to generate a confidence score for each candidate variant (Materials and Methods). The model was trained using a normal peripheral blood DNA sample, where exome regions were captured and sequenced twice using NGS methods. More than 30,000 somatic variants comprising substitutions, insertions, and deletions at mutant allele fractions from 1.5 to 100% were introduced in silico into one set of NGS data to provide the classifier with a training set of “tumor-specific” mutations as well as a real-world representative source of more than 2 million NGS errors and artifacts that might otherwise be mislabeled as variants. The second NGS sequence data set from the same sample was used as the matched normal, and the combined data sets were analyzed for the detection of somatic variants.

Overall, as cancer genomic analyses continue to expand and gain acceptance in the clinical community, the ability to effectively design and validate methods for mutation identification remains challenging. Our approach, which is entirely automated, would eliminate the need for expert review of sequence data, a practice commonly used in the clinical setting and likely unsustainable for widespread NGS analyses. Large-scale validation of these methods will provide new opportunities for the treatment and management of patients with cancer.

This study has implications beyond analyses of tumor tissues, including, for example, in deep sequencing of cell-free DNA (cfDNA) to identify somatic mutations in the circulation ( 76 ). Current approaches for cfDNA analyses use sequence data with over ~30,000× coverage to identify alterations with concentrations as low as 0.05% ( 64 , 76 – 85 ). Highly accurate analyses of these sequences will be needed to provide robust differentiation of true-positive and true-negative mutation calls, especially in cases without previous knowledge of sequence alterations from tumor tissue.

Our analyses may have important implications for therapies using genomic information, including targeted therapies and immune therapy approaches targeting mutation-associated neoantigens. Although MSI testing is currently used to identify the small number of patients with dMMR that benefit from immune checkpoint blockade, there are large-scale efforts in developing TMB as a biomarker for immunotherapy response because this approach could identify a larger number of patients likely to benefit from this therapy. Although the number of samples that we analyzed was small, our use of Cerebro to determine TMB in NSCLC and melanoma patients improved the accuracy of stratifying patients into likely responders and nonresponders; in particular, 13% of patients from these two studies were reclassified using the Cerebro mutation analyses. Improved discrimination of bona fide alterations may facilitate development of mutation load–based predictive biomarkers for immune checkpoint blockade as well as for understanding changes in cancer genomes during immune therapy ( 75 ). Development of mutation-specific vaccines and immune cell–based therapies will require high-confidence identification of alterations that may be unique to individual patients.

Given the fundamental importance of somatic genomic alterations in human cancer, the improvements that we have developed are likely to have meaningful implications for research and clinical analyses. Our reanalysis of TCGA and exome data from patients treated with cancer immunotherapy identified that a substantial fraction of existing mutation calls are likely to be false-positive changes associated with low-quality evidence and that many true alterations may have been missed in current databases. We estimated that 16% of alterations are likely inaccurate in current TCGA mutation databases and an additional 10% of true alterations may have been missed in these data sets. If these ratios are accurate across TCGA (with ~2 million somatic mutations across 10,000 exomes), then the overall number of false-positive and false-negative changes in TCGA is likely to be >500,000. Such discrepancies are likely to be important across a variety of additional efforts, as much of TCGA has been incorporated into mission-critical databases such as COSMIC ( 69 – 71 ), gnomAD/ExAC ( 72 ), Genomic Data Commons ( 73 ), and the International Cancer Genome Consortium ( 74 ).

Despite the advantages of our approach, this study has several limitations. For example, our training data sets were specific to certain genomic analyses. Because we have focused on coding regions within exome or targeted sequencing with >150× coverage using Illumina platforms, analyses of NGS data with different characteristics, such as other sequencing technologies or lower sequence coverage levels, would not be expected to be as successful using the current set of features, parameters, and training data. Additionally, analyses of whole-genome sequencing data sets containing noncoding and repeat elements would also likely require expanded training data to reflect the unique characteristics of these regions. Our approach is not intended for identification of germline variants, although other methods have been designed for this purpose ( 68 ). Furthermore, our analyses were largely focused on tumors with high tumor cellularity, and we have not comprehensively evaluated detection of low-frequency alterations, nor did we evaluate structural changes in these tumors. Although our approach has been optimized for somatic mutation identification using a common NGS technology, we expect that with sufficient training data, our method could be successfully used with new sequencing platforms and applications.

Our assessment of mutation calling approaches revealed that existing methods for somatic mutation detection may be influenced by factors that can lead to considerable false positives and false negatives. The machine learning approach that we described here, together with the large amount of experimental and in silico training data, identified key features in NGS sequence data to minimize false-positive calls and to improve sensitivity for bona fide alterations. An important aspect of our approach is the use of dual sequence alignments to improve identification of bona fide alterations (Supplementary Materials and Methods), effectively removing half of the erroneous mutation calls. Our overall approach provides the highest sensitivity and specificity of all methods that we analyzed. Although it is conceivable that other mutation callers could achieve similar performance characteristics, it would be challenging to identify appropriate parameters for these callers without taking a large-scale training and validation approaches similar to ours. Machine learning methods have previously been proposed for somatic mutation discovery including MutationSeq ( 65 ), SomaticSeq ( 66 ), and SNooPer ( 67 ). However, these tools were not trained with extensive data, nor were they validated using independent (non-NGS) sequencing approaches.

This study describes the development of a machine learning approach for optimizing somatic mutation detection in human cancer. Our analyses demonstrated that high-accuracy mutation detection can improve identification of bona fide alterations to determine total mutational burden to predict outcomes to immunotherapy, as well as to detect alterations in potentially actionable driver genes. These data highlight the challenges of detecting somatic sequence alterations in human cancer and provide a broadly applicable means for detecting such changes that is more accurate than existing approaches.

MATERIALS AND METHODS

Study design This study provides a somatic mutation detection algorithm using a machine learning approach. We trained our method using in silico mutations spiked into replicate exome sequencing runs of a well-characterized peripheral blood sample. We estimated our method’s sensitivity and specificity using five matched tumor/normal cancer cell line pairs with somatic variants previously identified and validated through Sanger sequencing and other means. Additionally, we compared our method’s accuracy against six other mutation detection methods using the same validated mutation data. We determined concordance between our method and a consensus somatic mutation call set using exome data from 1368 patients available through the TCGA. We also evaluated the association between clinical response to immune checkpoint blockade and TMB using two cohorts of paired tumor-tissue and normal exome samples obtained before immune therapy. These cohorts were composed of stage IV NSCLC patients (n = 34) treated with anti–PD-1 and metastatic melanoma patients (n = 64) treated with anti–CTLA-4. We compared the performance of the method to three additional NGS clinical cancer sequencing tests using two sample cohorts, including replicates of FFPE or frozen tumor samples from breast, lung, and colorectal cancers (n = 37). All analyzed samples were obtained under Institutional Review Board–approved protocols with informed consent for research use at participating institutions.

Development of Cerebro In choosing a machine learning model for Cerebro’s development, we focused on models that could quickly process large amounts of training data during model fitting. This was an important consideration given the large amount of training data used (more than 2 million candidate mutations) and the need to build many models during initial testing. We used the scikit-learn software package (86) to implement our models, and this also informed our model decision. We considered machine learning techniques such as support vector machines (SVMs) available through scikit-learn, but as the SVM training algorithm scales more than quadratically with the number of training examples, we found that SVMs were not applicable to our training data. Additionally, the implementations of models that use either adaptive or gradient boosting did not support parallelization of model fitting, which also made them unsuitable for our training data. Random Forest classifiers (87) provide support for parallelized model fitting and are well suited to large training data sets. The closely related approach of extremely randomized trees (or “Extra-Trees”) (88) is able to fit models more rapidly because of the Extra-Trees classifier’s selection of random thresholds versus the Random Forest classifier’s computationally expensive determination of optimal thresholds for each examined feature. We used default settings in the Extra-Trees model, with the exception of (i) parallelization options to use all available processing cores and (ii) increasing the number of decision trees in the model to 1000. All other parameters were left as default values as provided by scikit-learn (criterion=“gini”; max_features=“auto”; max_depth=None; min_samples_split=2; min_samples_leaf=1; min_weight_fraction_leaf=0; max_leaf_nodes=None; bootstrap=False; oob_score=default; random_state=None; warm_start=False; class_weight=None). On the basis of review of visual analyses of bona fide somatic mutations, we chose a set of features that described mutation, sequence quality, genomic context, and alignment characteristics. These are described in table S1. We used two common aligners [BWA-MEM (Burrows-Wheeler Aligner–Maximal Exact Match) (89) and Bowtie2 (90)] to provide multiple estimates of alignment-related features. We did not make any attempt to remove redundant features from our model because these are appropriate for Extra-Trees models.

Processing of exome data with Cerebro Reads from tumor cell lines and matched normal samples sequenced at PGDx were adapter-masked and demultiplexed using bcl2fastq (http://support.illumina.com). All read data, including those from PGDx, TCGA, and immunotherapy whole-exome studies, were aligned with BWA-MEM (89) and Bowtie2 (90) to the hg19 reference assembly of the human genome (91), with unplaced and unlocalized contigs and haplotype chromosomes removed. Then, Cerebro identified candidate somatic mutations by examining alignments in the matched tumor and normal samples. Alignment data were filtered to remove nonprimary alignment records, reads mapped into improper pairs, and reads with more than six edits. Individual bases were excluded from mutant coverage calculation if their Phred base quality was <30 in tumor samples and <20 in normal samples. Only candidate somatic variants found in both pairs of alignments (BWA-MEM and Bowtie2) were scored using our confidence scoring model (see table S1). Candidate variants with somatic confidence scores <0.75, <3 distinct mutant fragments in the tumor, <10% mutant allele fraction (MAF) in the tumor, or <10 distinct coverage in the normal sample were removed. For our analysis of cancer immunotherapy response–associated data (17) or NSCLC (19), we included mutations ≥5% MAF to compensate for the low tumor purity that appeared to be present in some samples. For mutations found in at least 50 samples according to the COSMIC v72 database (“hotspots”), we applied relaxed cutoffs. For such hotspot mutations, we only excluded bases in the tumor sample if their Phred base quality was <20. We also only removed candidate hotspot mutations if they had somatic confidence scores <0.25, <2 distinct mutant fragments in the tumor, or <5% MAF in the tumor. Because sequence data obtained from TCGA were often less than 100 bp in length, which we found to reduce Bowtie2 alignment sensitivity for long indels, we also created a set of relaxed cutoffs for hotspots that were indels >8 bp in length. These relaxed indel hotspot filtering criteria focused only on the BWA-MEM alignments and removed mutations with <5 distinct fragments in the tumor, a left-tailed Fisher’s exact test P > 0.01, <5% MAF in the tumor, or any mutant fragments in the normal sample. Variants were further filtered for coding consequence using VEP (Variant Effect Predictor) (92) and CCDS (Consensus Coding Sequence)/RefSeq (93) to remove intragenic and synonymous mutations. Finally, variants that were listed as Common in dbSNP (Single Nucleotide Polymorphism database) (94) version 138 were removed.

Processing of exome data with external variant callers Read data were aligned with BWA-MEM (89) to an hg19 reference assembly of the human genome (91), with unplaced and unlocalized contigs and haplotype chromosomes removed. The Picard (95) MarkDuplicatesWithMateCigar program was used on the resulting BAM files to find optical and PCR duplicates. Each external variant caller was run with default parameters and filters as described in Supplementary Materials and Methods. In the case of Strelka, we used the reported “tier 2” set of variants. Similarly to the processing with Cerebro, for all variant callers, we removed variants with MAF < 10%, as well as intragenic and synonymous mutations, and variants listed as Common in dbSNP138. Variants failing a caller’s default set of filters were also removed. For VarDict, we also removed variants that hit either of two filters suggested by one of VarDict’s authors (see the Supplementary Materials).

Confidence scoring model and training A sample derived from normal peripheral blood (CRL-2339, American Type Culture Collection) was used to generate a genomic library for exome capture as previously described (5) and analyzed twice using NGS. One of these sequencing runs was designated the “training tumor” and had novel variants spiked into it using BAMSurgeon (96). Novel coding variants were randomly generated across the exome, at MAFs ranging from 1.5625 to 100%, according to a distribution defined by 2−R, where R is a uniform random variable between 0 and 6. After accounting for the variants that could not be inserted because of low coverage or presence of polymorphisms, these novel variants were a mixture of 16,958 substitutions, 6675 insertions, and 6720 deletions. The range of MAFs used for training was intended to begin well below the expected limit of detection (5 or 10% MAF) to ensure that calls near the limit would be accurate. Novel indels ranged from 1 to 18 bp in length. Additional novel indels were spiked in by locating short-tandem repeat tracts (either mono-, di-, or trinucleotide repeats) within the exome and inserting one or two repeat unit contractions or extensions of the tracts. After spiking the training tumor with BAMSurgeon, the read data from the training tumor were realigned with both BWA-MEM and Bowtie2, and then all candidate somatic variants supported by at least one tumor read were reported using Cerebro. Candidate somatic variants found in both pairs of alignments formed the training set for the scoring model, which included the 30,353 spiked “true somatic” mutations (class 1) and 2,016,867 artifactual or germline mutations (class 0). For each candidate somatic variant, Cerebro considers the mutation, sequence quality, genomic context, and alignment characteristics (listed in table S1). These values were calculated for the two sets of alignments and were concatenated together to form a feature vector for a candidate somatic variant. The training set for the scoring model consisted of the feature vectors for each candidate variant and a class labeling indicating whether or not the variant was spiked into the training tumor (that is, if the candidate is a somatic variant or not). The scoring model itself is an extremely randomized trees model (88) with 1000 decision trees, implemented using the scikit-learn library (86), with the reported confidence score being the percentage of the model’s trees that would classify the variant as somatic.

Validation of somatic variants We analyzed five matched tumor/normal breast cancer cell line pairs in which several hundred somatic variants had previously been identified and validated through Sanger sequencing analyses (28, 30). To add to this set, we performed exome sequencing of the cell lines using an NGS approach as previously described (5). We then analyzed the NGS data through three variant calling programs (VarDict, MuTect 1, and our Cerebro pipeline). Somatic variants called by all three programs were considered to be validated, provided that they passed a visual inspection to remove possible artifacts; mutations in this set that did not clearly pass visual inspection were tested using ddPCR. Somatic variants called by one or two of those programs with a reported MAF of at least 20% (by at least one program) were visually inspected for alignment artifacts. Those variants passing visual inspection and having at least 10 reads covering the locus in the normal sample were then tested using ddPCR, and variants validated by ddPCR formed the remainder of our validated variant set.

Evaluation of variant caller accuracy For simulated data sets, true positives (TPs) were those spiked-in somatic variants found by a program, false positives (FPs) were variants called by a program that were not spiked in, and false negatives (FNs) were spiked-in variants not called by a program. For these simulated data sets, sensitivity is defined as TP/(TP + FN), positive predictive value (PPV) is defined as TP/(TP + FP), and false-positive rate (FPR) is the number of false positives reported per megabase of the exome (51.5 Mbp). For the cell line data sets, we created a validated variant set as described above and selected those variants called by at least one caller as having a MAF of at least 20%; these selected variants created the validated comparison set. When evaluating the variant callers on cell line data, TP were comparison variants found by a program; FP were variants not in the comparison set that were called by a program with a MAF of at least 20%; and FN were comparison variants not found by a program. Because we only validated variants reported to have a MAF of at least 20%, we defined PPV for the cell line data as X/(X + FP), where X is the number of TP variants that a program claimed had a MAF of at least 20%. This approach allowed us to compensate for the variation in reported allele frequency between variant calling programs by restricting PPV calculation to only those variants that a program reported as being over the validation MAF threshold of 20%. The following scores were used for thresholding for Precision-Recall/ROC curve generation: Cerebro: Cerebro score; MuTect1: t_lod_fstar ("Log of (likelihood tumor event is real / likelihood event is sequencing error)"); MuTect2: TLOD (same as MuTect1); SomaticSniper: tumor mutant coverage; Strelka: "Quality score" in VCF ("QSS"/"QSI" from SNV/indel); VarDict: Variant quality in VCF (log(AD) * mean BQ); VarScan: −log(somatic P value).

Simulation experiments We performed three simulation experiments designed to evaluate the accuracy of the various variant calling programs. In each experiment, we used a set of simulated tumor-normal pairs created by sequencing six exome-captured normal cell lines twice to create six sample pairs; one of the samples in each pair was designated the “tumor,” and both samples were aligned to the human genome using Bowtie2. Depending on the experiment, a set of artificial coding somatic variants were inserted into the tumors using BAMSurgeon. After BAMSurgeon was run, the read sequence data were extracted from the modified BAM file, and the resultant FASTQ data were aligned again using the methods described above. Our first simulation experiment was designed to simulate low-purity tumors with 120 SBSs and 12 indels inserted into each tumor at MAFs ranging from 10 to 25%. In this final simulation, we evaluated several accuracy metrics, including sensitivity, PPV, FPR, and F-score. The second experiment was solely a test of specificity, where we did not insert any somatic variants into our simulated tumor; this examination of somatic variant calling specificity with technical replicates is similar to that discussed by Saunders et al. (61) in their presentation of Strelka. In our final simulation experiment, in which we focused on sensitivity, we inserted 7000 coding variants, consisting of 1000 SBSs and 6000 indels (1000 each of 1-, 2-, and 3-bp insertions and 1-, 2-, and 3-bp deletions). The second experiment’s variants were inserted at MAFs ranging from 10 to 100%.

ddPCR methods ddPCR forward and reverse primers as well as wild-type and mutant probes were created using the Bio-Rad ddPCR Custom Design Portal (www.bio-rad.com/digital-assays). Genomic DNA corresponding to 10,000 genome equivalents was added to 10 μl of 2× ddPCR Supermix (Bio-Rad), 1 μl of 20× target primers and probe (FAM), and 1 μl of 20× reference primers and probe (HEX) and brought to a 22-μl volume with nuclease-free water to create a reaction mix. A DG8 cartridge in a DG8 cartridge holder (Bio-Rad) was loaded with 20 μl of reaction mix and 70 μl of Droplet Generation Oil (Bio-Rad). The cartridge was placed in a QX200 Droplet Generator to generate about 20,000 nanoliter-sized droplets. Droplets were loaded into a twin.tec 96-well, semi-skirted plate (Eppendorf) and sealed with a foil heat seal using a PX1 PCR Plate Sealer (Bio-Rad). Subsequent PCR cycling was performed on a C1000 Touch Thermal Cycler (Bio-Rad) with the following conditions: 95°C for 10 min, followed by 40 cycles of 94°C for 30 s and 55°C for 1 min, and ending with 98°C for 10 min. The plate was then loaded on a QX200 Droplet Reader (Bio-Rad), and PCR-positive and PCR-negative droplets were quantified. Raw droplet data were analyzed with QuantaSoft software (Bio-Rad). Thresholds were manually assigned using two-dimensional amplitude clustering plots and the crosshair tool for each tumor and normal pair. Tumor samples were run in duplicate, and the average mutant allele fraction was taken. For the comparative evaluation of clinical targeted sequencing panels, the ddPCR protocol was modified as follows: 5500 genome equivalents were used, and 1 μl of a 20× mixture of target primers and probes and reference primers and probes (FAM and HEX, respectively) were used (Bio-Rad). Because there were no matched normal samples, wild-type and mutant oligomers were designed as controls for each target investigated (Operon Biotechnologies).

WES data extraction and annotation The results shown here are in part based on data generated by the TCGA Research Network (http://cancergenome.nih.gov), as outlined in the TCGA publication guidelines http://cancergenome.nih.gov/publications/publicationguidelines. TCGA WES data sets (bam alignment files) represented untreated primary tumors and paired normal tissue samples obtained from the Cancer Genomics Hub (http://cghub.ucsc.edu). WES-derived somatic mutation calls were obtained from the MC3 project. TCGA somatic mutation calls (v0.2.8) were also obtained from the Synapse repository (http://synapse.org; Synapse ID, syn7214402). Variant-supporting coverage and total coverage were extracted and manually reviewed for consistency. To normalize across cancer types, mutations with fewer than three variant-associated reads or less than 10% mutant allele frequency were filtered before downstream comparative analysis. For mutations found in at least 50 samples according to the COSMIC v72 database (hotspots), we allowed for 5% minimum mutant allele frequency. Additional somatic mutation call sets generated by MuTect2 were downloaded from the Genomic Data Commons (http://gdc.cancer.gov) and Broad GDAC Firehose (http://gdac.broadinstitute.org) using the firehose_get download client prioritizing the BIFIREHOSE_Oncotated_Calls somatic mutation call sets compiled from various TCGA Genome Sequencing Centers’ bioinformatics pipelines. For comparisons of Cerebro to other call sets, mutations were required to fall within a common region of interest (ROI) set. The source of the primary mutation calling tools used for each cancer type may be found in the corresponding TCGA marker publications (http://cancergenome.nih.gov/publications). Individual shared and unique mutations between Cerebro and TCGA MC3 are displayed in table S14. Concordance analysis of somatic mutations from 66 oncogenes and tumor suppressor genes included manual review to determine shared status of mutations within the same or adjacent codons. Whole-exome melanoma (17) or NSCLC (19) immunotherapy data sets were obtained via National Center for Biotechnology Information (NCBI) database of Genotypes and Phenotypes (dbGaP) (accession numbers phs000980 and phs001041). We excluded one sample from (19) in our comparative analysis (patient ID: SB010944) because of conflicting information regarding TMB in the original publication.

Clinical study design and analysis The first comparison of NGS assays included 22 total samples, 17 FFPE samples, and 5 frozen tumor tissue specimens obtained from lung cancer patients from ILSBio/Bioreclamation that were analyzed for the presence of sequence mutations using three targeted cancer gene panels from independent vendors. The second comparison of clinical NGS assays used 15 matched tumor-normal samples from breast (n = 9), lung (n = 5), and colorectal (n = 1) cancers from ILSBio/Bioreclamation or Indivumed. All samples were obtained under Institutional Review Board–approved protocols with informed consent for research use at participating institutions. One set of patient samples was processed and analyzed for sequence mutations by Personal Genome Diagnostics using the CancerSELECT 125 panel. In brief, samples were reviewed by a pathologist to determine the percent tumor purity followed by macrodissection and DNA extraction. DNA was fragmented and used for CancerSELECT 125 library preparation and capture. Libraries were sequenced using HiSeq 2500 instruments (Illumina). Sequencing output was analyzed and mutations were identified using Cerebro, which was retrained to reflect the composition and coverage of the CancerSELECT 125 panel. In our first comparison of NGS assays, an identical set of FFPE and frozen tumor tissue specimens were sent to MolecularMD along with a hematoxylin and eosin–stained image for processing and NGS analysis using two cancer-specific panels supplied by two independent vendors: Oncomine Comprehensive Assay (ThermoFisher) and TruSeq Amplicon—Cancer Panel (Illumina). In the second comparison of NGS assays, an identical set of FFPE tumor tissue specimens and matched normal blood specimens were sent to Memorial Sloan Kettering Cancer Center along with a hematoxylin and eosin–stained image for processing and NGS analysis using the MSK-IMPACT panel. To limit the effects of tumor heterogeneity on analysis, slides were distributed in nonsequential order for testing. For orthogonal analysis, comparisons were limited to ROIs that were included in a comparison (table S15). Samples that failed quality check in one or more panels were excluded. A sequence mutation was considered a true positive (TP) if there was positivity in at least two panels and a false positive (FP) if only detected in one panel. Sequence mutations were considered true negatives (TNs) if there was negativity in at least two panels. A position with no mutation detected was considered a false negative (FN) in a panel if that position was concordantly positive in the other two panels. Genomic positions that were masked on the basis of known SNPs in the CancerSELECT 125 panel were considered FP in Oncomine and TruSeq analyses. Because there were >150 FPs detected with the TruSeq panel, discordant resolution was limited to FPs or FNs obtained by CancerSELECT 125 and Oncomine (not considered SNPs) and were resolved using ddPCR. Searches for TruSeq false positives in CancerSELECT 125 and Oncomine raw bam files used Samtools mpileup (97) with minimum Phred quality thresholds of 0 and 25, respectively.