Coverage in the MIG

We compared coverage performance among ACE, four conventional WES platforms (SS, SSCR, NX, NG) and WGS using the DNA from NA12878. WES and ACE platforms were compared after normalizing to both 12 Gb of total sequence data and to 100× mean coverage depth in each platform’s respective target regions. At 100× mean-target coverage (ACE, WES) and 31.5× (100 Gb) WGS, the mean coverage depth observed in the MIG was: 102.7× (SS), 125.1× (SSCR), 208.8× (NX), 95.5× (NG), 138.0× (ACE), and 29.5× (WGS). The coverage efficiency observed within MIG genes across all platforms when normalized for 100× mean target coverage depth is shown in Fig. 2. The distribution of base-quality reads observed at different levels of coverage depths is shown, centered at a clinically relevant minimum coverage of ≥20× (vertical gray line). At ≥20×, ACE covered >99 % of bases in protein-coding regions and 93 % of bases in the non-coding regions, compared to 93-97 % of protein-coding and 50-73 % of non-coding bases covered across WES platforms. WGS covered 97 % and 95 % of all bases in coding and non-coding regions, respectively (Fig. 2). Notably, low coverage in non-coding regions of the genome is expected with SSCR, NX, and NG, which do not substantially include non-coding areas (for example, UTRs) in the target design.
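The breadth-of-coverage figures quoted above (percentage of bases at or above a depth threshold) can be illustrated with a minimal sketch. It assumes per-base depths are already available as a list of integers; the function names and the toy depths are hypothetical and are not taken from the study's pipeline.

```python
from typing import Dict, Iterable, Sequence


def fraction_covered(depths: Sequence[int], min_depth: int = 20) -> float:
    """Return the fraction of positions with depth >= min_depth."""
    if not depths:
        raise ValueError("depths must be non-empty")
    return sum(d >= min_depth for d in depths) / len(depths)


def cumulative_coverage(depths: Sequence[int],
                        thresholds: Iterable[int]) -> Dict[int, float]:
    """Return the fraction of positions covered at each depth threshold."""
    return {t: fraction_covered(depths, t) for t in thresholds}


# Toy per-base depths for a small region (illustrative values only).
toy_depths = [0, 5, 19, 20, 21, 40, 100, 150]
curve = cumulative_coverage(toy_depths, [1, 10, 20, 50])
```

Evaluating the curve at many thresholds gives a cumulative distribution of the kind plotted in Fig. 2, with the ≥20× point corresponding to the vertical gray line.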

Fig. 2 Coverage efficiency in the medically interpretable genome (MIG). Shown is the cumulative distribution of on-target sequence coverage obtained from sequencing NA12878 across multiple platforms: Personalis Accuracy and Content Enhanced (ACE) Clinical Exome, Agilent SureSelect Clinical Research Exome (SSCR), Agilent SureSelect Human All Exon v5 plus untranslated regions (UTR) (SS), Illumina’s Nextera Exome Enrichment (NX), NimbleGen SeqCap EZ Human Exome Library v3.0 (NG), and 31× whole-genome sequencing (WGS) using an Illumina PCR-free protocol. For clinical applications, we indicate ≥20× as the minimum coverage threshold required (gray line) among all coding (left) and non-coding (right) regions. For reference, insets show an expanded distribution of sequence coverage. ACE and conventional WES data are normalized to 100× mean target coverage.

We next examined the percentage of MIG genes ‘finished’ as the criterion for base coverage varied. Figure 3 shows the number of finished MIG genes observed in NA12878 with ≥90.0-100.0 % of constituent exonic bases covered at ≥20×. ACE achieved 100.0 % base coverage at ≥20× in approximately 90 % of the MIG genes. Conventional WES platforms (SS, SSCR, NX, NG) finished 30-65 % of genes at this level, whereas WGS finished 10 %. If the stringency for per-gene percent coverage is reduced to ≥90.0 % of exonic bases, 100 % of genes are finished at ≥20× with ACE; between 65 % and 90 % of genes are finished among WES; and 75 % of genes are finished with WGS. Conversely, we also examined the percentage of finished MIG genes as the coverage depth was varied in the range of ≥10-20× (Fig. 3, right). Generally, at lower minimum coverage levels (that is, 10×) ACE finished the most genes (100 %), followed by WGS (96 %), SSCR (81 %), SS (75 %), NX (70 %), and NG (51 %). Relative WES platform performance remained consistent as the coverage finishing threshold increased to ≥20×, with ACE continuing to cover a higher percentage of bases at higher depths. In contrast, WGS coverage performance decreased sharply as coverage stringency increased, finishing only 10 % of genes at ≥20×.
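The ‘finished’ criterion described above has two tunable parameters: the minimum depth and the minimum fraction of exonic bases that must reach it. A minimal sketch follows, assuming per-gene lists of exonic base depths; the gene names, depths, and function names are hypothetical.

```python
from typing import Dict, Sequence


def is_finished(depths: Sequence[int], min_depth: int = 20,
                min_fraction: float = 1.0) -> bool:
    """True if at least min_fraction of bases have depth >= min_depth."""
    covered = sum(d >= min_depth for d in depths)
    return covered / len(depths) >= min_fraction


def percent_finished(gene_depths: Dict[str, Sequence[int]],
                     min_depth: int = 20,
                     min_fraction: float = 1.0) -> float:
    """Percentage of genes that meet the finishing criterion."""
    finished = sum(is_finished(d, min_depth, min_fraction)
                   for d in gene_depths.values())
    return 100.0 * finished / len(gene_depths)


# Toy data: three hypothetical genes with ten exonic bases each.
toy_genes = {
    "GENE_A": [25] * 10,            # every base at 25x
    "GENE_B": [25] * 9 + [12],      # one base at 12x
    "GENE_C": [25] * 5 + [8] * 5,   # half the bases at 8x
}

strict = percent_finished(toy_genes, min_depth=20, min_fraction=1.0)
relaxed_fraction = percent_finished(toy_genes, min_depth=20, min_fraction=0.9)
relaxed_depth = percent_finished(toy_genes, min_depth=10, min_fraction=1.0)
```

Relaxing either parameter can only keep or increase the number of finished genes, which is the behaviour traced by the two panels of Fig. 3.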

Fig. 3 Relationship between the percentages of MIG exons ‘finished’ as the coverage stringency varies. The left graph shows the percentage of MIG exons (y-axis) with ≥90.0-100.0 % of bases covered at ≥20× depth (x-axis) among different platforms using data obtained on NA12878. The right graph shows the percentage of finished exons (y-axis) with 100.0 % base coverage as the local coverage depth varies ≥10-20× (x-axis). At higher coverage stringencies, ACE finishes more exons than other WGS or WES assays in regions defined as the entire exon (solid curves) or only the subset of coding-regions (circles). ACE and conventional WES data are normalized to 100× mean target coverage.

The relative breadth and depth of coverage across exons with varying GC composition was similar to the relative platform performance observed in the MIG set. ACE finished a larger percentage of MIG exons compared to other WES and WGS platforms (Fig. 4), finishing >90 % of exons regardless of the amount of GC content. Other platforms showed a decline in the number of finished exons as the percentage of GC increased, with some platforms (WGS, NG, NX) showing substantial reductions at >50 % GC content.
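Grouping exons by GC composition, as in the comparison above, can be sketched as follows. This is an illustrative example only: it assumes exon sequences are available as plain strings, ignores ambiguous bases such as N, and uses a hypothetical 10-percentage-point bin width that is not stated in the text.

```python
def gc_fraction(sequence: str) -> float:
    """Fraction of G and C among the unambiguous (A, C, G, T) bases."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return (seq.count("G") + seq.count("C")) / acgt


def gc_bin(fraction: float, width: int = 10) -> int:
    """Lower edge, in percent, of the GC bin containing this fraction."""
    return int(fraction * 100 // width) * width


# Toy exon sequences (illustrative only).
toy_exons = {"exon_1": "GGCCAT", "exon_2": "ATATGC", "exon_3": "GCGCNN"}
bins = {name: gc_bin(gc_fraction(seq)) for name, seq in toy_exons.items()}
```

Once each exon has a bin label, the finishing percentage can be computed per bin to produce a curve like the one in Fig. 4.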

Fig. 4 Relationship between GC content and the percentages of MIG exons ‘finished’ by platform. Regions with >30-80 % GC content (x-axis) represent 99 % of exons in the MIG. Finishing is determined by 100 % base coverage at ≥20×.

Analyses were repeated after re-normalizing WES and ACE data to 12 Gb of total sequence data (Additional file 4). Relative performance among platforms was consistent with the results reported above, which are based on data normalized to 100× mean coverage within each platform’s target region. For reference, a summary of platform parameters and sequencing statistics is shown in Additional file 5.
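The text does not describe how the normalization was carried out. One common approach, shown here purely as an assumption, is to randomly downsample reads by the ratio of the desired mean depth to the observed mean depth. The function names and the numbers below are hypothetical.

```python
def mean_depth(total_on_target_bases: float, target_size_bases: float) -> float:
    """Mean coverage depth: on-target sequenced bases divided by target size."""
    if target_size_bases <= 0:
        raise ValueError("target_size_bases must be positive")
    return total_on_target_bases / target_size_bases


def downsample_fraction(observed_mean_depth: float,
                        target_mean_depth: float = 100.0) -> float:
    """Fraction of reads to keep so the expected mean depth hits the target.

    Capped at 1.0 because downsampling cannot add reads.
    """
    if observed_mean_depth <= 0:
        raise ValueError("observed_mean_depth must be positive")
    return min(1.0, target_mean_depth / observed_mean_depth)


# Toy example: 6 Gb of on-target bases over a hypothetical 50 Mb target.
observed = mean_depth(6.0e9, 50.0e6)       # 120x
keep = downsample_fraction(observed)       # keep about 83 % of reads
```

Normalizing to a fixed total yield (for example 12 Gb) would instead use the ratio of the desired yield to the observed yield, independent of target size.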

Coverage performance in the ACMG genes and known disease-associated variants

Included within the MIG gene set are 56 genes that per ACMG guidelines [34] are recommended for examination and reporting of secondary findings during clinical genomic testing. Although concerns over the accuracy of sequencing platforms in clinically relevant regions of the genome have been widely discussed [8, 35], the lack of sensitivity of WES and WGS to known variants occurring in genes of the ACMG secondary findings list has highlighted the extent of these inaccuracies [36, 37]. The coverage of these genes and their constituent variants by these platforms illustrates how variations in design can impact clinical decision making, presuming that a lack of sensitivity to variants within these genes: (1) affects the reporting of secondary findings; and (2) is representative of other pathogenic variants not specifically assessed in this study.

Using WES and ACE data normalized to 100× coverage depth, the per-gene mean coverage observed among the 56 genes was in the range of 41-371× for WES, 24-36× for WGS, and 92-234× for ACE (Additional file 6). Ten (18 %) of the 56 genes failed to reach our predefined level of coverage (100 % bases covered at ≥20×) in any of the conventional WES platforms (SS, SSCR, NG, NX). Among these genes, eight had some proportion of their exonic bases covered at a higher depth (that is, covered at ≥20×) with ACE (MEN1, RB1, TGFBR1, PKP2, KCNQ1, KCNH2, PCSK9, RYR1) and two showed improved coverage with WGS (MEN1, TGFBR1). Exome-based platforms (WES, ACE) generally showed substantially improved breadth and depth of coverage compared to 31× WGS for these 56 genes. Fifty-four genes had some proportion of their constituent bases inadequately covered (<20×) with 31× WGS. Of these, 53 genes had a larger fraction of exonic bases covered at ≥20× using ACE and 52 had a larger fraction covered with at least one of the conventional WES platforms (SS, SSCR, NX, NG). Two genes with some proportion of their exonic bases inadequately covered (<20×) with ACE had these bases covered to ≥20× by NX (PMS2) or WGS (MEN1). The individual platform rankings, based on the number of genes with 100 % base coverage at ≥20×, were ACE (51 genes) > SSCR (39 genes) > NX (36 genes) > SS (15 genes) > NG (12 genes) > WGS (2 genes) (Additional file 6).

Several regions inadequately covered by WES platforms encompass disease-associated variants. Using 12,535 documented disease-associated SNVs (daSNV) in HGMD (version 2013_01) for the 56 ACMG genes as a ‘truth’ set, we extended our analysis to examine the fraction of daSNV loci covered at ≥10-25× with WES, ACE, and WGS platforms. Figure 5 shows the percentage of daSNVs covered at ≥20×, with more extensive tabular results (≥10×, ≥15×, ≥20×, ≥25×) reported in Additional file 7. For brevity, only the highest base coverage achieved (Max) across all WES platforms (SS, SSCR, NX, NG) is shown. Depending on the platform used, 0.8-9.6 % (96–1,200 loci) of the daSNVs showed inadequate coverage (<20×) with conventional WES compared to 6.0 % (756 loci) for WGS and 0.2 % (26 loci) for ACE. Coverage shortfalls were spread across 41 genes, with 2,134 (17 %) daSNVs showing <20× coverage in at least one platform (WES, ACE, or WGS) (Additional file 8). Among these loci, the platforms with the highest to lowest number of loci with adequate coverage depth (≥20×) were: ACE (1,836 daSNVs), SSCR (1,727), NX (1,653), SS (1,435), NG (1,100), and WGS (968).
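The per-gene summary described above, including the "Max over all WES" value, can be sketched as follows. The example assumes that the depth at each daSNV locus has already been extracted for every platform; the depths and function names are hypothetical.

```python
from typing import Dict, Sequence


def percent_loci_covered(locus_depths: Sequence[int],
                         min_depth: int = 20) -> float:
    """Percentage of variant loci with depth >= min_depth."""
    covered = sum(d >= min_depth for d in locus_depths)
    return 100.0 * covered / len(locus_depths)


def max_over_platforms(depths_by_platform: Dict[str, Sequence[int]],
                       min_depth: int = 20) -> float:
    """Highest percentage of covered loci across the given platforms."""
    return max(percent_loci_covered(depths, min_depth)
               for depths in depths_by_platform.values())


# Toy depths at four daSNV loci of one hypothetical gene.
toy_wes = {
    "SS":   [30, 18, 22, 5],
    "SSCR": [30, 25, 22, 19],
    "NX":   [10, 10, 40, 40],
}
best_wes = max_over_platforms(toy_wes, min_depth=20)
```

Note that the maximum is taken over per-platform percentages, so it reports the best single platform for that gene rather than the combined coverage of all platforms.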

Fig. 5 Disease-associated variants covered at ≥20× for 56 genes in the ACMG gene list. The x-axis labels indicate the total number of disease-associated SNVs (daSNVs) drawn from HGMD for each ACMG gene; and the y-axis indicates the percentage of those variants covered at ≥20×. For brevity, only the highest obtained percentage (Max over all WES) observed across all conventional WES (SS, SSCR, NX, NG) platforms is shown. Seventeen of the 56 genes failed to have some fraction of their daSNVs covered at ≥20× among any of the conventional WES platforms. On a gene basis, the platforms with the highest to lowest number of genes with constituent daSNVs adequately covered included ACE (51 genes with 100 % daSNVs covered at ≥20×), SSCR (39 genes), NX (36 genes), SS (15 genes), NG (12 genes), and WGS (2 genes). The y-axis is truncated at 95 %, with truncated points labelled accordingly.

Relative gene and daSNV coverage performance between platforms and the differences observed between platforms were consistent regardless of the normalization scheme used (total sequence data or mean coverage) for exome-based data. For reference, results using each method are reported alongside each other in additional materials (Additional files 6, 7, and 8).

Accuracy and characteristics of detected variants

Inadequate coverage, together with errors occurring in downstream alignment and variant calling, reduces the ability to accurately identify and characterize variants. Since ACE extends coverage of conventional WES to include all medically interpretable regions of the genome and targets genomic areas that are challenging to sequence, we quantified its impact on the accuracy of variant calls in: (1) the MIG; (2) genomic regions that are overlapping among exome-based (that is, ACE, WES) platforms (Common Target File); (3) functionally impactful genomic regions targeted among any exome-based platforms (Union Target File); and (4) areas of high GC content. The Common Target File allowed us to evaluate relative variant sensitivity without regard to platform-specific target design. Differences among platforms would presumably be based on variations in depth of coverage and coverage efficiency rather than due to the selective exclusion of some regions by specific capture kits (for example, the exclusion of UTRs by SSCR, NX, NG). In contrast, the Union Target File allowed us to evaluate how differences in each platform’s target region (for example, differences in targeted non-coding and coding regions) impacted accuracy among variants with putative functional impact. Loci within platform-specific target files were annotated with information about genomic location (for example, intron, exon, intergenic, intragenic, coding region) and predicted deleterious impact (for example, low, moderate, high, modifier/other) [38]. Regions containing loci within high (frame-shift, stop-gain, splice-site acceptor, splice-site donor, start lost, stop lost) and moderate (non-synonymous coding, codon change plus deletion/insertion, codon deletion/insertion) impact regions were combined into the Union Target File. Non-synonymous coding mutations contributed most (99 %) to the moderate-impact class in the Union Target File, whereas 60 % of high-impact variants were splice-site donor/acceptor loci, followed by frame-shift mutations (20 %), stop-gain (12 %), and start/stop-lost (8 %).
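The set operations behind a common (intersection) and a union target region can be sketched with plain interval arithmetic. This is an illustration under stated assumptions: intervals are half-open (start, end) pairs on a single chromosome, the platform targets are toy values, and the impact-based filtering applied to the actual Union Target File is not modelled.

```python
from functools import reduce
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open: start inclusive, end exclusive


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Regions present in both interval lists (two-pointer sweep)."""
    a, b = merge(a), merge(b)
    i = j = 0
    out: List[Interval] = []
    while i < len(a) and j < len(b):
        start, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


# Toy target regions for three hypothetical platforms.
targets = {
    "P1": [(100, 200), (300, 400)],
    "P2": [(150, 250), (350, 450)],
    "P3": [(120, 180), (390, 500)],
}
common = reduce(intersect, targets.values())
union = merge([iv for ivs in targets.values() for iv in ivs])
```

In practice the same operations are usually run per chromosome on BED files with a dedicated interval tool; the pure-Python version is shown only to make the logic explicit.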

For each platform, error rates and accuracy are presented in terms of the interval tested, which consists of high-confidence variant loci within the MIG (Table 1, left); Common Target File (Table 1, middle); and Union Target File (Table 1, right) or a less-restrictive set of loci within subsets of GC-rich regions (Table 2). For reference, the set of genomic regions comprising the Common Target File and Union Target File and a catalogue of all 792,245 exonic regions with >70 % GC content among 20,000 genes are provided (Additional files 9, 10, and 11). Information about resources used in constructing reference and target regions is included in Additional file 12.

Table 1 Accuracy across target regions. Errors, Sensitivity, and FDR for the ACE, WGS, SSCR, SS, NX, and NG platforms based on evaluation of observed variant calls using data normalized to 100× mean coverage (conventional WES and ACE) or 31× WGS. Calculations are based on position and genotype matching to the GIBv2.18 high-confidence call-set within the MIG (left), a target region common to all ACE and WES platforms (middle, Common Target File), and a target region aggregated across all ACE and WES specific target files that contain moderate-impact and high-impact loci (right, Union Target File).

Table 2 Accuracy in GC-rich regions. Errors, Sensitivity, and FDR for the ACE, WGS, SSCR, SS, NX, and NG platforms based on evaluation of observed variant calls using data normalized to 100× mean coverage (conventional WES and ACE) or 31× WGS. Calculations are based on position and genotype matching to the GIBv2.18 less restrictive call-set within the MIG (left), a target region common to all ACE and WES platforms (middle, Common Target File), and a target region aggregated across all ACE and WES specific target files that contain moderate-impact and high-impact loci (right, Union Target File).

Using WES and ACE data normalized to 100× mean coverage depth, sensitivities across intervals ranged from 88-99 % for SNVs and 75-100 % for InDels. ACE yielded the highest sensitivities (>97.5 % SNVs; >92.5 % InDels) relative to other platforms across all intervals (Table 1). Based on sensitivities to SNVs and InDels, the relative rank of platform performance in the MIG and Common Target File were similar: ACE > SS > SSCR > WGS > NX > NG; whereas the relative rank of platform performance in the Union Target File was ACE > WGS > SS > SSCR > NG > NX. FDRs for SNVs were low across all platforms (<1 %) regardless of the interval used. For InDels, the FDR was generally highest among NG and NX across intervals. The use of the VQSLOD score for InDels, as is sometimes recommended given the larger amount of data available from WGS [24], had no effect on InDel-specific errors. Regardless of the interval used, observed differences in SNV sensitivities were small across platforms. ACE showed significantly (P <0.01) improved sensitivity for SNVs compared to NX and NG and in some cases WGS (MIG: ACE vs. WGS χ² = 16.1, P <0.01; ACE vs. NX χ² = 61.9, P <0.01; ACE vs. NG χ² = 102.7, P <0.01; Common Target File: ACE vs. WGS χ² = 13.9, P <0.01; ACE vs. NX χ² = 44.5, P <0.01; ACE vs. NG χ² = 135.3, P <0.01; Union Target File: ACE vs. WGS χ² = 0.1, P = 0.72; ACE vs. NX χ² = 518.6, P <0.01; ACE vs. NG χ² = 232.9, P <0.01); whereas no statistically significant improvement in SNV sensitivity was observed with ACE compared to SS or SSCR.
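The quantities reported above can be illustrated with a short sketch. Sensitivity and FDR follow their standard definitions. The text does not state which form of the chi-square test was applied, so the example assumes a Pearson chi-square on a 2×2 table (detected versus missed, per platform) with one degree of freedom and no continuity correction. All counts below are invented for illustration.

```python
import math
from typing import Tuple


def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of truth-set variants that were called: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def false_discovery_rate(true_pos: int, false_pos: int) -> float:
    """Fraction of calls that are wrong: FP / (TP + FP)."""
    return false_pos / (true_pos + false_pos)


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square for the table [[a, b], [c, d]].

    Returns (statistic, p_value). With 1 degree of freedom the upper-tail
    probability equals erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    stat = n * (a * d - b * c) ** 2 / denom
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Toy comparison: platform A detects 990 of 1,000 truth variants,
# platform B detects 950 of 1,000.
sens_a = sensitivity(990, 10)
sens_b = sensitivity(950, 50)
stat, p_value = chi_square_2x2(990, 10, 950, 50)
```

Because both platforms are evaluated against the same truth loci, a paired test such as McNemar's could also be argued for; the unpaired form is used here only because it is the simplest reading of the reported statistics.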

Increased breadth or depth of coverage is only asymptotically related to a higher capture efficiency, partly due to biases that occur with high-GC content [26]. These highly variable regions produce ‘gaps’ with levels of coverage insufficient for resolving disease-causing variants [39]. Given the improved coverage characteristics of ACE in high GC content areas (Fig. 4), we examined its impact on accuracy in GC-rich regions. In the subset of the MIG and Common Target File containing >70 % GC content, ACE generally outperformed other platforms (Table 2) based on sensitivities to SNVs (97.0 %) and InDels (>94.7 %). With the exception of NG and NX, however, the differences were small across platforms and were within the expected range of sampling error (95 % CI). In the Union Target File, WGS had the highest sensitivity (96.8 % SNVs; 95.0 % InDels), with ACE and SS sensitivities equal (94.9 % SNVs; 92.5 % InDels) in these GC-rich regions. Substantially reduced sensitivities (60-65 % SNVs; 48-58 % InDels) were observed with NG across all intervals. This was consistent with the steep reductions in coverage performance observed with NG among regions with GC fractions >50 % (Fig. 4).
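Whether two sensitivities differ by more than sampling error can be judged by comparing their 95 % confidence intervals. The text does not say how the intervals were computed; the sketch below assumes a Wilson score interval, which behaves well for proportions close to 100 %. The counts are invented for illustration.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, total: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for 95 %)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total
                         + z * z / (4.0 * total * total)) / denom
    return centre - half, centre + half


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Toy counts: variants detected out of 100 truth variants in GC-rich regions.
ci_high = wilson_interval(97, 100)
ci_mid = wilson_interval(95, 100)
ci_low = wilson_interval(60, 100)
```

With small numbers of variants in GC-rich regions the intervals are wide, so modest differences in sensitivity overlap while large ones, as in the toy low-sensitivity case, do not.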