We called variants using the GATK Unified Genotyper following the best practice recommendations for exomes and then compared calls from original and binned quality scores. Both approaches for binning — pre-binning, and pre-binning plus post-quality recalibration binning — showed similar levels of concordance to non-binned quality scores: 99.81 and 99.78, respectively. Since the additional binning after recalibration provides a smaller prepared BAM file for storage and has a similar impact to pre-binning only, we used this for additional analysis of discordant variants.

The table below shows the discordant differences between the 40 quality score resolution and binned, 8 quality score resolution BAMs. 40 quality discordant variants are those called with full quality score resolution but not called, or called differently, after binning to 8 quality score resolution. Conversely, the 8-quality discordants are those called uniquely after quality binning:

Overall genotype concordance 99.78 concordant: total 117887 concordant: SNPs 109144 concordant: indels 8743 40-quality discordant: total 821 40-quality discordant: SNPs 759 40-quality discordant: indels 62 8-quality discordant: total 1289 8-quality discordant: SNPs 1240 8-quality discordant: indels 49 het/hom discordant 259

We investigated the discordant variants further since 1.5% of the total variant calls change as a result of binning, Of the 1851 unique discordant variants, approximately half (928) fall into reproducible variants identified by looking at ensemble combinations of replicates. Of these potentially problematic discordant variants more than half are in low coverage regions with less than 10 reads:

The major influence of quality score binning is resolution of variants in low coverage regions. This manifests as differences in heterozygote and homozygote calling, indel representation and filtering differences related to quality and mappability. To assess the potential impact, we looked at the loss in callable bases on a 30x whole genome sequence when moving from a minimum of 5 reads to a minimum of 10, using GATK’s CallableLoci tool. Regions with read coverage of 5 to 9 make up 4.7 million genome positions, 0.17% of the total callable bases.

5 read minimum 10 read minimum Callable bases 2,775,871,235 2,771,109,000 Percent callable 96.90% 96.73% Low coverage 17,641,980 22,404,215 No coverage/ poor mapping 71,272,008 71,272,008

In conclusion, quality score binning provides a useful reduction in input file sizes with minimal impact on alignment. For variant calling, use additional caution in low coverage regions with less than 10 supporting reads. Given the rapid increases in read throughput that are driving the need for file size reduction, quality score binning is a worthwhile tradeoff for high-coverage recalling work.