There is broad agreement that genetic mutations occurring outside of the protein-coding regions play a key role in human disease. Despite this consensus, we are not yet capable of discerning which portions of non-coding sequence are important in the context of human disease. Here, we present Orion, an approach that detects regions of the non-coding genome that are depleted of variation, suggesting that the regions are intolerant of mutations and subject to purifying selection in the human lineage. We show that Orion is highly correlated with known intolerant regions as well as regions that harbor putatively pathogenic variation. This approach provides a mechanism to identify pathogenic variation in the human non-coding genome and will have immediate utility in the diagnostic interpretation of patient genomes and in large case control studies using whole-genome sequences.

Competing interests: David Goldstein is a founder of and holds equity in Pairnomix and Praxis, and has research supported by Janssen, Gilead, Biogen, AstraZeneca, and UCB. There are no patents, products in development, or marketed products to declare. This does not alter the authors’ adherence to all PLOS ONE policies on sharing data and materials.

Funding: This work was supported by the National Institute of Mental Health of the National Institutes of Health under Award Number 1U01 MH105670 and by the National Human Genome Research Institute of the National Institutes of Health under the Centers for Common Disease Genomics Award Number 1UM1HG00901. ABG was supported by the National Institute of Neurological Disorders and Stroke of the National Institutes of Health under Award Number F31NS092362. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The collection of samples and data was funded in part by: Biogen, Inc.; Bryan ADRC NIA P30AG028377; B57 SAIC-Fredrick Inc M11-074; National Institute of Neurological Disorders and Stroke (RC2NS070344; U01NS077303; U01NS053998); National Institute of Mental Health (RC2MH089915, K01MH098126, R01MH097971); National Human Genome Research Institute (U01HG007672); Center for HIV/AIDS Vaccine Immunology (CHAVI) (U19-AI067854); National Institute of Allergy and Infectious Diseases Center for HIV/AIDS Vaccine Immunology and Immunogen Discovery (UM1-AI100645); Bill and Melinda Gates Foundation; the Ellison Medical Foundation New Scholar award AG-NS-0441-08; and the Murdock Study Community Registry and Biorepository.

Data Availability: The code used in calculating the Orion scores and the Orion regions is provided on GitHub (https://github.com/igm-team/orion-public) under the MIT license. The datasets generated during the study are either included in this article or are available on the figshare.com repository (Orion scores: https://figshare.com/s/e92412d44c0657b70a86; Orion regions: https://figshare.com/s/a3ff8c0bed660ceb67b7; coordinates of defined Orion scores, non-repeat autosomal regions that were covered in our sample: https://figshare.com/s/bb660d6d86a45c6cef20).

Copyright: © 2017 Gussow et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

The rising prevalence of whole-genome sequencing (WGS) has led to an abundance of sequence data. The utility of WGS data in a clinical setting lies in the ability to prioritize [1] the mutations detected in patient cohorts in order to identify disease-causal mutations. We have previously introduced three population genetics-based methodologies [2–4] that can identify genomic regions in which variation is strongly selected against and that are thus more likely to be pathogenic when mutated.

However, all of these methodologies are directly tied to known protein-coding genes, leaving the entire non-coding genome, which is known to carry disease-causing mutations [5, 6], untouched. Though there are many existing methods that assess the non-coding genome, these methods tend to rely heavily on conservation or functional annotations. Both of these approaches have limitations. Conservation cannot directly assess regions that have been under selection recently in the human lineage, or that were under selection in the mammalian phylogeny but have lost their functionality in humans. Functional annotations can indicate the biochemical actions of a genomic region, but they cannot assess the region's likelihood of causing disease when mutated. In consequence, whole-genome sequence data is currently considered almost uninterpretable.

Here, we describe an approach termed Orion, which scans the entire genome for regions that are depleted of variation in the human population in comparison to expectation. Such depleted regions are considered intolerant. The Orion methodology quantifies the intolerance of a given stretch of sequence by estimating the difference between the observed and expected site-frequency spectra (SFS). We applied this methodology to a set (n = 1,662) of WGS samples as a sliding window across the genome, calculating a regional intolerance score for each window. Each window's score was then assigned to the base at the center of the window.

We assessed the Orion scores by evaluating how they behave in comparison with a number of genomic features, including protein-coding exons (known to be intolerant relative to the genome as a whole), ultra-conserved non-coding elements (UCNEs), and DNase hypersensitive sites (DHSs). We found enrichment for intolerant Orion scores in each of the regions corresponding to these features, indicating that Orion scores do capture signals of intolerance to variation. We then used the Orion scores to differentiate the human genome into regions that are and are not intolerant. Using these demarcations, we show that intolerant regions are enriched for previously reported de novo mutations in patients with presumed genetic diseases and for previously reported non-coding pathogenic variants.

Results

Developing the Orion approach

The underlying methodology for the Orion approach is based on the difference between the expected and observed site-frequency spectrum (SFS) of a given stretch of sequence. Here, the SFS is defined as a vector in which the ith element is equal to the number of variants in the assessed population sample that appear i times within the sample. Thus, the element of the SFS at i = 1 is equal to the number of variants that are singletons in the sample for a given window [7]. The element at i = 2 is equal to the number of variants that are doubletons, and so forth.

We used a WGS cohort that combined an internal cohort of unrelated controls (n = 624, S1 Table) with the unrelated parents of the Simons Foundation’s Simons Simplex Collection (n = 1,038) to calculate the Orion scores. For a region of interest, we calculated the observed SFS across this WGS control cohort (n = 1,662) and the expected SFS for the cohort under neutrality [7]. For the observed SFS, we filtered for genotype quality and coverage (Methods). The expected SFS is based on the cohort sample size, the region's mutation rate [8] and length, and the effective human population size (Methods). We then calculated the difference between the observed and expected SFS in order to generate a score.

As there are many ways to calculate the difference between two distributions, we used a forward-simulation framework [9] (Methods) to simulate different selection pressures on human populations. We then tested a number of score formulations and selected the one most correlated with selection pressures (S1 Text). Based on these evaluations we chose to use the weighted mean difference between points on the SFS, divided by the expected number of mutations introduced into the population per generation (θ). The weights for the weighted mean are derived from the inverse of the minor allele frequency (Methods), so that rare variants contribute more information to the final score. This is based on previous observations that the frequency of rare variation is highly indicative of intolerance [4]. In this formulation, expected is subtracted from observed. A higher score indicates a more intolerant region, while a lower score indicates a more tolerant one.

Note that the expected SFS in this formulation is calculated based on neutral theory, though in practice the assumptions of neutrality do not hold. As such, we do not use the absolute value of the deviation from neutrality to assess intolerance. Rather, we compare the magnitude of deviation from neutrality between regions. Throughout this article we therefore use the Orion scores in one of two ways: either by comparing the relative difference between sets of scores, or by detecting stretches of scores that empirically match known regions that are highly intolerant.
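The mechanics described above can be illustrated with a minimal Python sketch. It is not the published implementation: the function names are hypothetical, the expected SFS uses the textbook neutral expectation E[ξ_i] = θ/i with θ = 4·Ne·μ·L, and the weights are taken as the plain inverse allele frequency n/i on an unfolded spectrum. The exact weights, filtering, and parameter values are specified in Methods and are not reproduced here, so the numbers this sketch produces should not be compared with published Orion scores.

```python
from collections import Counter


def neutral_theta(effective_pop_size, mutation_rate_per_base, region_length):
    """Standard diploid population mutation parameter for a region: 4 * Ne * mu * L."""
    return 4 * effective_pop_size * mutation_rate_per_base * region_length


def observed_sfs(allele_counts, n_chromosomes):
    """Build the SFS: element i-1 is the number of variants seen exactly i times."""
    counts = Counter(allele_counts)
    return [counts.get(i, 0) for i in range(1, n_chromosomes)]


def expected_neutral_sfs(theta, n_chromosomes):
    """Textbook neutral expectation for the unfolded SFS: E[xi_i] = theta / i."""
    return [theta / i for i in range(1, n_chromosomes)]


def orion_like_score(obs, exp, theta, n_chromosomes):
    """Weighted mean of (observed - expected), divided by theta.

    ASSUMPTION: weights are the inverse allele frequency, n / i, so that
    rare variants count more. The published weights are defined in Methods.
    """
    weights = [n_chromosomes / i for i in range(1, n_chromosomes)]
    diffs = [o - e for o, e in zip(obs, exp)]
    weighted_mean = sum(w * d for w, d in zip(weights, diffs)) / sum(weights)
    return weighted_mean / theta


# Toy usage: five variants in a sample of 6 chromosomes (illustrative values only).
n = 6
theta = neutral_theta(10_000, 1.2e-8, 501)  # Ne and mu are placeholder values
obs = observed_sfs([1, 1, 2, 3, 1], n)      # three singletons, one doubleton, one tripleton
exp = expected_neutral_sfs(theta, n)
score = orion_like_score(obs, exp, theta, n)
```

Because the score is interpreted only relative to other regions, a sketch like this is useful mainly for seeing how singletons dominate the weighted mean, not for reading anything into the absolute value.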

Implementing Orion genome-wide

Encouraged by these results, we implemented a sliding window approach to generate genome-wide scores. We use an odd window size so that we can calculate the Orion score of the entire window and assign it to the middle base. We selected a window size of 501bp and applied this approach across all autosomal chromosomes. We excluded bases falling in repeat regions (Methods). The resulting scores are publicly available (https://doi.org/10.6084/m9.figshare.4541632.v1) and can be extracted and downloaded for a given region (www.genomic-orion.org).

We assessed the sliding-window Orion scores' behavior across the SCN1A gene. We selected 1000 random Orion scores from SCN1A's introns and exons. We found statistically significant enrichment of higher Orion scores in the exonic regions when compared with the intronic regions (Permuted Mann-Whitney U test P value: 0.001). Specifically, exons had a mean and median of -0.174 and -0.171 respectively, while introns had a mean and median of -0.325 and -0.306.
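The window-to-center assignment can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the released code: `sliding_window_scores` and `count_variants` are hypothetical names, coordinates are 0-based half-open, and the per-window scoring function is passed in as a callable so that any regional score could be plugged in.

```python
def sliding_window_scores(n_bases, window_size, score_window):
    """Score every full window and assign the result to the window's middle base.

    score_window(start, end) is any callable that scores bases [start, end).
    Bases closer than half a window to either edge get no score.
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so the window has a middle base")
    half = window_size // 2
    scores = {}
    for center in range(half, n_bases - half):
        scores[center] = score_window(center - half, center + half + 1)
    return scores


# Toy usage: score each window by the number of variant positions it contains.
variant_positions = {3, 4, 10}  # illustrative positions only


def count_variants(start, end):
    return sum(1 for p in variant_positions if start <= p < end)


toy_scores = sliding_window_scores(n_bases=12, window_size=5, score_window=count_variants)
```

With the 501bp window used in the text, `half` would be 250, so each base's score reflects the 250 bases on either side of it.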

Comparing Orion scores to key features of the human genome

Following this observation across a single gene, we sought to assess the Orion score genome-wide. For this assessment, we examined whether regions that are known to be intolerant are enriched for higher Orion scores. We tested for enrichment of intolerance in three region types: protein-coding exons; UCNEs; and DHS regions. To attain an empirically based control (presumed neutral) distribution, we selected 100,000 random Orion scores from across the genome that did not overlap with repeat regions as defined by RepeatMasker [11] (accessed November 2016) or with any of the three region types described above. Following this, for each regional annotation we randomly selected 100,000 Orion scores and assessed whether there is enrichment for higher Orion scores when compared to the control distribution (Table 2).


Table 2. Enrichment of higher Orion scores across regions. https://doi.org/10.1371/journal.pone.0181604.t002

We found that exons are clearly enriched for higher Orion scores over the control distribution. This finding is expected, given the selective pressure on the protein-coding region.

Next, we assessed the relationship between the Orion scores and non-coding regions of the genome that are ultra-conserved. We randomly collected 100,000 Orion scores falling in UCNEs, which are defined as non-coding regions greater than 200bp in length that are identical between human and chicken [12]. We found that the Orion scores falling in UCNEs significantly differ from the control distribution (Permuted Mann-Whitney U test P value: 0.001). Thus, there is clear enrichment of intolerant Orion scores in these regions. Strikingly, the UCNE scores' mean and median are greater than the exonic scores' mean and median.

Finally, we sought to assess the DHS regions. These regions of open chromatin are enriched for regulatory sequence [13]. For this assessment, we examined the intersection of DHS regions open in all cell types (Methods, S3 Data File). We hypothesized that these regions are likely enriched for regulatory elements associated with genes that are crucial for cell function and would therefore be highly intolerant. We found that these scores are indeed enriched for scores higher than the control distribution (Permuted Mann-Whitney U test P value: 0.001), and appear to have the most intolerant score population of the three regions assessed. Furthermore, this finding provides evidence that the Orion scores can indeed capture regulatory regions that are intolerant to variation.

As the Orion approach is solely based on variation in the human population, we sought to assess conservation in a similar framework and compare to our results. We used GERP++ [14] as our measure of conservation. We collected the GERP++ scores across the exact same coordinates we used in the Orion evaluations and tested whether the annotated regions were enriched for higher, more conserved, GERP++ scores (S4 Data File). We found that both exons and UCNEs are enriched for higher GERP++ scores (Table 3). These results were expected, given that the protein-coding genome tends to be well-conserved and UCNEs are defined by conservation.


Table 3. Enrichment of higher GERP++ scores across regions. https://doi.org/10.1371/journal.pone.0181604.t003

Strikingly, we found that the GERP++ scores were the lowest in the DHSs compared to the other assessed regions, while the Orion score values for the DHSs were the highest amongst all compared regions. This finding supports previous evidence [15] that these regulatory regions appear to be undergoing human lineage-specific purifying selection. Further, this indicates that the Orion score is well-positioned to detect such purifying selection. Overall, this set of analyses indicates that known intolerant regions are indeed enriched for higher Orion scores, providing evidence that the genome-wide Orion scores are capturing intolerance.
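The enrichment comparisons in this section rely on a permuted Mann-Whitney U test. A minimal, standard-library Python sketch of such a test is shown below. The function names, the one-sided direction (testing for higher scores in the first sample), the number of permutations, and the (hits + 1) / (permutations + 1) P value convention are all assumptions made for illustration; the section does not state which settings were actually used.

```python
import random


def mann_whitney_u(x, y):
    """U statistic for x: count of (x, y) pairs with x > y, ties counting 0.5."""
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i, n = 0, len(combined)
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of the 1-based ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    return rank_sum_x - len(x) * (len(x) + 1) / 2.0


def permuted_mwu_pvalue(x, y, n_permutations=999, seed=0):
    """One-sided permutation P value for 'x tends to be higher than y'."""
    rng = random.Random(seed)
    observed = mann_whitney_u(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        if mann_whitney_u(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Toy usage with made-up scores: the "region" sample sits above the "control" sample.
region_scores = [-0.10, -0.12, -0.08, -0.15, -0.11, -0.09, -0.13, -0.07]
control_scores = [-0.30, -0.28, -0.35, -0.31, -0.29, -0.33, -0.27, -0.32]
p_value = permuted_mwu_pvalue(region_scores, control_scores)
```

Under this convention the smallest attainable P value with 999 permutations is 0.001, which is one way a reported floor of 0.001 can arise; whether that matches the analysis in the text is not stated here.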

Defining the Orion regions

The Orion scores are regional scores, as they are constructed based on a window that includes the surrounding bases. As such, we view the Orion scores not as variant-level scores, but rather as measures that can be used in the detection of stretches of sequence that are intolerant. We aimed to detect such stretches of sequence and designate them as Orion regions. The interpretation of the Orion scores is not in their absolute values, but rather in their value relative to other Orion scores. We therefore sought to detect stretches of sequence that are empirically matched in their characteristics to known highly intolerant protein-coding exons.

We used model-controlled flooding [16] (MCF), a methodology to detect stretches of sequence that fit a particular set of criteria. We set these criteria to match the score population of the most intolerant exons (Methods). Thus, we defined Orion regions as stretches of 100 to 1000 base pairs with a minimum mean Orion score and minimum median Orion score of -0.08, and containing no Orion score less than -0.1. These parameters can be tuned by the user, depending on the type of regions that need to be detected. The code implementing the MCF is provided on GitHub (https://github.com/igm-team/orion-public).

Using these criteria, we generated a set of Orion regions. We then filtered these regions to remove any overlap with repeat regions (Methods). The Orion regions occupy a total of 4% of the non-repetitive autosomal genome (S5 Data File). Though the regions in this set are empirically matched through their Orion scores to the most intolerant protein-coding exons, 91% of the sequence within the Orion regions does not fall within CCDS. Therefore, these regions denote a portion of the non-coding genome that is highly intolerant.
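The region criteria stated above can be expressed as a simple check on a candidate stretch of scores. The Python sketch below only tests whether a given stretch satisfies the thresholds; it does not implement model-controlled flooding itself, which is available in the GitHub repository. The function name and keyword parameters are hypothetical, with defaults taken from the values given in the text.

```python
from statistics import mean, median


def meets_orion_region_criteria(scores, min_length=100, max_length=1000,
                                min_mean=-0.08, min_median=-0.08, score_floor=-0.1):
    """Return True if a stretch of per-base Orion scores satisfies the stated criteria.

    scores holds one Orion score per base of a contiguous candidate stretch.
    """
    if not (min_length <= len(scores) <= max_length):
        return False
    if min(scores) < score_floor:  # no single score may fall below the floor
        return False
    return mean(scores) >= min_mean and median(scores) >= min_median


# Toy usage with made-up scores.
candidate = [-0.05] * 150
is_region = meets_orion_region_criteria(candidate)
```

Exposing the thresholds as keyword arguments mirrors the statement that these parameters can be tuned depending on the type of regions to be detected.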