In contrast to existing estimates of approximately 200 murine imprinted genes, recent work based on transcriptome sequencing uncovered parent-of-origin allelic effects at more than 1,300 loci in the developing brain and two adult brain regions, including hundreds present in only males or females. Our independent replication of the embryonic brain stage, where the majority of novel imprinted genes were discovered and the majority of previously known imprinted genes confirmed, resulted in only 12.9% concordance among the novel imprinted loci. Further analysis and pyrosequencing-based validation revealed that the vast majority of the novel reported imprinted loci are false-positives explained by technical and biological variation of the experimental approach. We show that allele-specific expression (ASE) measured with RNA–Seq is not accurately modeled with statistical methods that assume random independent sampling and that systematic error must be accounted for to enable accurate identification of imprinted expression. Application of a robust approach that accounts for these effects revealed 50 candidate genes where allelic bias was predicted to be parent-of-origin–dependent. However, 11 independent validation attempts through a range of allelic expression biases confirmed only 6 of these novel cases. The results emphasize the importance of independent validation and suggest that the number of imprinted genes is much closer to the initial estimates.

Typically both copies of mammalian genes are expressed, but in some cases, “imprinting” restricts expression to the maternal or paternal copy. Having two copies of each gene is considered advantageous since in enables compensation when one does not function properly. Why imprinting evolved and its utility to each sex is widely debated, and having a complete catalog of imprinted genes and their functions is essential for fully characterizing this phenomenon. 25 years of screening has revealed about 130 imprinted genes, and the slowing rate of discovery suggests that we are reaching saturation. Two recent studies based on high-throughput sequencing of RNA reported more than 1,300 imprinted genes. To understand the basis of this paradigm shift, we first attempted to reproduce these results. Unable to do so, we performed additional analyses that show that most of these discoveries are due to noise in the experimental approach and assay. We remedy this with new methods that account for this noise and applied them to identify 50 novel putative imprinted genes. These methods will be useful for identifying genuine novel cases of imprinted expression as this type of screening approach becomes broadly utilized.

To investigate the biological robustness of these novel imprinted loci, we repeated the embryonic brain screen. Despite faithful technical reproduction of the experimental design, library construction, sequencing, and analysis, we could not reproduce the majority of novel imprinted genes. In this study we demonstrate that biological variation in the approach and technical variation of the assay introduce considerably more noise than was appreciated previously. We develop methods to account for this variation and demonstrate their utility through reanalysis of the published data mentioned above as well as new embryonic brain data.

Transcriptome sequencing of F1 mouse hybrids provides an unbiased alternative for discovering imprinted transcription in wild-type animals [14] , [15] . The approach is based on detecting allelic expression with RNA sequencing reads that map over heterozygous SNPs, where the identity of the base is used to distinguish allelic origin and a reciprocal cross is used to discriminate parent-of-origin from strain-specific biases. The first two applications of this approach yielded a small number of novel imprinted transcripts each [14] , [15] . However, two recent studies used this approach to identify more than 1,300 imprinted loci, including 484 noncoding RNAs and 347 genes that were sex-specific [16] , [17] . These 1,300 loci are an aggregate of the discoveries from E15 brains, adult medial prefrontal cortex (PFC) and adult preoptic area (POA), and represent a ten-fold increase over previously known imprinted genes. The authors suggest that improved sensitivity from increased sequencing depth and improved resolution from sequencing the parents for de novo identification of SNPs enabled these advances.

Imprinting was initially characterized with genetics. Regions where uniparental disomy is not tolerated were mapped by intercrossing reciprocal translocations and comparing viability when both copies of a genomic segment were inherited from one parent to viability when inherited from the other [1] . Continued use of complementation tests in function-based screens provided the most dramatic examples of parent-of-origin effects on development, implicating imprinted transcription in neonatal behavior as well as pre-natal and post-natal growth [2] . Imprinting was then formally demonstrated using nuclear transfer experiments, where aberrant development was observed following transfer of two pronuclei from parents of the same sex [3] . Subsequently, the first individual imprinted transcripts were mapped [4] – [7] . After screening many translocations, genome-wide estimates stood at 100–200 imprinted transcripts [8] . Although based on what we now know was an overestimate for the total number of genes (60,000–100,000) and an underestimate of the number of known imprinted clusters [8] , [9] , the ∼20 year-old estimate has endured all screening methods applied, including those that do not depend on overt phenotypes. Some of these screens revealed that imprinting occurs outside of the growth axis and affects transcription unrelated to viability or growth [10] – [13] . By Dec 2008, genetic and molecular screening efforts combined yielded 128 confirmed imprinted genes in the mouse.

Why diploid organisms would forgo the safety-net of a redundant genome and preferentially express one allele in a parent-of-origin dependent manner has been a matter of debate since the discovery of imprinted transcription. Our understanding of this issue as well as the range of processes affected by imprinting is dependent on our catalog of the identity, function, and spatial-temporal specificity of imprinted genes.

Results

Novel imprinted loci in embryonic brain do not replicate To distinguish known from novel imprinted loci we compiled a catalog of genomic coordinates for all 128 known mouse imprinted genes that we were able to recover from the literature [13]–[15], [18]–[23] (accessible from GEO [24] under accession GSE27016). All 128 were published prior to the recent papers reporting many more imprinted loci [16], [17]. E15 brains yielded considerably more novel imprinted genes than either the adult PFC or POA (553 vs 153 and 256 respectively), providing the richest opportunity to test reproducibility [17]. Aside from an inexact match in developmental time points (E17.5 vs E15), our experiment was a faithful reproduction of the approach used by Gregg et al. We both used brains from reciprocally crossed C57Bl/6J (B) and CAST/EiJ (C) F1s (from here on BxC will be used to describe F1s derived from B mother and C father; CxB will denote the reciprocal). We both constructed sequencing libraries using the standard Illumina RNA-Seq protocol and sequenced them to 36 bp (single-end) on an Illumina platform. We both used Novoalign (www.novocraft.com) to map reads to UCSC mouse transcripts and noncoding RNAs from the functional RNA database [25]. We used the same set of SNPs [17] and the same criteria for identifying imprinted transcripts (i.e. containing at least one SNP with 10 or more reads with reciprocally biased expression, p<0.05; chi-square test). We observed 100% agreement on known imprinted gene calls in E15 brain, POA, and PFC [17], confirming that our analyses were consistent. We detected 38 and 42 known imprinted genes in E17.5 and E15 data respectively. 32/42 (76.2%) were detected in both samples (0.1 transcripts expected by chance; Figure 1). This was in sharp contrast to 51/396 (12.9%, 24 expected by chance) novel imprinted genes that confirmed in our screen (Figure 1). This discrepancy is not inconsistent with the experimental validation carried out on novel imprinted genes by Gregg et al.: included in these 51 replicating genes were 2/2 with no previous evidence of imprinting (Eif2c2 and DOKist) that were discovered and further validated in E15 brain [17]. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 1. Independent replication with E17.5 brains recapitulates 76.2% (32/42, 0.1 expected by chance) known and 12.9% (51/396, 24 expected by chance) novel imprinted genes reported previously Independent replication with E17.5 brains recapitulates 76.2% (32/42, 0.1 expected by chance) known and 12.9% (51/396, 24 expected by chance) novel imprinted genes reported previously [17] 12,171 (E15) and 10,418 (E17.5) genes were sufficiently powered for detection (≥1 SNP with ≥10 reads), of which 10,222 were in both samples. Reanalysis excluding SNPs detected in parental transcriptomes while relying exclusively on public SNPs marginally changes the outcome of imprint detection (numbers shown in brackets). https://doi.org/10.1371/journal.pgen.1002600.g001 To investigate the benefits of sequencing parents to identify heterozygous SNPs, we repeated this analysis using publicly available SNPs. Perlegen [26] used microarrays to resequence the CAST genome in 2007 and the Wellcome Trust Sanger Institute sequenced 17 mouse strains and released ∼19 million C57Bl/CAST SNPs in 2009 [27]. We converted SNP transcript coordinates published by Gregg et al. [17] to genomic coordinates (July 2007 NCBI 37/mm9) using coordinates of UCSC Known Genes [28]. 99.96% SNPs [17] mapped successfully to 136,532 unique positions. 88.9% of these were among the 19.6 million SNPs identified by Perlegen [26] and/or Sanger [27], of which 98.9% agreed on base identity. The transition∶transversion ratio of SNPs that agree with [26], [27] was 3.00, and 2.06 for the remaining 11% novel SNPs, suggesting that the novel SNPs reported by Gregg et al. [17] have a higher proportion of false-positives. False-positive SNPs cannot lead to an imprinting call since there would be no reads supporting the non-reference CAST allele. Reanalysis of the data using only the 88.9% SNPs that also exist in the public domain yielded nearly identical results (Figure 1; see numbers in parentheses) with no reduction in sensitivity for known imprinted genes and less than 3% sensitivity reduction in novel regions. This demonstrates that sequencing biological parents when SNPs are publicly available from sequenced parental strains provides little added benefit [26], [27].

Allele-specific expression (ASE) measured with RNA–Seq is exceedingly more noisy than accounted for by basic counting statistics Statistical modeling of allele-specific expression measured by transcriptome sequencing is an unresolved challenge [29], [30]. Gregg et al. [16], [17] used a chi-square metric that assumes no experimental biases are introduced during library construction, sequencing, genomic alignment, and that each sequencing read is independent of all other reads. Unfortunately these assumptions are often violated [30]–[34], and systematic error in quantifying allele-specific expression by RNA-Seq is just now becoming apparent [29], [35]–[37]. To begin to understand the underlying cause of inconsistent imprinting calls we investigated the accuracy of ASE quantification with RNA-Seq. It has previously been shown that ASE measured at the same SNP is highly reproducible across technical and biological replicates [17], [38]. However, this comparison is immune to systematic error such as priming, fragmentation, and PCR biases that arise during library construction, sequencing chemistry [36], and read alignment [31]. Since most sources of systematic error are sequence dependent, a more informative test would compare concordance in ASE within the same sample, but between independently sampled sites where the level of ASE is the same. We reasoned that SNPs within the same coding exon should satisfy this requirement since there are few known biological phenomena that could disrupt this relationship. Allele-specific premature transcriptional termination is possible, but has to our knowledge not been documented. Furthermore, premature termination codons are relatively infrequent [39] and those introduced by alternative splicing lead to immediate degradation by nonsense-mediated decay [40]. We did not use exons containing UTRs since transcription start sites vary significantly [41] and 3′UTRs are extensively processed [42]. We also restricted our analysis to RefSeq coding exons, which are based on experimentally validated transcripts to maximize the accuracy of same-exon SNP associations. We estimated the accuracy of ASE quantification as the frequency of SNP pairs within the same exon that agree on direction of bias. We considered all SNP pairs separated by more than 40 bp, which is the longest used sequencing read length [17], to ensure independent sampling. At a p-value threshold of 0.0001 (i.e. both p-values in each SNP pair are less than 0.0001) we observed 1,388 SNP pairs in our E17.5 BxC data, of which 278 (20.03%) disagreed on direction of bias (Figure 2A). If ASE measured with RNA-Seq was adequately modeled by random sampling, we would expect less than 0.1% to be discordant at this level of significance (Figure 2B and see Materials and Methods). We observed exceedingly higher levels of discordance than expected across all levels of significance (Figure 2B). Nonetheless, p-values computed from a basic chi-square test are predictive since higher thresholds result in lower rates of discordance and do not reach an asymptote (within extent of available data). This indicates that: 1) our test is valid and not confounded by biological differences in ASE between SNPs in the same coding exon, and 2) counting statistics can be a reliable approach for identifying ASE, although at face-value lead to a major over-estimate of significance. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 2. Direction of allele-specific expression bias measured at SNPs in the same exon disagrees more frequently than expected by chance. (A) Expression bias at 1,388 pairs of SNPs under ASE (chi-square p<1e-4 for both SNPs) disagrees in direction for 20.0% of pairs in E17.5 BxC data. (B) Discordance improves with increasing p-value thresholds but is higher than simulated data where allelic counts for one SNP in each pair are randomly sampled from the χ2 distribution (broken lines; see Materials and Methods). https://doi.org/10.1371/journal.pgen.1002600.g002

Discovery of novel imprinted genes To estimate the total number of imprinted genes we first asked how many are detected in the four available datasets (E15 brain, E17.5 brain, adult PFC and POA). Aggregating allele-specific reads across SNPs in the same gene improved our sensitivity in known imprinted regions (data not shown) and we thus applied this approach genome-wide. We also took advantage of all publicly available SNPs [17], [26], [27] and expanded our alignment reference to include the whole genome (see Materials and Methods). Using the mock/reciprocal approach to estimate false-discovery (Figure 5A), we proceeded with p = 1e-4 (FDR<0.05) as a threshold of significance for calling a gene imprinted (in addition to the standard reciprocal bias toward sex of parent and ≥10 reads in each cross). We selected p = 1e-4 as a threshold (e.g. as opposed to 1e-3) since candidates with scores between 1e-3 and 1e-4 did not validate by pyrosequencing (see below) and sensitivity is negligibly impacted (Figure 5A). We identified a total of 53 putative imprinted genes in at least one sample (Figure S2). 5/53 occurred in all 4 samples and 3 (Eif2c2, Cdh15, and DOKist4) were validated by Gregg et al [17]. We also detected 56 genes that were previously known to be imprinted (27 in all 4 samples). Of the putative novel genes, 4 are probable extensions of known imprinted genes based on EST or transcription evidence derived from this data (3 of the 5 putative novel imprinted genes which recur in all 4 samples), 7 others are associated with known imprinted clusters (within 1 Mb) and 42 are completely novel (Table S1). PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 5. Validating features predictive of genomic imprinting. (A) Estimated false-discovery rate as a function of p-value threshold. Allele-specific expression counts were summed over all SNPs within each gene. False-discovery was estimated as the number of significant genes in a mock cross (1 versus 3+2 vs 4; see Figure 3A) divided by the number of significant genes in a reciprocal cross (1 versus 4+2 versus 3). Sensitivity was computed as the number of known imprinted genes meeting significance in the reciprocal cross (1 versus 4+2 versus 3) as a fraction of imprinted genes powered for detection (at least one SNP with ≥10 reads in both animals). (B) Imprinting scores (left panel), allelic bias (middle panel), our imprinting call and pyrosequencing call (right panel) for 37 loci tested by pyrosequencing. Imprinting Scores are log 10 (p), where p is the less significant p-value from chi-square tests performed on the two reciprocal crosses; negatives arbitrarily represent paternal bias. Expression bias is the allelic bias measured in RNA-Seq data from the average of embryonic brain data (E15/E17.5). Imprinting call is based on an imprinting score threshold of 4 (p<1e-4) in E15 and/or E17.5 brains. Pyrosequencing validation was done on total RNA from E17.5 brains and calibrated on biological replicates (see Materials and Methods for details). Imprinted Loci reported by Gregg et al. represent 17 randomly selected SNPs reported imprinted in E15 [17]. https://doi.org/10.1371/journal.pgen.1002600.g005