Starting with an expanded and improved compendium of sRNA transfection datasets, we identified 14 features that each correlate with target repression and add predictive value when incorporated into a quantitative model of miRNA targeting efficacy. This model performed better than previous models and at least as well as the best high-throughput CLIP approaches.

Because our model was trained on data derived from a single cell type, a potential concern was its generalizability to other cell types. Heightening this concern is the recent report of widespread dependency of miRNA-mediated repression on cellular context (Erhard et al., 2014). However, other work addressing this question shows that after accounting for the different cellular repertoires of expressed mRNAs, the target response is remarkably consistent between different cell types, with alternative usage of 3′-UTR isoforms being the predominant mechanism shaping cell-type-specific differences in miRNA targeting (Nam et al., 2014). Testing the model across diverse cell types confirmed its generalizability; it performed at least as well as the best high-throughput CLIP approaches in each of the contexts examined (Figure 6). Of course, this testing was restricted to only those predicted targets that were expressed in each cellular context. Likewise, to achieve this highest level of performance, any future use of our model or its predictions would also require filtering of the predictions to focus on only the miRNAs and mRNAs co-expressed in the cells of interest.

One of the more interesting features incorporated into the context++ model is SA (the predicted structural accessibility of the site). Freedom from occlusive mRNA structure has long been considered a site-efficacy determinant (Robins et al., 2005; Ameres et al., 2007; Kertesz et al., 2007; Long et al., 2007; Tafer et al., 2008) and proposed as the underlying mechanistic explanation for the utility of other features, including global 3′-UTR AU content (Robins and Press, 2005; Hausser et al., 2009), local AU content (Grimson et al., 2007; Nielsen et al., 2007), minimum distance of the site (Grimson et al., 2007), and 3′-UTR length (Hausser et al., 2009; Betel et al., 2010; Wen et al., 2011; Reczko et al., 2012). The challenge has been to predict and score site accessibility in a way that is informative after controlling for local AU content, which is important for speaking to the importance of less occlusive secondary structure as opposed to involvement of some AU-binding activity (Grimson et al., 2007). The selection of the SA feature in all 1000 bootstrap samples of all four site types showed that it provided discriminatory power apart from that provided by local AU content and other correlated features, which reinforced the idea that the occlusive RNA structure does indeed limit site efficacy. This being said, local AU content, minimum distance of the site, and 3′-UTR length were each also selected in nearly all 1000 bootstrap samples for most site types (Table 1), which suggests that either these features were selected for reasons other than their correlation with site accessibility or the definition and scoring of our SA feature has additional room for improvement.

Our ability to confidently identify additional features that each contribute to improved prediction of targeting efficacy was enhanced by our pre-processing of the experimental datasets, which minimized variation from biases unrelated to the sRNA sequence. Yet despite applying this same normalization procedure to our test set, the observed r2 value of 0.14 implied that our model explained only 14% of the variability observed among mRNAs with canonical 7–8 nt 3′-UTR sites (Figure 4B). The r2 value increased to 0.15 when considering the usage of alternative 3′-UTR isoforms, but 85% of the variability remained unexplained. Error in the microarray measurements, different sRNA transfection efficiencies, variable incorporation of sRNAs into the silencing complex, and secondary effects of introducing the sRNA presumably made major contributions to the unexplained variability. Nonetheless, imperfections of the context++ model also contributed, raising the question of how much the model might be improved by identifying additional features or developing better methods for scoring and combining existing features. In analyses not described, we evaluated the utility of other types of regression (e.g., linear regression models with interaction terms, lasso/elastic net-regularized regression, multivariate adaptive regression splines, random forest, boosted regression trees, and iterative Bayesian model averaging) and found their performance to be comparable to that of stepwise regression but their resulting models to be considerably more complex and thus less interpretable.

One way to evaluate the extent to which the context++ model might be improved is to consider the degree to which its performance depends on the site-conservation feature. Because sites under selective pressure preferentially possess molecular features required for efficacy, inclusion of the site-conservation feature indirectly recovers some of the information that would otherwise be lost when informative molecular features are missing or imperfectly scored. As more informative molecular features are identified and included in a model, less information remains to be captured, and thus the site-conservation feature cannot contribute as much to the performance of the model. The site-conservation feature (P CT ) was chosen in all 1000 bootstrap samples of each of the three major site types, which showed that the molecular features of our model still do not fully capture all the determinants under selective pressure. However, P CT was not one of the most informative features (Figure 4C). Moreover, when tested as in Figure 5B, a model trained on only site type and the other 13 molecular features performed nearly as well as the full context++ model (r2 of 0.126, compared to 0.139 for the full model). This drop in r2 of only 0.013 was substantially less than the 0.044 r2 observed for the site-conservation feature on its own (Figure 5B, TargetScan.P CT ), which suggested that when predicting the response of the test-set mRNAs with the major canonical site types, the context++ model captured 70% (calculated as [0.044–0.013]/0.044) of the information potentially imparted by molecular features.

The relatively minor contribution of site conservation highlights the ability of the context++ model to predict the efficacy of nonconserved sites. Although, everything else being equal, its score for a conserved site is slightly better than that for a nonconserved site, this difference does not prevent inclusion of nonconserved sites from the top predictions. Its general applicability to all canonical sites is useful for evaluating not only nonconserved sites to conserved miRNAs but also all sites for nonconserved miRNAs (e.g., Figure 6K,L), including viral miRNAs, as well as the off-targets of synthetic siRNAs and shRNAs.

Our analyses show that recent computational and experimental approaches, including the different types of CLIP, all fail to identify non-canonical targets that are repressed more than control transcripts (Figures 1, 5C,F), which reopens the question of whether more than a miniscule fraction of miRNA-mediated repression is mediated through non-canonical sites. Although CLIP approaches can identify non-canonical sites that bind the miRNA with some degree of specificity (Figure 2), these non-canonical binding sites do not function to mediate detectable repression. Thus far, the only functional non-canonical sites that can be predicted are 3′-compensatory sites, cleavage sites, and centered sites, which together comprise only a very small fraction (<1%) of the functional sites that can be predicted with comparable accuracy (Bartel, 2009; Shin et al., 2010). The failure of computational methods to find many functional non-canonical sites cannot rule out the possibility that many of these sites might still exist; if such sites are recognized through unimagined determinants, computational efforts might have missed them. CLIP approaches, on the other hand, provide information that is independent of proposed pairing rules or other hypothesized recognition determinants. Therefore, our analyses of the CLIP results, which detected no residual repression after accounting for canonical interactions, provide the most compelling evidence to date on this issue. Unless there is a substantial technical bias in the CLIP approach (such as a large unanticipated disparity in the propensity of non-canonical interactions to crosslink), the inability of current CLIP approaches to identify non-canonical targets that are repressed more than control transcripts argues strongly against the existence of many functional non-canonical targets.

Why might the CLIP-identified non-canonical sites fail to mediate repression (Figure 1) despite binding the miRNA in vivo (Figure 2)? Perhaps these sites are ineffective because perfect seed pairing is required for repression. For example, perfect seed pairing might favor binding of a downstream effector, either directly by contributing to its binding site or indirectly through an Argonaute conformational change that favors its binding. However, this explanation is difficult to reconcile with the activity of 3′-compensatory and centered sites, which can mediate repression despite their lack of perfect seed pairing (Bartel, 2009; Shin et al., 2010), and the activity of Argonaute artificially tethered to an mRNA, which can mediate repression without any pairing to the miRNA (Pillai et al., 2004; Eulalio et al., 2008). Therefore, a more plausible explanation is that the CLIP-identified non-canonical sites bind the miRNA too transiently to mediate repression. This explanation for the inefficacy of the recently identified non-canonical sites in the 3′ UTRs resembles that previously proposed for the inefficacy of most canonical sites in ORFs: in both cases the ineffective sites bind to the miRNA very transiently—the canonical sites in ORFs dissociating quickly because of displacement by the ribosome (Grimson et al., 2007; Gu et al., 2009), and the CLIP-identified non-canonical sites in 3′ UTRs dissociating quickly because they lack both seed pairing and the extensive pairing outside the seed characteristic of effective non-canonical sites (3′-compensatory and centered sites) and thus have intrinsically fast dissociation rates.

The idea that newly identified non-canonical sites bind the miRNA too transiently to mediate repression raises the question of how CLIP could have identified so many of these sites in the first place; shouldn't crosslinking be a function of site occupancy, and shouldn't occupancy be a function of dissociation rates? The answers to these questions partially hinge on the realization that the transcriptome has many more non-canonical binding sites than canonical ones. The motifs identified in the non-canonical interactions have information contents as low as 5.6 bits, and thus are much more common in 3′ UTRs than canonical 6mer or 7mer sites (12 bits and 14 bits, respectively). This high abundance of the non-canonical binding sites would help offset the low occupancy of individual non-canonical sites, such that at any moment more than half of the bound miRNA might reside at non-canonical sites, yielding more non-canonical than canonical sites when using experimental approaches with such high specificity that they can identify a site with only a single read (Figure 2—figure supplement 1A).

Although the high abundance of non-canonical sites partly explains why CLIP identifies these sites in such high numbers, it cannot provide the complete answer. Some non-canonical sites in the CLASH and chimera datasets are supported by multiple reads, and all the dCLIP-identified non-canonical sites of the miR-155 study (Loeb et al., 2012) are supported by multiple reads. How could some CLIP clusters with ineffective, non-canonical sites have as much read support as some with effective, canonical sites? Our answer to this question rests on the recognition that cluster read density does not perfectly correspond to site occupancy (Friedersdorf and Keene, 2014), with the other key factors being mRNA expression levels and crosslinking efficiency. In principle, normalizing the CLIP tag numbers to the mRNA levels minimizes the first factor, preventing a low-occupancy site in a highly expressed mRNA from appearing as well supported as a high-occupancy site in a lowly expressed mRNA (Chi et al., 2009; Jaskiewicz et al., 2012). Accounting for differential crosslinking efficiencies is a far greater challenge. RNA–protein UV crosslinking is expected to be highly sensitive to the identity, geometry, and environment of the crosslinking constituents, leading to the possibility that the crosslinking efficiency of some sites is orders of magnitude greater than that of others. When considered together with the high abundance of non-canonical sites, variable crosslinking efficiency might explain why so many ineffective non-canonical sites are identified. Overlaying a wide distribution of crosslinking efficiencies onto the many thousands of ineffective, non-canonical sites could yield a substantial number of sites at the high-efficiency tail of the distribution for which the tag support matches that of effective canonical sites. Similar conclusions are drawn for other types of RNA-binding interactions when comparing CLIP results with binding results (Lambert et al., 2014).

Variable crosslinking efficiency also explains why many top predictions of the context++ model are missed by the CLIP methods, as indicated by the modest overlap in the CLIP identified targets and the top predictions (Figure 6). The crosslinking results are not only variable from site to site, which generates false negatives for perfectly functional sites, but they are also variable between biological replicates (Loeb et al., 2012), which imposes a challenge for assigning dCLIP clusters to a miRNA. Although this challenge is mitigated in the CLASH and chimera approaches, which provide unambiguous assignment of the miRNAs to the sites, the ligation step of these approaches occurs at low frequency and presumably introduces additional biases, as suggested by the different profile of non-canonical sites identified by the two approaches (Figure 2B and Figure 2—figure supplement 1A). For example, CLASH identifies non-canonical pairing to the 3′ region of miR-92 (Helwak et al., 2013), whereas the chimera approach identified non-canonical pairing to the 5′ region of this same miRNA (Figure 2C). Because of the false negatives and biases of the CLIP approaches, the context++ model, which has its own flaws, achieves an equal or better performance than the published CLIP studies.

Our observation that CLIP-identified non-canonical sites fail to mediate repression reasserts the primacy of canonical seed pairing for miRNA-mediated gene regulation. Compared to canonical sites, effective non-canonical sites (i.e., 3′-compensatory sites and centered sites) are rare because they require many more base pairs to the miRNA (Bartel, 2009; Shin et al., 2010) and thus together make up <1% of the effective target sites predicted to date. The requirement of so much additional pairing to make up for a single mismatch to the seed is proposed to arise from several sources. The advantage of propagating continuous pairing past miRNA nucleotide 8 (as occurs for centered sites) might be largely offset by the cost of an unfavorable conformational change (Bartel, 2009; Schirle et al., 2014). Likewise, the advantage of resuming pairing at the miRNA 3′ region (as occurs for 3′-compensatory sites) might be partially offset by either the relative disorder of these nucleotides (Bartel, 2009) or their unfavorable arrangement prior to seed pairing (Schirle et al., 2014). In contrast, the seed backbone is pre-organized to favor A-form pairing, with bases of nucleotides 2–5 accessible to nucleate pairing (Nakanishi et al., 2012; Schirle and MacRae, 2012). Moreover, perfect pairing propagated through miRNA nucleotide 7 creates the opportunity for favorable contacts to the minor groove of the seed:target duplex (Schirle et al., 2014).

Our overhaul of the TargetScan website integrated the output of the context++ model with the most current 3′-UTR-isoform data to provide any biologist with an interest in either a miRNA or a potential miRNA target convenient access to the predictions, with an option of downloading code or bulk output suitable for more global analyses. In our continuing efforts to improve the website, several additional functionalities will also soon be provided. To facilitate the exploration of co-targeting networks involving multiple miRNAs (Tsang et al., 2010; Hausser and Zavolan, 2014), we will provide the option of ranking predictions based on the simultaneous action of several independent miRNA families, to which relative weights (e.g., accounting for relative miRNA expression levels or differential miRNA activity in a cell type of interest) can be optionally assigned. To offer predictions for transcripts not already in the TargetScan database (e.g., novel 3′ UTRs or long non-coding RNAs, including circular RNAs), we will provide a mechanism to compute context++ scores interactively for a user-specified transcript. Likewise, to offer predictions for a novel sRNA sequence (e.g., off-target predictions for an siRNA), we will provide a mechanism to retrieve context++ scores interactively for a user-specified sRNA. To visualize the expression signature that results from perturbing a miRNA, we will provide a tool for the user to input mRNA/protein fold changes from high-throughput experiments and obtain a cumulative distribution plot showing the response of predicted targets relative to that of mRNAs without sites. Thus, with the current and future improvements to TargetScan, we hope to enhance the productivity of miRNA research and the understanding of this intriguing class of regulatory RNAs.