This inspection of non-canonical splice sites annotated in plant genome sequences was performed to capture the diversity and to assess the validity of these annotations, because previous studies indicate that annotations of non-canonical splice sites are a mixture of artifacts and bona fide splice sites [29, 34, 51]. Our results update and expand previous systematic analyses of non-canonical splice sites in smaller data sets [29, 30, 33, 34]. An extended knowledge about non-canonical splice sites in plants could benefit gene predictions [30, 52], as novel genome sequences are often annotated by lifting an existing annotation.

Confirmation of bona fide splicing from minor non-canonical combinations

Our analyses supported a variety of different non-canonical splice sites matching previous reports of bona fide non-canonical splice sites [29, 30, 34, 51]. Frequencies of different minor non-canonical splice site combinations are not random and vary between different combinations. Those combinations similar to the canonical combination or the major non-canonical splice site combinations are more frequent. Furthermore, our RNA-Seq analyses demonstrate the actual use of non-canonical splice sites, revealing a huge variety of different transcripts derived from non-canonical splice sites, which may be evolutionarily significant. Although some non-canonical splice sites may be located in pseudogenes, the transcriptional activity and accurate splicing at most non-canonical splice sites indicates functional relevance e.g. by contributing to functional diversity as previously postulated [2, 25, 26]. These findings are consistent with published reports that have demonstrated functional RNAs generated from non-canonical splice sites [30, 53].

In general, the pattern of non-canonical splice sites is very similar between species with major non-canonical splice sites accounting for most cases of non-canonical splicing. While the average across plants of 98.7% GT-AG canonical splice sites is in agreement with recent reports for A. thaliana [30], it is slightly lower than 99.2% predicted for mammals [33] or 99.3% as previously reported for Arabidopsis based on cDNAs [54]. In contrast, the frequency of major non-canonical GC-AG splice sites in plants is almost twice the value reported for mammals [33]. Most importantly the proportion of 0.09% minor non-canonical splice site combinations in plants is substantially higher than the estimation of 0.02% initially reported for mammals [33]. Taking these findings together, both major and minor non-canonical splice sites could be a more significant phenomenon of splicing in plants than in animals. This hypothesis would be consistent with the notion that splicing in plants is a more complex and diverse process than that occurring in metazoan lineages [55,56,57]. An in-depth investigation of non-canonical splice sites in animals and fungi would be needed to validate this hypothesis.

Species-specific differences in minor non-canonical splice site combinations

As previous studies on non-canonical splice sites were often focused on one species [54] or a few model organisms [33, 34, 38], the observed variation among the plant genomes investigated here updates the current knowledge and revealed potential species-specific differences. However, small numbers of non-canonical splice sits in some species might prevent the detection of phylogenetic patterns in the genome-wide analysis. Nevertheless, conserved non-canonical splice site positions exist as presented on the gene level for At1g79350. Differences in the availability of hints in the gene prediction process and variation in the assembly quality might contribute to the observed differences in the number of non-canonical splice sites between closely related species.

The group of minor non-canonical splice sites displayed the largest variation between species, and a frequent non-canonical splice site combination (CA-GG) which appeared peculiar to O. sativa is probably due to an alignment error. In other words, the predicted CA-GG splice site combinations in rice can be conceived as major non-canonical GC-AG events by just splitting the transcript sequence in a different way during the alignment over the intron. An additional downstream G at the 3′ splice site seems to be responsible for leading to this annotation, because cases where GC-AG was correctly annotated do not display this G in the respective position. Dedicated alignment tools are needed to bioinformatically distinguish these events [58], otherwise manual inspection must be used to correctly resolve these situations.

Despite all artifacts described here and elsewhere [29, 33, 59], non-canonical splice sites seem to have conserved functions as indicated by conservation over long evolutionary periods displayed as presence in homologous sequences in multiple species [23, 29]. Our own analyses across multiple accessions of A. thaliana support this conjecture and suggest that some non-canonical splice sites are conserved in homologous loci at the intra-specific level. At the same time, there is intra-specific variability [30] that might be attributed to the accumulation of mutations prior to purifying selection. Assessing the variability within a species could be an additional approach to distinguish bona fide splice sites from artifacts or recent mutations.

Putative mechanisms for processing of minor non-canonical splice sites

We sought to understand possible correlations with minor non-canonical splice site combinations in order understand the mechanisms driving their occurrence. Therefore, we explored the impact of genomic position relative to centromeres, the effect of increased gene number, and the impact of intron length. The occurrence of non-canonical splice sites is reduced with proximity to the centromere, but this is likely due to reduced gene content in centromeric regions. Averaged across all species, there is a significantly higher proportion of non-canonical sites in single copy genes, but species-specific differences also violate this observation, suggesting that gene copy number is not an important determinant. However, non-canonical splice sites may be more important in splicing very long introns, because they appear in introns above 5 kb with a higher relative likelihood than canonical splice sites. Further investigations are needed to validate the observed lack of G in these splice site combinations and to identify an underlying pattern if it exists. When looking for an association of long introns with stress-related genes, no significant increase in their intron sizes was observed. However, it is still possible that these long introns belong to genes which were not previously described in relation to stress.

Previous studies postulated different non-spliceosomal removal mechanisms for such introns including the IRE1 / tRNA ligase system [60, 61] and short direct repeats leading to transcriptional slippage [62, 63]. It should be mentioned that many sequence variants of snRNAs are encoded in plant genomes [64]. The presence of multiple spliceosome types in addition to the canonical U2 and the non-canonical U12 spliceosome could be another explanation [38].

Another hypothesis suggests parasitic splice sites using neighbouring recognition sites for the splicing machinery to enable their processing [33]. The mere presence of GT close to the 5′ non-canonical splice site and AG close to the 3′ non-canonical splice site might be sufficient for this process to take place. These non-canonical splice sites are expected to be in frame with the associated GT-AG signals which could be responsible for recruiting the splicing machinery [33]. This hypothesis is supported by the observation that splice sites seem to be missed sometimes thus leading to the use of the next splice site which is usually in frame with the original one [54]. Further investigation might connect neighbouring sequences to the processing of minor non-canonical splice sites.

There is no evidence for RNA editing to modify splice sites yet, but previous studies found that modifications of mRNAs are necessary to enable proper splicing in some cases [65]. Even so such a system is probably not in place for all minor non-canonical splice sites, a modification of nucleotides in the transcript would be another way to regulate gene expression at the post-transcriptional level.

Although, these hypotheses could be an additional or alternative explanations for the situation observed in O. sativa, considering the CA-GG cases as annotation and alignment errors seems more likely due to their unique presence in this species.

Usage of non-canonical splice sites

Our results could provide a strong foundation to further analyses of the splicing process by providing detailed information about the frequency at which splicing occurred at a certain splice site. The results indicate that this usage of different splice site types could vary substantially. A possible explanation for these observed differences is the mixture of RNA-Seq data sets, which contains samples from various tissues and different environmental or physiological conditions. Sequencing reads reflect the splicing events occurring under these specific conditions. As previously indicated by several reports, non-canonical splice sites might be more frequently used under stress conditions [25, 51, 63]. As most plants are unable to escape environmental conditions by movement, a higher frequency of non-canonical splice sites in sessile plant species compared to other taxonomic groups should be assessed in the future to explore whether there may be a link between non-canonical splice frequency and life habit.

The observation of a stronger usage of the donor splice site over the acceptor splice site in GT-AG and GC-AG splice site combinations is matching previous reports where one donor splice site can be associated with multiple acceptor splice sites [54, 66]. The absence of this effect at minor non-canonical splice site combinations might hint towards a different splicing mechanism, which is restricted to precisely one combination of donor and acceptor splice site.

The observed usage of GT-TA and AA-TA splice site combinations under stress conditions in contrast to control conditions as well as the slight increase in the number of supported minor non-canonical splice site combinations requires further testing e.g. in other species or under different stress conditions. It would be interesting to validate the usage of different splice sites in response to stress and not just the expression of stress-related genes. In principle, it would be possible to assess the usage of splice sites under diverse environmental or developmental conditions as performed in this study for different plant species. While numerous RNA-Seq data sets are available per species, these analyses would require a large number of data sets generated under identical or at least similar conditions. Therefore, the identification of splicing variants dedicated to certain stress responses is beyond the scope of this work.

Limitations of the current analyses

Some constraints limit the power of the presented analyses. In accordance with the important plant database Araport11 [37] and previous analyses [30], only the transcript encoding the longest peptide sequence was considered per gene when investigating splice site conservation across species. Although the exclusion of alternative transcripts was necessary to compensate differences in the annotation quality, more non-canonical splice sites could be revealed by investigations of all transcript versions in the future. The exclusion of annotated introns shorter than 20 bp as well as the minimal intron length cutoff of 20 bp during the RNA-Seq read mapping prevented the investigation of very small introns. There are reports of experimentally validated introns with a minimal length of 56 bp [67]. Although recent reports indicate a minimal intron length around 30 bp in humans [68] or even down to 10 bp [51], it is unclear if very short sequences should be called introns. Since spliceosomal removal of these very short sequences via lariat formation seems unlikely, a new terminology might be needed. The applied length cutoff was selected to avoid previously reported issues with false positives [51]. However, de novo identification of very short introns as recently performed for Mus musculus and H. sapiens [51, 69] could become feasible as RNA-Seq data sets based on similar protocols become available for a broad range of plant species. Variations between RNA-Seq samples posed another challenge. Since there is a substantial amount of variation within species [70, 71], we can assume that small differences in the genetic background of the analyzed material could bias the results. Splice sites of interest might be canonical splice site combinations in some accessions or subspecies, respectively, while they are non-canonical in others. Despite our attempts to collect RNA-Seq samples derived from a broad range of different conditions and tissues for each species, data of many specific physiological states are missing for most species. Therefore, we cannot exclude that certain non-canonical splice sites were missed in our splice site usage analysis due to a lack of gene expression under the investigated conditions.

Future perspectives

As costs for RNA-Seq data generation drops over the years [72], improved analyses will become possible over time. Investigation of homologous non-canonical splice sites poses several difficulties, as the exonic sequence is not necessarily conserved. Due to upstream changes in the exon-intron structure [73], the number of the non-canonical introns can differ between species. However, a computationally feasible approach to investigate the phylogeny of all non-canonical splice sites would significantly enhance our knowledge e.g. about the emergence and loss of non-canonical splice sites. Experimental validation of splice sites in vivo and in vitro could be the next step. It is crucial for such analyses to avoid biases introduced by reverse transcription artifacts e.g. by comparing different enzymes and avoiding random hexameters during cDNA synthesis [74]. Splice sites could be experimentally validated e.g. by integration in the Aequoria vicotria GFP sequence [75] to see if they are functional in plants. Our analyses support the concept that differences between plant species need to be taken into account when performing such investigations [76, 77].