The actual analysis of RNA-seq data has as many variations as there are applications of the technology. In this section, we address all of the major analysis steps for a typical RNA-seq experiment, which involve quality control, read alignment with and without a reference genome, obtaining metrics for gene and transcript expression, and approaches for detecting differential gene expression. We also discuss analysis options for applications of RNA-seq involving alternative splicing, fusion transcripts and small RNA expression. Finally, we review useful packages for data visualization.

Quality-control checkpoints

The acquisition of RNA-seq data consists of several steps — obtaining raw reads, read alignment and quantification. At each of these steps, specific checks should be applied to monitor the quality of the data (Fig. 1a).

Raw reads

Quality control for the raw reads involves the analysis of sequence quality, GC content, the presence of adaptors, overrepresented k-mers and duplicated reads in order to detect sequencing errors, PCR artifacts or contamination. Acceptable duplication, k-mer or GC content levels are experiment- and organism-specific, but these values should be homogeneous for samples in the same experiment. We recommend that outliers with over 30 % disagreement be discarded. FastQC [11] is a popular tool to perform these analyses on Illumina reads, whereas NGSQC [12] can be applied to any platform. As a general rule, read quality decreases towards the 3’ end of reads, and if it becomes too low, bases should be removed to improve mappability. Software tools such as the FASTX-Toolkit [13] and Trimmomatic [14] can be used to discard low-quality reads, trim adaptor sequences, and eliminate poor-quality bases.
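For illustration, the sliding-window idea behind quality trimming can be sketched in a few lines of Python; the window size and quality threshold below are arbitrary choices for the example, not the defaults or the implementation of any particular tool.

```python
# Minimal sketch of 3' sliding-window quality trimming; not the implementation
# used by Trimmomatic or the FASTX-Toolkit.
def sliding_window_trim(seq, quals, window=4, min_mean_q=15):
    """Truncate the read at the first window whose mean Phred quality
    drops below the threshold."""
    for start in range(0, len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < min_mean_q:
            return seq[:start], quals[:start]
    return seq, quals

seq = "ACGTACGTACGTACGT"
quals = [38, 38, 37, 36, 35, 34, 30, 28, 25, 20, 12, 8, 5, 3, 2, 2]
trimmed_seq, _ = sliding_window_trim(seq, quals)
print(trimmed_seq)  # the read is truncated where quality collapses at the 3' end
```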

Read alignment

Reads are typically mapped to either a genome or a transcriptome, as will be discussed later. An important mapping quality parameter is the percentage of mapped reads, which is a global indicator of the overall sequencing accuracy and of the presence of contaminating DNA. For example, we expect between 70 and 90 % of regular RNA-seq reads to map onto the human genome (depending on the read mapper used) [15], with a significant fraction of reads mapping equally well to a limited number of identical regions (‘multi-mapping reads’). When reads are mapped against the transcriptome, we expect slightly lower total mapping percentages because reads coming from unannotated transcripts will be lost, and significantly more multi-mapping reads because of reads falling onto exons that are shared by different transcript isoforms of the same gene.
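As a rough sketch of how these statistics can be obtained, the fraction of mapped and multi-mapping reads can be computed from a BAM file with pysam; "example.bam" is a placeholder, and the NH tag or a MAPQ of 0 are conventions followed by many, but not all, mappers.

```python
# Rough sketch of the mapping-rate check; assumes a BAM that retains unmapped reads.
import pysam

mapped = unmapped = multi = 0
with pysam.AlignmentFile("example.bam", "rb") as bam:
    for read in bam:
        if read.is_secondary or read.is_supplementary:
            continue  # count each read once, from its primary record
        if read.is_unmapped:
            unmapped += 1
            continue
        mapped += 1
        # Many mappers report the number of hits in the NH tag; MAPQ 0 is
        # another common convention for multi-mapping reads.
        if (read.has_tag("NH") and read.get_tag("NH") > 1) or read.mapping_quality == 0:
            multi += 1

total = mapped + unmapped
print(f"mapped: {100 * mapped / total:.1f}%  multi-mapping: {100 * multi / total:.1f}%")
```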

Other important parameters are the uniformity of read coverage on exons and the mapped strand. If reads primarily accumulate at the 3’ end of transcripts in poly(A)-selected samples, this might indicate low RNA quality in the starting material. The GC content of mapped reads may reveal PCR biases. Tools for quality control in mapping include Picard [16], RSeQC [17] and Qualimap [18].

Quantification

Once actual transcript quantification values have been calculated, they should be checked for GC content and gene length biases so that correcting normalization methods can be applied if necessary. If the reference transcriptome is well annotated, researchers can analyze the biotype composition of the sample, which is indicative of the quality of the RNA purification step. For example, rRNA and small RNAs should not be present in regular poly(A) long RNA preparations [10, 19]. A number of R packages (such as NOISeq [19] or EDASeq [20]) provide useful plots for quality control of count data.
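A simple diagnostic of this kind, checking for a residual association between gene length and counts, can be sketched as follows; toy data stand in for a real count table, and an analogous check applies to GC content.

```python
# Illustrative length-bias check: a strong positive correlation between gene
# length and counts suggests that length-aware normalization is needed.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)
gene_lengths = rng.integers(500, 10000, size=3000)
counts = rng.poisson(gene_lengths / 100)   # toy counts with a built-in length bias

rho, p = spearmanr(gene_lengths, counts)
print(f"length bias: rho = {rho:.2f}, p = {p:.1e}")
```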

Reproducibility

The quality-control steps described above involve individual samples. It is also crucial to assess the global quality of the RNA-seq dataset by checking the reproducibility among replicates and for possible batch effects. Reproducibility among technical replicates should generally be high (Spearman R2 > 0.9) [1], but no clear standard exists for biological replicates, as this depends on the heterogeneity of the experimental system. If gene expression differences exist among experimental conditions, it should be expected that biological replicates of the same condition will cluster together in a principal component analysis (PCA).
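Both checks can be run in a few lines; in this illustrative sketch a toy count matrix stands in for real data, replicate agreement is measured with a Spearman correlation, and samples are projected by PCA on log-transformed values.

```python
# Sketch of replicate-agreement and sample-clustering checks on a toy count matrix.
import numpy as np
from scipy.stats import spearmanr
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
gene_means = rng.gamma(shape=0.5, scale=100.0, size=1000)                  # toy expression levels
counts = rng.poisson(gene_means[:, None], size=(1000, 6)).astype(float)   # 6 samples

# Spearman correlation between two replicates (columns 0 and 1)
rho, _ = spearmanr(counts[:, 0], counts[:, 1])
print(f"Spearman R^2 between replicates: {rho**2:.2f}")

# PCA on log-transformed counts; replicates of a condition should cluster together
coords = PCA(n_components=2).fit_transform(np.log2(counts + 1).T)  # samples x PCs
print(coords)   # plot PC1 versus PC2, one point per sample
```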

Transcript identification

When a reference genome is available, RNA-seq analysis will normally involve the mapping of the reads onto the reference genome or transcriptome to infer which transcripts are expressed. Mapping solely to the reference transcriptome of a known species precludes the discovery of new, unannotated transcripts and focuses the analysis on quantification alone. By contrast, if the organism does not have a sequenced genome, then the analysis path is first to assemble reads into longer contigs and then to treat these contigs as the expressed transcriptome to which reads are mapped back again for quantification. In either case, read coverage can be used to quantify transcript expression level (Fig. 1b). A basic choice is whether transcript identification and quantification are done sequentially or simultaneously.

Alignment

Two alternatives are possible when a reference sequence is available: mapping to the genome or mapping to the annotated transcriptome (Fig. 2a, b; Box 3). Regardless of whether a genome or transcriptome reference is used, reads may map uniquely (they can be assigned to only one position in the reference) or could be multi-mapped reads (multireads). Genomic multireads are primarily due to repetitive sequences or shared domains of paralogous genes. They normally account for a significant fraction of the mapping output when mapped onto the genome and should not be discarded. When the reference is the transcriptome, multi-mapping arises even more often because a read that would have been uniquely mapped on the genome would map equally well to all gene isoforms in the transcriptome that share the exon. In either case — genome or transcriptome mapping — transcript identification and quantification become important challenges for alternatively expressed genes.

Fig. 2 Read mapping and transcript identification strategies. Three basic strategies for regular RNA-seq analysis. a An annotated genome is available and reads are mapped to the genome with a gapped mapper. Next (novel) transcript discovery and quantification can proceed with or without an annotation file. Novel transcripts are then functionally annotated. b If no novel transcript discovery is needed, reads can be mapped to the reference transcriptome using an ungapped aligner. Transcript identification and quantification can occur simultaneously. c When no genome is available, reads need to be assembled first into contigs or transcripts. For quantification, reads are mapped back to the novel reference transcriptome and further analysis proceeds as in (b), followed by the functional annotation of the novel transcripts as in (a). Representative software that can be used at each analysis step is indicated in bold text. Abbreviations: GFF, General Feature Format; GTF, Gene Transfer Format; RSEM, RNA-Seq by Expectation Maximization

Transcript discovery

Identifying novel transcripts using the short reads provided by Illumina technology is one of the most challenging tasks in RNA-seq. Short reads rarely span several splice junctions and thus make it difficult to directly infer all full-length transcripts. In addition, it is difficult to identify transcription start and end sites [21], and tools such as GRIT [22] that incorporate other data such as 5’ ends from CAGE or RAMPAGE typically have a better chance of annotating the major expressed isoforms correctly. In any case, PE reads and higher coverage help to reconstruct lowly expressed transcripts, and replicates are essential to resolve false-positive calls (that is, mapping artifacts or contamination) at the low end of signal detection. Several methods, such as Cufflinks [23], iReckon [24], SLIDE [25] and StringTie [26], incorporate existing annotations by adding them to the possible list of isoforms. Montebello [27] couples isoform discovery and quantification using a likelihood-based Monte Carlo algorithm to boost performance. Gene-finding tools such as Augustus [28] can incorporate RNA-seq data to better annotate protein-coding transcripts, but perform worse on non-coding transcripts [29]. In general, accurate transcript reconstruction from short reads is difficult, and methods typically show substantial disagreement [29].

De novo transcript reconstruction

When a reference genome is not available or is incomplete, RNA-seq reads can be assembled de novo (Fig. 2c) into a transcriptome using packages such as SOAPdenovo-Trans [30], Oases [31], Trans-ABySS [32] or Trinity [33]. In general, PE strand-specific sequencing and long reads are preferred because they are more informative [33]. Although it is impossible to assemble lowly expressed transcripts that lack enough coverage for a reliable assembly, too many reads are also problematic because they lead to potential misassembly and increased runtimes. Therefore, in silico reduction of the number of reads is recommended for deeply sequenced samples [33]. For comparative analyses across samples, it is advisable to combine all reads from multiple samples into a single input in order to obtain a consolidated set of contigs (transcripts), followed by mapping back of the short reads for expression estimation [33].

Either with a reference or de novo, the complete reconstruction of transcriptomes using short-read Illumina technology remains a challenging problem, and in many cases de novo assembly results in tens or hundreds of contigs accounting for fragmented transcripts. Emerging long-read technologies, such as SMRT from Pacific Biosciences, provide reads that are long enough to sequence complete transcripts for most genes and are a promising alternative that is discussed further in the “Outlook” section below.

Transcript quantification

The most common application of RNA-seq is to estimate gene and transcript expression. This application is primarily based on the number of reads that map to each transcript sequence, although there are algorithms such as Sailfish that rely on k-mer counting in reads without the need for mapping [34]. The simplest approach to quantification is to aggregate raw counts of mapped reads using programs such as HTSeq-count [35] or featureCounts [36]. This gene-level (rather than transcript-level) quantification approach utilizes a gene transfer format (GTF) file [37] containing the genome coordinates of exons and genes, and often discards multireads. Raw read counts alone are not sufficient to compare expression levels among samples, as these values are affected by factors such as transcript length, total number of reads, and sequencing biases. The measure RPKM (reads per kilobase of exon model per million reads) [1] is a within-sample normalization method that removes the feature-length and library-size effects. This measure and its subsequent derivatives FPKM (fragments per kilobase of exon model per million mapped reads), a within-sample normalized transcript expression measure analogous to RPKM, and TPM (transcripts per million) are the most frequently reported RNA-seq gene expression values. It should be noted that RPKM and FPKM are equivalent for SE reads and that FPKM can be converted into TPM using a simple formula [38]. The dichotomy of within-sample and between-sample comparisons has led to much confusion in the literature. Correcting for gene length is not necessary when comparing changes in the expression of the same gene across samples, but it is necessary for correctly ranking gene expression levels within a sample to account for the fact that longer genes accumulate more reads. Furthermore, programs such as Cufflinks that estimate gene length from the data can find significant differences in gene length between samples that cannot be ignored. TPMs, which effectively normalize for the differences in transcript composition in the denominator rather than simply dividing by the number of reads in the library, are considered more comparable between samples of different origins and composition, but can still suffer from some biases. These must be addressed with normalization techniques such as TMM.
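As a worked example of these measures and of the FPKM-to-TPM conversion, consider three genes with made-up counts and lengths; this is a sketch only, and real pipelines use effective lengths and per-transcript estimates.

```python
# Worked example of within-sample expression measures for three toy genes.
import numpy as np

counts = np.array([500., 1000., 300.])
lengths = np.array([2000., 4000., 1000.])   # gene lengths in bp

# RPKM: reads / (gene length in kb * total reads in millions)
rpkm = counts / (lengths / 1e3) / (counts.sum() / 1e6)

# TPM: length-corrected counts rescaled to sum to one million, so the
# denominator reflects the transcript composition of the sample
rate = counts / lengths
tpm = rate / rate.sum() * 1e6

# For SE data RPKM and FPKM coincide, and FPKM converts to TPM by rescaling
tpm_from_rpkm = rpkm / rpkm.sum() * 1e6
print(rpkm, tpm, np.allclose(tpm, tpm_from_rpkm))   # last value is True
```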

Several sophisticated algorithms have been developed to estimate transcript-level expression by tackling the problem of related transcripts sharing most of their reads. Cufflinks [39] estimates transcript expression from a mapping to the genome obtained from mappers such as TopHat, using an expectation-maximization approach to estimate transcript abundances. This approach takes into account biases such as the non-uniform read distribution along the gene length. Cufflinks was designed to take advantage of PE reads, and may use GTF information to identify expressed transcripts or infer transcripts de novo from the mapping data alone. Algorithms that quantify expression from transcriptome mappings include RSEM (RNA-Seq by Expectation Maximization) [40], eXpress [41], Sailfish [35] and kallisto [42], among others. These methods allocate multi-mapping reads among transcripts and output within-sample normalized values corrected for sequencing biases [35, 41, 43]. Additionally, the RSEM algorithm uses an expectation-maximization approach that returns TPM values [40]. NURD [44] provides an efficient way of estimating transcript expression from SE reads with a low memory and computing cost.
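The expectation-maximization idea that underlies several of these tools can be illustrated with a deliberately simplified sketch in which multi-mapping reads are fractionally allocated in proportion to current abundance estimates; transcript lengths, fragment models and sequencing biases, which the real algorithms account for, are ignored here.

```python
# Toy EM for allocating multi-mapping reads among transcripts (RSEM-like in
# spirit only; not the implementation of any published tool).
import numpy as np

# each read lists the transcripts it maps to equally well
read_compatibility = [["T1"], ["T1"], ["T1", "T2"], ["T1", "T2"], ["T2", "T3"]]
transcripts = ["T1", "T2", "T3"]
idx = {t: i for i, t in enumerate(transcripts)}

theta = np.full(len(transcripts), 1.0 / len(transcripts))  # uniform start
for _ in range(50):
    expected = np.zeros(len(transcripts))
    for compatible in read_compatibility:                   # E-step
        weights = np.array([theta[idx[t]] for t in compatible])
        weights /= weights.sum()
        for t, w in zip(compatible, weights):
            expected[idx[t]] += w
    theta = expected / expected.sum()                        # M-step

print(dict(zip(transcripts, theta.round(3))))   # estimated transcript proportions
```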

Differential gene expression analysis

Differential expression analysis (Fig. 1b) requires that gene expression values be compared among samples. RPKM, FPKM, and TPM normalize away the most important factor for comparing samples, sequencing depth, either directly or by accounting for the number of transcripts, which can differ significantly between samples. These approaches rely on normalizing methods that are based on total or effective counts, and tend to perform poorly when samples have heterogeneous transcript distributions, that is, when highly and differentially expressed features can skew the count distribution [45, 46]. Normalization methods that take this into account are TMM [47], DESeq [48], PoissonSeq [49] and UpperQuartile [45], which ignore highly variable and/or highly expressed features. Additional factors that interfere with intra-sample comparisons include changes in transcript length across samples or conditions [50], positional biases in coverage along the transcript (which are accounted for in Cufflinks), average fragment size [43], and the GC content of genes (corrected in the EDASeq package [21]). The NOISeq R package [20] contains a wide variety of diagnostic plots to identify sources of bias in RNA-seq data and to apply appropriate normalization procedures in each case. Finally, despite these sample-specific normalization methods, batch effects may still be present in the data. These effects can be minimized by appropriate experimental design [51] or, alternatively, removed by batch-correction methods such as ComBat [52] or ARSyN [20, 53]. These approaches, although initially developed for microarray data, have been shown to work well with normalized RNA-seq data (STATegra project, unpublished).
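To illustrate the difference from simple library-size scaling, the idea behind upper-quartile normalization [45] can be sketched as follows; this is a schematic version rather than the published implementation, and centering the factors on their geometric mean is one common convention.

```python
# Schematic upper-quartile scaling: each sample is scaled by the 75th percentile
# of its non-zero counts, which is less sensitive to a few very highly
# expressed genes than total-count scaling.
import numpy as np

def upper_quartile_factors(counts):
    """counts: genes x samples array of raw counts."""
    factors = np.array([np.percentile(col[col > 0], 75) for col in counts.T])
    return factors / np.exp(np.mean(np.log(factors)))   # center on geometric mean

rng = np.random.default_rng(1)
counts = rng.negative_binomial(5, 0.1, size=(2000, 4)).astype(float)  # toy data
factors = upper_quartile_factors(counts)
normalized = counts / factors        # per-sample scaling by broadcasting
print(factors)
```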

As RNA-seq quantification is based on read counts that are absolutely or probabilistically assigned to transcripts, the first approaches to compute differential expression used discrete probability distributions, such as the Poisson or negative binomial [48, 54]. The negative binomial distribution (also known as the gamma-Poisson distribution) is a generalization of the Poisson distribution that allows for additional variance (called overdispersion) beyond the variance expected from randomly sampling from a pool of molecules; this extra variance is characteristic of RNA-seq data. However, the use of discrete distributions is not required for accurate analysis of differential expression as long as the sampling variance of small read counts is taken into account (most important for experiments with small numbers of replicates). Methods for transforming normalized counts of RNA-seq reads while learning the variance structure of the data have been shown to perform well in comparison to the discrete distribution approaches described above [55, 56]. Moreover, after extensive normalization (including TMM and batch removal), the data might have lost their discrete nature and be more akin to a continuous distribution.
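For readers less familiar with this distribution, the mean-variance relationship can be made explicit: for mean mu and overdispersion alpha, the variance is mu + alpha * mu^2, and alpha = 0 recovers the Poisson. The values below are arbitrary and only convert this parameterization into scipy's.

```python
# Negative binomial (gamma-Poisson) mean-variance relationship with toy values.
from scipy.stats import nbinom

mu, alpha = 100.0, 0.2           # mean and overdispersion (illustrative)
n = 1.0 / alpha                  # scipy's 'number of successes'
p = n / (n + mu)                 # scipy's success probability
dist = nbinom(n, p)
print(dist.mean(), dist.var())   # ~100 and ~100 + 0.2 * 100**2 = 2100
```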

Some methods, such as the popular edgeR [57], take raw read counts as input and introduce possible bias sources into the statistical model to perform an integrated normalization as well as a differential expression analysis. Other methods require the data to be normalized beforehand to remove all possible biases. DESeq2, like edgeR, uses the negative binomial as the reference distribution and provides its own normalization approach [48, 58]. baySeq [59] and EBSeq [60] are Bayesian approaches, also based on the negative binomial model, that define a collection of models to describe the differences among experimental groups and compute the posterior probability of each model for each gene. Other approaches include data transformation methods that take into account the sampling variance of small read counts and create discrete gene expression distributions that can be analyzed by regular linear models [55]. Finally, non-parametric approaches such as NOISeq [10] or SAMseq [61] make minimal assumptions about the data and estimate the null distribution for inferential analysis from the actual data alone. For small-scale studies that compare two samples with no or few replicates, the estimation of the negative binomial distribution can be noisy. In such cases, simpler methods based on the Poisson distribution, such as DEGseq [62], or on empirical distributions (NOISeq [10]) can be an alternative, although it should be strongly stressed that, in the absence of biological replication, no population inference can be made and hence any p value calculation is invalid. Methods that analyze RNA-seq data without replicates therefore have only exploratory value. Considering the drop in the price of sequencing, we recommend that RNA-seq experiments have a minimum of three biological replicates when sample availability is not limiting, to allow all of the differential expression methods to leverage reproducibility between replicates.

Recent independent comparison studies have demonstrated that the choice of the method (or even the version of a software package) can markedly affect the outcome of the analysis and that no single method is likely to perform favorably for all datasets [56, 63, 64] (Box 4). We therefore recommend thoroughly documenting the settings and version numbers of programs used and considering the repetition of important analyses using more than one package.

Alternative splicing analysis

Transcript-level differential expression analysis can potentially detect changes in the expression of transcript isoforms from the same gene, and specific algorithms for alternative splicing-focused analysis using RNA-seq have been proposed. These methods fall into two major categories. The first approach integrates isoform expression estimation with the detection of differential expression to reveal changes in the proportion of each isoform within the total gene expression. One such early method, BASIS, used a hierarchical Bayesian model to directly infer differentially expressed transcript isoforms [65]. CuffDiff2 estimates isoform expression first and then compares their differences. By integrating the two steps, the uncertainty in the first step is taken into consideration when performing the statistical analysis to look for differential isoform expression [66]. The flow difference metric (FDM) uses aligned cumulative transcript graphs from mapped exon reads and junction reads to infer isoforms, and the Jensen-Shannon divergence to measure the difference [67]. Recently, Shi and Jiang [68] proposed a new method, rSeqDiff, that uses a hierarchical likelihood ratio test to simultaneously detect differential gene expression without splicing change and differential isoform expression. All these approaches are generally hampered by the intrinsic limitations of short-read sequencing for accurate identification at the isoform level, as discussed in the RNA-seq Genome Annotation Assessment Project paper [30].

The so-called ‘exon-based’ approach skips the estimation of isoform expression and detects signals of alternative splicing by comparing the distributions of reads on exons and junctions of the genes between the compared samples. This approach is based on the premise that differences in isoform expression can be tracked in the signals of exons and their junctions. DEXseq [69] and DSGSeq [70] adopt a similar idea to detect differentially spliced genes by testing for significant differences in read counts on exons (and junctions) of the genes. rMATS detects differential usage of exons by comparing exon-inclusion levels defined with junction reads [71]. rDiff detects differential isoform expression by comparing read counts on alternative regions of the gene, either with or without annotated alternative isoforms [72]. DiffSplice uses alignment graphs to identify alternative splicing modules (ASMs) and identifies differential splicing using signals of the ASMs [73]. The advantage of exon or junction methods is their greater accuracy in identifying individual alternative splicing events. Exon-based methods are appropriate if the focus of the study is not on whole isoforms but on the inclusion and exclusion of specific exons and the functional protein domains (or regulatory features, in the case of untranslated region exons) that they contain.
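The quantity underlying many exon-based methods, the exon inclusion level or ‘percent spliced in’ (PSI), reduces to a ratio of length-normalized junction counts. The sketch below uses made-up numbers and the simplest possible effective-length correction for a cassette exon; it is illustrative rather than the calculation of any specific tool.

```python
# Illustrative PSI calculation for a cassette exon from junction read counts.
def psi(inclusion_reads, skipping_reads, inclusion_len=2, skipping_len=1):
    """Two junctions support inclusion of a cassette exon and one supports
    skipping, hence the default effective-length ratio of 2:1."""
    inc = inclusion_reads / inclusion_len
    skp = skipping_reads / skipping_len
    return inc / (inc + skp)

print(psi(inclusion_reads=80, skipping_reads=20))   # ~0.67 in sample A
print(psi(inclusion_reads=30, skipping_reads=60))   # 0.20 in sample B
```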

Visualization

Visualization of RNA-seq data (Fig. 1c) is, in general terms, similar to that of any other type of genomic sequencing data, and it can be done at the level of reads (using ReadXplorer [74], for example) or at the level of processed coverage (read pileup), unnormalized (for example, total count) or normalized, using genome browsers such as the UCSC browser [75], Integrative Genomics Viewer (IGV) [76] (Figure S1a in Additional file 1), Genome Maps [77], or Savant [78]. Some visualization tools are specifically designed for visualizing multiple RNA-seq samples, such as RNAseqViewer [79], which provides flexible ways to display the read abundances on exons, transcripts and junctions. Introns can be hidden to better display signals on the exons, and the heatmaps can help the visual comparison of signals on multiple samples (Figure S1b, c in Additional file 1). However, RNAseqViewer is slower than IGV.

Some of the software packages for differential gene expression analysis (such as DESeq2 or DEXseq in Bioconductor) have functions to enable the visualization of results, whereas others have been developed exclusively for visualization purposes, such as CummeRbund (for CuffDiff [66]) or Sashimi plots, which can be used to visualize differentially spliced exons [80]. The advantage of Sashimi plots is that their display of junction reads is more intuitive and aesthetically pleasing when the number of samples is small (Figure S1d in Additional file 1). Sashimi, structure, and hive plots for splicing quantitative trait loci (sQTL) can be obtained using SplicePlot [81]. Splice graphs can be produced using SpliceSeq [82], and SplicingViewer [83] plots splice junctions and alternative splicing events. TraV [84] is a visualization tool that integrates data analysis, but its analytical methods are not applicable to large genomes.

Owing to the complexity of transcriptomes, efficient display of multiple layers of information is still a challenge. All of the tools are evolving rapidly and we can expect more comprehensive tools with desirable features to be available soon. Nevertheless, the existing tools are of great value for exploring results for individual genes of biological interest to assess whether particular analyses’ results can withstand detailed scrutiny or to reveal potential complications caused by artifacts, such as 3’ biases or complicated transcript structures. Users should visualize changes in read coverage for genes that are deemed important or interesting on the basis of their analysis results to evaluate the robustness of their conclusions.

Gene fusion discovery

The discovery of fused genes that can arise from chromosomal rearrangements is analogous to novel isoform discovery, with the added challenge of a much larger search space as we can no longer assume that the transcript segments are co-linear on a single chromosome. Artifacts are common even using state-of-the-art tools, which necessitates post-processing using heuristic filters [85]. Artifacts primarily result from misalignment of read sequences due to polymorphisms, homology, and sequencing errors. Families of homologous genes, and highly polymorphic genes such as the HLA genes, produce reads that cannot be easily mapped uniquely to their location of origin in the reference genome. For genes with very high expression, the small but non-negligible sequencing error rate of RNA-seq will produce reads that map incorrectly to homologous loci. Filtering highly polymorphic genes and pairs of homologous genes is recommended [86, 87]. Also recommended is the filtering of highly expressed genes that are unlikely to be involved in gene fusions, such as ribosomal RNA [86]. Finally, a low ratio of chimeric to wild-type reads in the vicinity of the fusion boundary may indicate spurious mis-mapping of reads from a highly expressed gene (the transcript allele fraction described by Yoshihara et al. [87]).
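As an illustration of this last filter, a candidate can be scored by the fraction of chimeric reads among all reads spanning the fusion boundary; the field names, counts and threshold below are purely illustrative.

```python
# Illustrative allele-fraction filter for fusion candidates.
def allele_fraction(chimeric_reads, wildtype_reads):
    return chimeric_reads / (chimeric_reads + wildtype_reads)

candidates = [
    {"fusion": "GENE1-GENE2", "chimeric": 40, "wildtype": 60},
    {"fusion": "GENE3-GENE4", "chimeric": 3,  "wildtype": 2500},
]
# Keep candidates whose chimeric reads are not vanishingly rare relative to
# wild-type reads at the boundary (threshold chosen arbitrarily here).
kept = [c for c in candidates
        if allele_fraction(c["chimeric"], c["wildtype"]) >= 0.01]
for c in kept:
    print(c["fusion"], round(allele_fraction(c["chimeric"], c["wildtype"]), 3))
```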

Given successful prediction of chimeric sequences, the next step is the prioritization of gene fusions that have biological impact over more expected forms of genomic variation. Examples of expected variation include immunoglobulin (IG) rearrangements in tumor samples infiltrated by immune cells, transiently expressed transposons and nuclear mitochondrial DNA, and read-through chimeras produced by co-transcription of adjacent genes [88]. Care must be taken with filtering in order not to lose events of interest. For example, removing all fusions involving an IG gene may remove real IG fusions in lymphomas and other blood disorders; filtering fusions for which both genes are from the IG locus is preferred [88]. Transiently expressed genomic breakpoint sequences that are associated with real gene fusions often overlap transposons; these should be filtered unless they are associated with additional fusion isoforms from the same gene pair [89]. Read-through chimeras are easily identified as predictions involving alternative splicing between adjacent genes. Where possible, fusions should be filtered by their presence in a set of control datasets [87]. When control datasets are not available, artifacts can be identified by their presence in a large number of unrelated datasets, after excluding the possibility that they represent true recurrent fusions [90, 91].

Strong fusion-sequence predictions are characterized by distinct subsequences that each align with high specificity to one of the fused genes. As alignment specificity is highly correlated with sequence length, a strong prediction sequence is longer, with longer subsequences from each gene. Longer reads and larger insert sizes produce longer predicted sequences; thus, we recommend PE RNA-seq data with larger insert size over SE datasets or datasets with short insert size. Another indicator of prediction strength is splicing. For most known fusions, the genomic breakpoint is located in an intron of each gene [92] and the fusion boundary coincides with a splice site within each gene. Furthermore, fusion isoforms generally follow the splicing patterns of wild-type genes. Thus, high confidence predictions have fusion boundaries coincident with exon boundaries and exons matching wild-type exons [91]. Fusion discovery tools often incorporate some of the aforementioned ideas to rank fusion predictions [93, 94], though most studies apply additional custom heuristic filters to produce a list of high-quality fusion candidates [90, 91, 95].

Small RNAs

Next-generation sequencing represents an increasingly popular method to address questions concerning the biological roles of small RNAs (sRNAs). sRNAs are usually 18–34 nucleotides in length, and they include miRNAs, short-interfering RNAs (siRNAs), PIWI-interacting RNAs (piRNAs), and other classes of regulatory molecules. sRNA-seq libraries are rarely sequenced as deeply as regular RNA-seq libraries because of their lower complexity, with a typical range of 2–10 million reads. Bioinformatics analysis of sRNA-seq data differs from standard RNA-seq protocols (Fig. 1c). Ligated adaptor sequences are first trimmed and the resulting read-length distribution is computed. In animals, there are usually peaks at 22 and 23 nucleotides, whereas in plants there are peaks at 21 and 24 nucleotides for redundant reads. For instance, miRTools 2.0 [96], a tool for prediction and profiling of sRNA species, uses by default reads that are 18–30 bases long. The threshold value depends on the application and, in the case of miRNAs, is usually in the range of 19–25 nucleotides.
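The read-length distribution check after adaptor trimming amounts to a simple histogram; in this sketch, "trimmed.fastq" is a placeholder for the adaptor-trimmed sRNA reads.

```python
# Sketch of the read-length distribution check on adaptor-trimmed sRNA reads.
from collections import Counter

lengths = Counter()
with open("trimmed.fastq") as fq:
    for i, line in enumerate(fq):
        if i % 4 == 1:                 # sequence line of each FASTQ record
            lengths[len(line.strip())] += 1

for length in sorted(lengths):
    print(length, lengths[length])     # expect peaks at ~22-23 nt in animals
```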

As in standard RNA-seq, sRNA reads must then be aligned to a reference genome or to transcriptome sequences using standard tools, such as Bowtie2 [97], STAR [15], or Burrows-Wheeler Aligner (BWA) [98]. There are, however, some aligners (such as PatMaN [99] and MicroRazerS [100]) that have been designed to map short sequences with preset parameter value ranges suited for optimal alignment of short reads. The mapping itself may be performed with or without mismatches, the latter (that is, exact matching) being used more commonly. In addition, reads that map beyond a predetermined number of locations may be removed as putatively originating from repetitive elements. In the case of miRNAs, usually 5–20 distinct mappings per genome are allowed. sRNA reads are then simply counted to obtain expression values. However, users should also verify that their sRNA reads are not significantly contaminated by degraded mRNA, for example, by checking whether a miRNA library shows unexpected read coverage over the body of highly expressed genes such as GAPDH or ACTB.

Further analysis steps include comparison with known sRNAs and de novo identification of sRNAs. There are class-specific tools for this purpose, such as miRDeep [101] and miRDeep-P [102] for animal and plant miRNAs, respectively, or the trans-acting siRNA prediction tool at the UEA sRNA Workbench [103]. Tools such as miRTools 2.0 [96], ShortStack [104], and iMir [105] also exist for comprehensive annotation of sRNA libraries and for identification of diverse classes of sRNAs.

Functional profiling with RNA-seq

The last step in a standard transcriptomics study (Fig. 1b) is often the characterization of the molecular functions or pathways in which differentially expressed genes (DEGs) are involved. The two main approaches to functional characterization that were developed first for microarray technology are (a) comparing a list of DEGs against the rest of the genome for overrepresented functions, and (b) gene set enrichment analysis (GSEA), which is based on ranking the transcriptome according to a measure of differential expression. RNA-seq biases such as gene length complicate the direct application of these methods to count data, and hence RNA-seq-specific tools have been proposed. For example, GOseq [106] estimates a bias effect (such as gene length) on differential expression results and adapts the traditional hypergeometric statistic used in the functional enrichment test to account for this bias. Similarly, the Gene Set Variation Analysis (GSVA) [107] or SeqGSEA [108] packages also combine splicing information and implement enrichment analyses similar to GSEA.
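The classical over-representation test underlying approach (a) is a hypergeometric test, shown below with made-up numbers; note that GOseq modifies this basic test to account for gene-length bias rather than applying it directly.

```python
# Hypergeometric over-representation test for one GO term (illustrative numbers).
from scipy.stats import hypergeom

N = 20000   # annotated genes in the genome
K = 300     # genes annotated to the GO term of interest
n = 500     # differentially expressed genes
k = 25      # DEGs annotated to that term

# P(X >= k) for X ~ Hypergeometric(N, K, n); expected overlap is n*K/N = 7.5
p_value = hypergeom.sf(k - 1, N, K, n)
print(f"enrichment p-value: {p_value:.2e}")
```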

Functional analysis requires the availability of sufficient functional annotation data for the transcriptome under study. Resources such as Gene Ontology [109], Bioconductor [110], DAVID [111, 112] or Babelomics [113] contain annotation data for most model species. However, novel transcripts discovered during de novo transcriptome assembly or reconstruction would lack at least some functional information, and therefore annotation is necessary for functional profiling of those results. Protein-coding transcripts can be functionally annotated using orthology by searching for similar sequences in protein databases such as SwissProt [114] and in databases that contain conserved protein domains such as Pfam [115] and InterPro [116]. The use of standard vocabularies such as the Gene Ontology (GO) allows for some exchangeability of functional information across orthologs. Popular tools such as Blast2GO [117] allow massive annotation of complete transcriptome datasets against a variety of databases and controlled vocabularies. Typically, between 50 and 80 % of the transcripts reconstructed from RNA-seq data can be annotated with functional terms in this way. However, RNA-seq data also reveal that an important fraction of the transcriptome lacks protein-coding potential. The functional annotation of these long non-coding RNAs is more challenging as their conservation is often less pronounced than that of protein-coding genes. The Rfam database [118] contains most well-characterized RNA families, such as ribosomal or transfer RNAs, whereas miRBase [119] or miRanda [120] are specialized in miRNAs. These resources can be used for similarity-based annotation of short non-coding RNAs, but no standard functional annotation procedures are available yet for other RNA types such as long non-coding RNAs.