Datasets

A key aspect of our proposed assessment is the availability of two datasets that permit the computation of the metrics. Each is characterized by including two populations, at least two replicates for each population, and a way to define beforehand which genes or transcripts are differentially expressed and which are not. The replicates permit the assessment of precision, while comparing the two populations permits the assessment of sensitivity, the ability to discover real biological differences. Note that sensitivity is harder to assess because we need to know beforehand what differences to expect. In the past, this has been achieved with spike-in experiments [18–20]. However, the use of spike-ins has been criticized for not properly mimicking real experimental data [21].

The first is the minimal dataset that can be used for the comparison: two replicates each of the cell lines GM12878 and K562. We used results from a microarray experiment comparing the same two cell lines to define real biological differences. We defined genes with a q-value smaller than 0.05 [22, 23] and an absolute log fold change larger than 1 as truly differentially expressed. Genes with q-values larger than 0.5 were denoted as not differentially expressed. Genes in neither of these two groups were considered to be in a “grey area” and were left out of the analysis. Note that we are not considering microarrays to be a gold standard; rather, because the microarray data represent an independent measurement, algorithms that perform better at detecting real differences should, on average, show improved agreement with these independent results.
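As an illustration, this truth assignment reduces to a few lines of code. The sketch below applies the thresholds described above to a table of microarray results; the column names (`qvalue`, `log2fc`) and values are hypothetical, not the paper's actual files or pipeline.

```python
import pandas as pd

# Hypothetical microarray results; column names and values are illustrative.
microarray = pd.DataFrame({
    "gene":   ["g1", "g2", "g3", "g4"],
    "qvalue": [0.01, 0.60, 0.20, 0.002],
    "log2fc": [1.8, 0.1, 0.9, -2.3],
})

# Truly differentially expressed: q-value < 0.05 and |log2 fold change| > 1.
positives = microarray[(microarray.qvalue < 0.05) &
                       (microarray.log2fc.abs() > 1)].gene

# Not differentially expressed: q-value > 0.5.
negatives = microarray[microarray.qvalue > 0.5].gene

# Everything in between is the "grey area" and is excluded from the analysis.
grey = microarray[~microarray.gene.isin(positives) &
                  ~microarray.gene.isin(negatives)].gene
```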

The second dataset was created using 30 samples from the Geuvadis project [24]. These samples were selected to represent a random sample of individuals. To introduce batch-effect-like variability, we selected 15 from one center and 15 from another. These were then randomly divided into two groups, each having seven samples from one center and eight from the other. Because the samples were assigned at random, this is a null experiment and we can consider the 15 samples in each group to be replicates. To distinguish the two groups, we used computer simulations to generate 2424 transcripts designed to be differentially expressed between the two groups. To make these abundances mimic experimental data, we adapted the Polyester method [25] to include GC bias imitating the bias observed in the actual data. The resulting dataset mimics a real one quite well (Fig. 1).

Fig. 1 Estimated log fold changes stratified by transcript abundance on the simulation dataset. One example, based on Cufflinks quantification of two samples, is shown. Black points are non-differential transcripts; blue points are differentially expressed transcripts simulated to have signal in both samples; red points are differentially expressed transcripts simulated to have signal in only one of the samples.

The raw sequencing files for both datasets are available from the webtool (http://rafalab.rc.fas.harvard.edu/rnaseqbenchmark). Further details about both datasets are available in the “Methods” section.

Calibration using control genes

The first challenge to comparing performance across the different approaches is the lack of a standard unit for transcript-level quantification. For example, Cufflinks reports fragments per kilobase of exon per million fragments mapped (FPKM); Flux Capacitor reports reads per kilobase of exon per million reads mapped (RPKM); and eXpress, RSEM, Sailfish, kallisto, and Salmon report transcripts per million (TPM). Note that some of these algorithms provide options for which unit to return, and we allowed each laboratory to decide which unit to report. Other analytical choices, such as the choice of normalization approach, add even more variability to the scales of the reported measures (Fig. 2a). The standard solutions to rescaling, such as dividing by the median, are not appropriate because the median value for a typical sample is 0. Taking the median of the positive values is also inappropriate, as it may introduce a bias in favor of or against methods that over- or underestimate the number of features with no expression. To overcome these challenges, we considered only genes reported to be house-keeping genes [26], which are more likely to be expressed. Specifically, for each algorithm we computed the median over this subset of genes; because house-keeping genes are typically expressed, this median is not 0. We then selected one of the algorithms to serve as a reference baseline (we used RSEM, which reported in TPM) and rescaled all methods on the log scale so that a value of 0 on the log scale maps to a TPM of 1. Figure 2 shows the data before and after this rescaling. Note that this is not meant as a normalization step, but rather as a simple rescaling so that the reported quantifications are in approximately the same units for all methods. Furthermore, note that fold change values are not affected by this rescaling, since the measurements for each algorithm are divided by the same constant, which cancels out in fold-change calculations.
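The rescaling step can be sketched as follows, assuming each method's quantifications are held in a pandas Series indexed by gene; the function and variable names are our own, not from the paper's code. Because the median is invariant to monotone transformations, matching medians over house-keeping genes on the raw scale is equivalent to shifting by the median difference on the log scale.

```python
import pandas as pd

def rescale_to_reference(quant, ref, housekeeping):
    """Rescale one method's quantifications so that its median over
    house-keeping genes matches the reference method's (RSEM TPM in the
    text). `quant` and `ref` are pandas Series indexed by gene ID."""
    # Medians are taken over house-keeping genes only; these genes are
    # expected to be expressed, so the medians are not 0.
    scale = ref[housekeeping].median() / quant[housekeeping].median()
    # Every value is multiplied by the same constant, so fold changes
    # computed within a method are unaffected.
    return quant * scale
```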

Fig. 2 Distribution of reported transcript quantifications for one sample of the simulation dataset a before and b after rescaling. Seven quantification methods are shown.

Correlation is not a measure of precision or reproducibility

The use of correlation to summarize reproducibility has become widespread in genomics [18, 27, 28]. Despite what the everyday meaning of the word suggests, correlation is not, mathematically, necessarily informative with regard to reproducibility. For this reason, we dedicate a subsection to explaining the major problems with this metric (details are provided in the “Methods” section).

The most egregious related mistake is to compute correlations of raw FPKM, RPKM, or TPM data. Averages, standard deviations, and correlations are popular summary statistics for two-dimensional data because, for the bivariate normal distribution, these five parameters (two means, two SDs, and the correlation) fully describe the distribution [29]. However, RNA-seq data are not well approximated by the bivariate normal distribution (Fig. 2). In fact, these data have a very large right tail, which implies that the correlation estimate can be highly susceptible to a single point (Additional file 1: Figure S1a). The log transformation is one way to ameliorate this problem.
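A small simulation, under assumed lognormal abundances, illustrates the point: a single extreme value shared by two otherwise unrelated replicates drives the raw-scale correlation toward 1, while the log-scale correlation stays small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "replicates" that are in fact independent, plus one shared, very
# highly expressed feature; heavy right tails like this are typical of
# raw FPKM/RPKM/TPM values.
x = np.concatenate([rng.lognormal(1, 1, 999), [1e5]])
y = np.concatenate([rng.lognormal(1, 1, 999), [1e5]])

print(np.corrcoef(x, y)[0, 1])                  # near 1: the one point dominates
print(np.corrcoef(np.log(x), np.log(y))[0, 1])  # far from 1 after the log
```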

The standard way to quantify reproducibility between two sets of replicated measurements, say $X_1, \dots, X_N$ and $Y_1, \dots, Y_N$, is simply to determine how close they are to each other. To quantify this, we compute the distance between them:

$$ \sqrt{\sum_{i=1}^{N} d_i^2}, \quad \mathrm{with}\ d_i = X_i - Y_i $$

This metric decreases as reproducibility improves and is 0 when the reproducibility is perfect.

Another limitation of correlation is that it is defined for two lists of measurements, with no standard way of applying it when more than two replicates are available. A standard measure of precision, the standard deviation (SD) across replicates, is more appropriate. Note that there is a connection between this metric and distance, since the SD for two replicates is:

$$ \sqrt{\frac{1}{2}\left\{\left(X_i - \frac{X_i + Y_i}{2}\right)^2 + \left(Y_i - \frac{X_i + Y_i}{2}\right)^2\right\}} = \frac{\left|X_i - Y_i\right|}{2} = \frac{\left|d_i\right|}{2} $$
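This identity is easy to verify numerically; the check below uses made-up values and the divisor-$n$ form of the SD used above.

```python
import numpy as np

x = np.array([5.0, 2.0, 8.0])   # replicate 1 (three features, made-up values)
y = np.array([4.0, 3.0, 6.0])   # replicate 2

m = (x + y) / 2                               # per-feature mean
sd = np.sqrt(((x - m)**2 + (y - m)**2) / 2)   # SD with divisor n = 2

# The SD equals half the absolute difference, |d_i| / 2.
assert np.allclose(sd, np.abs(x - y) / 2)
```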

Another limitation of correlation is that it does not detect irreproducibility caused by shifts in average levels. These can occur, for example, if the data are not properly normalized. The distance metric does detect these differences (Additional file 1: Figures S1b and S2). The mathematical details are provided in the “Methods”.

Another advantage of the SD metric is that it is in the same units as our measurements; correlation is unitless, which makes it hard to interpret. In the “Methods” section, we mathematically demonstrate that correlations near 1 do not necessarily imply reproducibility. Specifically, we show an equation explaining how the distance between two measures can be unacceptably high even when the correlation is close to 1.

Precision metrics

Once the raw data are mapped, quantified, and re-scaled, a matrix with one column for each replicate and thousands of rows is produced for each group. The entries of this matrix are what the algorithms provide as a measure that is proportional to expression. Here we denote this quantity by $Y_{gij}$, where $g = 1, \dots, G$ indexes the feature (gene or transcript), $i = 1, 2$ identifies the two groups, and $j = 1, \dots, J_i$ indexes the replicate within group $i$.

Our first metric is based on the standard deviations, denoted $s_{gi}$, of $\log(Y_{gi1}+0.5), \dots, \log(Y_{giJ_i}+0.5)$. This metric has an intuitive interpretation as it represents the typical log fold change observed when comparing expression values from replicate samples. We compute the SD on the log scale because biologists quantify differential expression with fold changes. Because the log is not defined when the $Y_{gij}$ values are 0, we add the constant 0.5 [30] before computing the log. In the case of two replicates, the SD is proportional to the absolute value of the log ratio:

$$ M_{gi} = \log_2\left\{\left(Y_{gi1} + 0.5\right)/\left(Y_{gi2} + 0.5\right)\right\} $$

Note that we have an $s_{gi}$ for every transcript and each group. To provide one summary, we can average across all the features to obtain one measure of reproducibility:

$$ \sqrt{\frac{1}{2}\left(\frac{1}{G}\sum_{g=1}^{G} s_{g1}^2 + \frac{1}{G}\sum_{g=1}^{G} s_{g2}^2\right)} $$

For the case of two replicates this is proportional to Euclidean distance. The smaller this quantity, the more reproducible we assess the algorithm to be. However, as we describe below, providing just one summary is simplistic due to the dependence of variability on abundance.
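In code, the per-feature SDs and the single pooled summary might be computed as follows; the matrix layout and function name are our own assumptions.

```python
import numpy as np

def pooled_precision(y1, y2):
    """One-number precision summary. `y1` and `y2` are G x J_i arrays of
    rescaled quantifications (features in rows, replicates in columns)
    for groups 1 and 2."""
    # Per-feature SD of log2(Y + 0.5) within each group: s_g1 and s_g2.
    s1 = np.log2(y1 + 0.5).std(axis=1, ddof=0)
    s2 = np.log2(y2 + 0.5).std(axis=1, ddof=0)
    # Average the squared SDs over features and groups, then take the root.
    return np.sqrt((np.mean(s1**2) + np.mean(s2**2)) / 2)
```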

Empirical and theoretical evidence suggests that the range of $s_{gi}$ is larger for lower abundance transcript measurements; thus, visualization approaches plot $s_{gi}$ versus a measure of average abundance:

$$ A_{gi} = \frac{1}{J_i}\sum_{j=1}^{J_i} \log_2\left(Y_{gij} + 0.5\right) $$

Additional file 1: Figure S3a confirms that larger variability is observed for smaller values of A. Note that we may prefer a method that outperforms for the values of A that are most common (for example, less than 45 % of the data has A values larger than 2). However, this plot does not lend itself to comparisons across methods, as each method needs its own plot. To create a summary plot that includes all the methods, we reduce the relevant information by estimating $s_{gi}$ as a function of $A_{gi}$ [31]. Specifically, we apply loess [32] to estimate this function. We can then plot several curves on the same plot to compare methods (Fig. 3). To provide summary statistics that take into account the dependence on abundance, we report the median $s_{gi}$ for the low (A lower than 1), medium (A between 1 and 6), and high (A larger than 6) strata (columns 2–4 in Tables 1 and 2). We include standard error estimates of these metrics as well.
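A sketch of the smoothing and stratification follows, using the lowess implementation in statsmodels as a stand-in for the loess routine cited in [32]; the strata boundaries follow the text, while the function name and smoothing span are our own choices, and each stratum is assumed non-empty.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def sd_vs_abundance(y, low=1.0, high=6.0):
    """For one group's G x J matrix `y`, return the smoothed SD-versus-
    abundance curve and the stratified medians reported in the tables."""
    logy = np.log2(y + 0.5)
    a = logy.mean(axis=1)                  # average abundance A_gi
    s = logy.std(axis=1, ddof=0)           # across-replicate SD s_gi
    curve = lowess(s, a, frac=0.3)         # columns: sorted A, smoothed s
    medians = (np.median(s[a < low]),                    # low stratum
               np.median(s[(a >= low) & (a <= high)]),   # medium stratum
               np.median(s[a > high]))                   # high stratum
    return curve, medians
```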

Fig. 3 Standard deviations of transcript quantifications based on a an experimental dataset (GM12878) and b a simulation dataset (one of the cell lines). Seven quantification methods are shown.

Table 1 Summarized metrics for analyzed pipelines based on an experimental dataset

Table 2 Summarized metrics for analyzed pipelines based on a simulation dataset

In the first dataset, Flux Capacitor and eXpress clearly underperform compared with the other methods in the regions with most data (A between 3 and 8). In the second dataset, only Flux Capacitor clearly underperforms. The other methods performed similarly, with RSEM performing slightly better in both datasets. Overall, the precision was substantially worse than what we observe for microarrays (compare, for example, with Fig. 2 in [31]). This is particularly true for low abundance transcripts, where even the best methods show a standard deviation of 0.5; since $2^{0.5} \approx 1.41$, this translates to a typical difference of 41 % between replicate measurements.

Although we observed differences between quantification algorithms, different aligners showed similar results (Additional file 1: Figure S4). STAR generally outperformed TopHat2, although only marginally. Also, RSEM mapped with Bowtie2 outperformed RSEM with STAR (Additional file 1: Figure S4a).

Because a large percentage of the quantifications are 0, we also developed summary statistics and visualization techniques to assess the across-replicate consistency of 0 calls. Note that the features considered here are those with at least one 0 in the pair of replicated measures. For each of these we report the proportion of discordant calls:

$$ D_i(K) = \Pr\left[\left(Y_{gi1} < K \ \&\ Y_{gi2} \ge K\right)\ \mathrm{or}\ \left(Y_{gi1} \ge K \ \&\ Y_{gi2} < K\right)\right] $$

where K is a threshold defining a transcript as expressed. Because methods that call more 0s are less likely to be discordant, we plot $D_i(K)$ against the proportion of transcripts (Fig. 4). The results here are similar to those of Fig. 3. We report $D_i(K)$ for K = 1 in column 5 of Tables 1 and 2. In both datasets, Flux Capacitor clearly underperforms compared with the other methods.
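The discordance metric itself reduces to a few lines; here `y1` and `y2` are vectors of rescaled quantifications for two replicates, and the names are illustrative.

```python
import numpy as np

def discordance_rate(y1, y2, k=1.0):
    """D_i(K): proportion of discordant expressed/not-expressed calls at
    threshold `k`, among features with at least one 0 in the pair."""
    has_zero = (y1 == 0) | (y2 == 0)
    discordant = ((y1 < k) & (y2 >= k)) | ((y1 >= k) & (y2 < k))
    return discordant[has_zero].mean()
```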

Fig. 4 Proportions of discordant expression calls based on a an experimental dataset (GM12878) and b a simulation dataset (one of the cell lines). Seven quantification methods are shown.

Consistency of isoform calls

Because RNA-seq is commonly used to infer alternative transcription, we also assessed the reproducibility of relative abundance among transcripts of the same gene. To obtain a simple and interpretable metric, we considered only genes with exactly two annotated transcripts. Specifically, for each sample we computed the proportion of the total gene abundance attributed to the first transcript: if $t_1$ and $t_2$ are the reported abundances for transcripts 1 and 2, we computed $t_1/(t_1 + t_2)$. We then performed every pairwise comparison of two replicates and, for each gene, recorded the difference in these proportions. Note that we expect this difference to be 0, since the same transcripts should be reported in the two replicates. We plotted these differences against abundance $A_{gij}$, since we expected larger differences at lower abundances (Additional file 1: Figure S3b). We then stratified the absolute differences by values of $A_{gij}$ and computed the median value. This permitted us to compare curves across methods (Fig. 5). Here we found Flux Capacitor to underperform compared with the other methods, which performed similarly. However, it is worth noting how high these values are, especially for low abundance transcripts, where the median differences are close to 0.5, meaning that we are essentially guessing which transcript is present.
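The proportion differences can be computed as in the sketch below, assuming each replicate's two-transcript genes are stored as a G x 2 array; genes with zero total abundance would need to be filtered out first, and the names are our own.

```python
import numpy as np

def isoform_proportion_diffs(t1, t2):
    """Difference in the proportion attributed to the first transcript,
    t1/(t1 + t2), between two replicates. `t1` and `t2` are G x 2 arrays
    (rows: genes with exactly two transcripts; columns: the transcripts)."""
    p1 = t1[:, 0] / t1.sum(axis=1)   # proportion in replicate 1
    p2 = t2[:, 0] / t2.sum(axis=1)   # proportion in replicate 2
    # For a perfectly reproducible method these differences are all 0.
    return p1 - p2
```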

Fig. 5 Proportion differences of transcript quantifications in genes with only two annotated transcripts based on a an experimental dataset (GM12878) and b a simulation dataset (one of the cell lines). Seven quantification methods are shown.

Sensitivity

The above-described metrics relate to reproducibility or specificity. But given the tradeoff between specificity and sensitivity, it is imperative that we also assess sensitivity. For example, a method that calls every gene expressed at 10 TPM has perfect specificity but would, of course, fail to detect any real differences. Recall that for the first dataset we defined truly differentially expressed genes, not transcripts. For each algorithm, we therefore constructed one measure per gene by combining the reported quantities for each transcript, using the aggregation method recommended by that algorithm. For the second dataset, the truly differentially expressed transcripts were known by construction.

To assess accuracy, we computed an average log fold change for each of the truly differentially expressed transcripts:

$$ M_g = \frac{1}{J_2}\sum_{j=1}^{J_2} \log_2\left(Y_{g2j} + 0.5\right) - \frac{1}{J_1}\sum_{j=1}^{J_1} \log_2\left(Y_{g1j} + 0.5\right), $$

multiplied it by the sign of the true fold change (so that all true fold changes could be considered positive), and plotted it against the average abundance (Additional file 1: Figure S3c):

$$ A_g = \frac{1}{2J_2}\sum_{j=1}^{J_2} \log_2\left(Y_{g2j} + 0.5\right) + \frac{1}{2J_1}\sum_{j=1}^{J_1} \log_2\left(Y_{g1j} + 0.5\right) $$
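Both quantities follow directly from the group-wise means of the log quantifications; a sketch (with our own function and argument names) is:

```python
import numpy as np

def sensitivity_measures(y1, y2, true_sign):
    """M_g and A_g for the truly differentially expressed features.
    `y1`, `y2` are G x J_i arrays for the two groups; `true_sign` is the
    sign (+1/-1) of each feature's true fold change."""
    l1 = np.log2(y1 + 0.5).mean(axis=1)   # group 1 average log expression
    l2 = np.log2(y2 + 0.5).mean(axis=1)   # group 2 average log expression
    m = (l2 - l1) * true_sign             # signed so true changes are positive
    a = (l1 + l2) / 2                     # average abundance A_g
    return m, a
```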

To compare methods, we fitted loess curves to these plots and show the resulting curves for all methods together. Note that for the second dataset these curves should equal 1 for all values of A, since all truly differentially expressed transcripts were designed to have true log (base 2) fold changes of 1 (Additional file 1: Figure S5). Here we can see that, with the exception of the underperforming Flux Capacitor, all methods perform similarly. As we did for the standard deviation metric, we report the median sensitivity measure for three strata of abundance (columns 10–12 in Tables 1 and 2).

ROC curves and pAUC

Finally, to assess sensitivity and specificity simultaneously, we constructed receiver operating characteristic (ROC) curves. Because in the first dataset we used genes and in the second we used transcripts, here we use the general term feature to refer to both. We used the same approach as in the previous section to define positives (truly differentially expressed features) and negatives (not differentially expressed features). We then obtained log fold change values for every feature across every pairwise comparison between the two groups. Following common practice, we removed all features for which both samples reported values below 1. Each pairwise comparison resulted in one ROC curve. We averaged these curves using threshold averaging based on fold changes [33] to produce one ROC curve for each method (Fig. 6). The ROC curves only include false positive rates (FPRs) below 0.2 because, in practice, it would be rare to accept a FPR higher than this; a FPR of 0.2 already represents thousands of false positive features. Here we see Flux Capacitor and eXpress underperforming and RSEM slightly outperforming the other methods in both datasets. The partial (up to a FPR of 0.2) area under the curve (pAUC), standardized as in [34], is included in column 13 of Tables 1 and 2.
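A sketch of the curve construction and the standardized pAUC follows; we rank features by absolute log fold change and use the McClish standardization (which we take to be the standardization cited as [34]), so that chance performance scores 0.5 and a perfect ranking scores 1. Names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve

def standardized_pauc(truth, abs_log_fc, max_fpr=0.2):
    """Partial AUC up to `max_fpr`, standardized to [0.5, 1]. `truth`
    marks the truly differentially expressed features (1) versus the
    non-differential ones (0); `abs_log_fc` is the ranking statistic."""
    fpr, tpr, _ = roc_curve(truth, abs_log_fc)
    # Truncate the curve at max_fpr, interpolating the boundary point.
    keep = fpr <= max_fpr
    fpr_p = np.append(fpr[keep], max_fpr)
    tpr_p = np.append(tpr[keep], np.interp(max_fpr, fpr, tpr))
    raw = np.trapz(tpr_p, fpr_p)
    min_area = max_fpr**2 / 2    # area achieved by a chance-level ranking
    max_area = max_fpr           # area achieved by a perfect ranking
    return 0.5 * (1 + (raw - min_area) / (max_area - min_area))
```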

Fig. 6 ROC curves indicating performance of quantification methods based on differential expression analysis of a an experimental dataset and b a simulation dataset. Seven quantification methods are shown. FP false positive, TP true positive.

Although the results for genes are comparable to those seen for microarrays (see Fig. 4 in [8]), we note that the results for transcripts are generally not impressive. For example, to recover half of the real differences, we need to accept a FPR of over 0.15. In fact, to achieve even these results, we removed all transcripts for which both samples reported quantities below 1. Without this filtering, the technology does not perform much better than guessing (Additional file 1: Figure S6). This result is in agreement with a recent publication describing the importance of filtering low abundance values in RNA-seq data [35].