Our paper was just published describing a new method for modeling and correcting fragment sequence bias for estimation of transcript abundances from RNA-seq:

“Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation”

Q: What’s the point of the paper?

GC content bias – affecting the amplification of fragments – is widespread in sequencing datasets and well-known, yet there were no existing transcript abundance estimation methods that properly corrected for this sample-specific bias. We identified hundreds of cases where top methods mis-identified the dominant isoform due to fragment-level GC bias being left uncorrected. While there are existing methods for post-hoc correction of gene abundance estimates using gene-level GC content, the transcript-level bias correction task is much more difficult. Sample-specific technical variation in coverage on small regions of transcripts leads to dramatic shifts in abundance estimates.

Q: What do you mean by “systematic errors”?

You can split any errors into systematic and stochastic components, with the former a fixed quantity and the latter varying but with zero mean. With the measurement of transcript expression, the stochastic part can be minimized by, for example, increasing sequencing depth and increasing the sample size: the number of biological units measured to infer a population mean expression value. We used the term “systematic” in the title to underscore that higher sequencing depth and more samples will not help to remove the bias we describe.

Q: Is there a video where you explain the gist of this?

Yes. Below is a 5 minute video where I cover the basic ideas presented in the paper.

Q: When should I correct for fragment GC bias? Are all my results wrong?

There are two situations where it’s critical to correct for GC bias for transcript-level analysis:

(1) The most problematic situation is when one is comparing abundances across groups of samples, and the groups have variable GC content dependence. This often happens when samples are processed in different labs or at different times (ideally experiments should not be designed this way, but it is nevertheless common in genomics research). We demonstrate that this can lead to thousands of false positives results for differential expression of transcripts, and these differences will often rise to the very top of a ranked list. The solutions are to use methods which produce GC-bias-corrected transcript abundance estimates, to use block experimental design, to include experimental batch as a covariate in statistical analysis, and to examine GC content dependence across samples using software like FASTQC and MultiQC.

(2) The second point of warning is when any of the samples in the dataset contain strong GC dependence, that is, the library preparation was not able to adequately amplify fragments with GC content < 35% or > 65%*. Such samples were present in a subset of the batches of all the datasets we examined in the paper. Even if all of the samples come from a single batch, or the experiment has a block design with multiple batches, simple descriptive analysis of transcript abundance (e.g. how many isoforms are expressed per gene, which isoforms are expressed) will be inaccurate for hundreds-to-thousands of genes when ignoring fragment GC bias. Furthermore, differential expression, if present, could be attributed to the wrong isoform or isoforms within genes.

* These guidelines come from the Discussion section of an excellent paper on technical reproducibility of RNA-seq by the GEUVADIS project: ‘t Hoen et al (2013).

Q: Where are the details of the alpine statistical method?

The statistical method is detailed in the Supplementary Note. This allowed us to have more fine grained control of the presentation of LaTeX equations. The methods in the Supplementary Note were originally in the main text and were refined during peer review.

Q: Is alpine the only method implementing fragment GC bias correction?

No, the latest version of Salmon (v0.7, with methods described in the bioRxiv preprint) also implements a fragment GC bias correction similar to alpine, which runs at a fraction of the time. Salmon with the gcBias flag similarly reduces the mis-estimation of isoforms caused by fragment GC bias (see examples in Supp. Figure 5 of the latest Salmon preprint).

Q: Is the Salmon implementation of GC bias correction better than alpine?

The GC bias correction model is probably about equal, but other aspects of the Salmon method are superior to alpine. The alpine method, when estimating transcript abundances, focuses on one gene or locus at a time, and does not account for multi-mapping fragments across genes. The Salmon implementation is a full, transcriptome-wide estimation method which can simultaneously estimate and correct for sample-specific fragment GC bias (and other biases as well).

Our focus in writing alpine was to make a super-extensible method for modeling RNA-seq biases, in order to make rigorous comparisons of various methods for bias correction, and then to show that fragment GC bias correction in particular leads to more accurate estimates of transcript abundance. By super-extensible, I mean that one can easily modify details of the bias correction specification: any combination of bias effects (fragment length distribution, positional, random hexamer sequence bias, GC bias), interactions between these, or modifications to the spline parameters used to fit positional and fragment GC bias. It is also possible to take empirical fragment GC bias curves from alpine and directly incorporate these into the Polyester RNA-seq simulator. This results in simulated RNA-seq data with variable coverage that looks more similar to real RNA-seq coverage.

Links to papers mentioned in video:

Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. Improving RNA-Seq expression estimates by correcting for fragment bias (2011)

https://www.ncbi.nlm.nih.gov/pubmed/21410973

https://www.ncbi.nlm.nih.gov/pubmed/21410973 Benjamini Y, Speed TP. Summarizing and correcting the GC content bias in high-throughput sequencing (2012)

https://www.ncbi.nlm.nih.gov/pubmed/22323520