

Note 0: Why the title “Please feed the bears”? A sleuth is a group of bears. Bears eat fish. This coincidence hasn’t escaped us.

Note 1: This post (and the nifty additions to Sleuth contained herein) was written in collaboration with Richard Smith-Unna (who, by the way, I’d like to congratulate again on being selected as one of the 2015 Mozilla Fellows for Science).

Note 2: This post has been updated. It’s no longer necessary to install a custom fork to use sleuth with sailfish and salmon output. Rather, the new wasabi package — discussed below — makes using sailfish and salmon output with the mainline version of sleuth simple!



The quasi-mapping procedure described in my previous blog post (and the initial library-fication thereof) has allowed us to make short work of using quasi-mapping in Sailfish. Quasi-mapping makes Sailfish both faster and more accurate (and makes Salmon faster while retaining its accuracy). Quasi-mapping has been part of the new releases of Sailfish since v0.7.0, but the release of the latest version of Sailfish (v0.7.6) marks the inclusion of a significant new feature: evaluation of the variance of abundance estimation via bootstrapping (Salmon also has this feature as of v0.5.0)1. This makes interfacing Sailfish and Salmon with the new Sleuth tool for differential expression a trivial matter of format conversion, one that we decided was worth tackling.

This is exciting for many reasons. First, it provides more freedom for the user in mixing and matching tools to build the quantification → DE → higher-level analysis pipeline they want. Second, Sleuth is one of the first packages built specifically for transcript-level quantification tools: it takes advantage of the statistical properties one expects downstream of modern quantification tools, as well as the ability to incorporate assessments of (technical) estimation uncertainty when available. Thus, we believe that pairing Sailfish or Salmon with Sleuth will provide a fast, accurate, and easy-to-use pipeline for transcript-level quantification and differential expression testing.

Why would we want to add Sailfish (or Salmon) support to Sleuth? Currently, Sleuth only supports Kallisto. While Kallisto is a great tool, there are plenty of reasons to avoid locking into a particular pipeline: you may already have another tool in your workflow, you may prefer (or need) a different license, your computing resources may be better suited to a particular tool, you may want to validate that your results are robust to the choice of upstream quantification tool, or you may depend on features that exist in only one particular tool. We think the community benefits from access to a choice of tools, and from interoperability between those tools. With this blog post, we wanted to take a step towards better interoperability.

So, without further ado, let’s get on to how to link the two tools into a pipeline. Demonstrating the pipeline by way of a tutorial, we’ll analyze this 4-condition experiment [accession PRJDB2508] in A. thaliana. Further, we’ll be using this reference transcriptome. If you want to skip the quantification step and jump directly to the analysis, you can grab pre-computed quantification results in the appropriate format here. First, we create the index:

> curl ftp://ftp.ensemblgenomes.org/pub/plants/release-28/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.28.cdna.all.fa.gz | gunzip -c > artd_ref.fa
> sailfish index -t artd_ref.fa -o artd_index_k31

Now, we assume that the input data is in a top-level directory data, with each sub-folder giving the run accession. For example, the paired-end read files for run accession DRR016125 are stored in data/DRR016125/DRR016125_1.fastq.gz and data/DRR016125/DRR016125_2.fastq.gz. This can be done conveniently with the following snippet:

#!/bin/bash
mkdir data
cd data
for i in `seq 25 40`; do
    mkdir DRR0161${i}
    cd DRR0161${i}
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/DRR016/DRR0161${i}/DRR0161${i}_1.fastq.gz
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/DRR016/DRR0161${i}/DRR0161${i}_2.fastq.gz
    cd ..
done
cd ..

We make a top-level directory for all of the output:

> mkdir quants

Next, we want to quantify all of the samples with Sailfish, and generate bootstrap estimates for each sample. The following should do the trick2:

#!/bin/bash
for fn in data/DRR0161{25..40}; do
    samp=`basename ${fn}`
    echo "Processing sample ${samp}"
    sailfish quant -i artd_index_k31 -l IU \
        -1 <(gunzip -c ${fn}/${samp}_1.fastq.gz) \
        -2 <(gunzip -c ${fn}/${samp}_2.fastq.gz) \
        -o quants/${samp}_quant --useVBOpt --numBootstraps 30
done

We’re opting here to use the VB optimizer (--useVBOpt) rather than the standard EM optimizer; the VB optimizer generally tends to give more accurate results than the standard EM algorithm, and bootstrapping works equally well with both.

Now, if we have a look in the quants directory, we’ll see that there is a subdirectory for every input sample. Each quantification subdirectory contains the following files and subdirectories: logs, quant.sf, and aux. The logs subdirectory simply contains the log file, the quant.sf file holds the normal Sailfish quantification estimates, and the aux directory contains the new stuff, including the bootstrap estimates (in a “bootstrap” subdirectory under “aux”).

The bootstrap subdirectory contains two files: names.tsv.gz and bootstraps.gz. The first file simply contains the names of the transcripts / contigs present in the bootstrap file (in the order they appear on each line), and the second file contains the actual bootstrap estimates. To keep sizes small, the bootstrap estimates are written in a binary format, and then compressed during output. However, the format itself is very simple. The gzipped, binary file consists of num_transcripts × num_bootstraps double-precision floating-point numbers. Each contiguous block of num_transcripts numbers constitutes a single bootstrap sample (i.e. an estimated read count for each transcript), and there are num_bootstraps such blocks. That is, you can think of this as a row-oriented file where each row is a bootstrap sample, or as a column-oriented file where each column holds all of the bootstrap samples for a single transcript. You can use this handy little script if you want to convert the bootstrap estimates into a text-based format for easy inspection.
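If you’d prefer to decode the binary format yourself, here’s a minimal R sketch (this is our own illustration, not the linked script; read_bootstraps is a hypothetical helper). It assumes names.tsv.gz stores the transcript names on a single tab-separated line, and that the doubles are in the machine’s native byte order:

# minimal sketch: read Sailfish/Salmon bootstrap estimates into a matrix
read_bootstraps <- function(quant_dir, num_boot = 30) {
  boot_dir <- file.path(quant_dir, "aux", "bootstrap")
  # transcript names, assumed tab-separated on a single line
  txp_names <- strsplit(readLines(gzfile(file.path(boot_dir, "names.tsv.gz")))[1], "\t")[[1]]
  num_txps <- length(txp_names)
  # the gzipped binary file holds num_txps * num_boot doubles;
  # each contiguous block of num_txps values is one bootstrap sample
  con <- gzfile(file.path(boot_dir, "bootstraps.gz"), open = "rb")
  vals <- readBin(con, what = "double", n = num_txps * num_boot, size = 8)
  close(con)
  # rows = bootstrap samples, columns = transcripts
  boots <- matrix(vals, nrow = num_boot, ncol = num_txps, byrow = TRUE)
  colnames(boots) <- txp_names
  boots
}

# e.g. summarise per-transcript bootstrap variance for one sample
boots <- read_bootstraps("quants/DRR016125_quant", num_boot = 30)
head(apply(boots, 2, var))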

To work with this output, Sleuth needs it converted into the kallisto-style HDF5 format it reads3 (here’s a nice description of the hdf5 format that is processed by Sleuth, and the information it should contain). Previously, this required installing our modified fork of Sleuth; but, as mentioned in the update above, the new wasabi package now handles this conversion, so the mainline version of Sleuth works out of the box.
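For the curious: once the conversion has run (see the R session below), each sample directory should contain a kallisto-style abundance.h5 file, and you can peek at its structure with the rhdf5 Bioconductor package. Treat this as a sketch; the exact group layout is wasabi’s:

# inspect the HDF5 file wasabi produces for one sample
library(rhdf5)
h5ls("quants/DRR016125_quant/abundance.h5")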

Now that we have our quantification results prepared, we’re ready to load up our data and do some analysis. Using the new wasabi package, this is simple! Ok, so let’s fire up R and kick some RNA-seq analysis tires:

source("http://bioconductor.org/biocLite.R") biocLite("devtools") # only if devtools not yet installed devtools::install_github("pachterlab/sleuth") devtools::install_github("COMBINE-lab/wasabi") library('seluth') library('wasabi') setwd('quants') sf_dirs <- dir() # convert the sailfish dirs to kallisto h5 format prepare_fish_for_sleuth(sf_dirs) # we have 4 conditions: wild type, each gene knocked out individually, # and both kocked out together genotypes <- c( rep('wildtype', 4), rep('ros1-3', 4), rep('dml2;3', 4), rep('ros1dml2;3', 4) ) # describe the experiment in terms of genotype and growth condition # we also summarise the gene knockouts and osmotic stress status # or each sample s2c <- data.frame( sample = sf_dirs, genotype = genotypes, growth = rep(c(':mock', ':aba', ':saline', ':dry'), 4), ros1_3 = rep(c(rep(':on', 4), rep(':off', 4)), 2), dml2_3 = c(rep(':on', 8), rep(':off', 8)), osmotic_stress = rep(c(':no', ':no', ':yes', ':yes'), 4), path = sf_dirs ) s2c$path <- as.character(s2c$path) s2c # model the experimental question model <- "~ osmotic_stress" # load data and fit the model so <- sleuth_prep(s2c, as.formula(model)) %>% sleuth_fit() models(so) # test so <- sleuth_wt(so, which_beta = 'osmotic_stress:yes') # inspect the results sleuth_live(so) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 source ( "http://bioconductor.org/biocLite.R" ) biocLite ( "devtools" ) # only if devtools not yet installed devtools :: install_github ( "pachterlab/sleuth" ) devtools :: install_github ( "COMBINE-lab/wasabi" ) library ( 'seluth' ) library ( 'wasabi' ) setwd ( 'quants' ) sf_dirs < - dir ( ) # convert the sailfish dirs to kallisto h5 format prepare_fish_for_sleuth ( sf_dirs ) # we have 4 conditions: wild type, each gene knocked out individually, # and both kocked out together genotypes < - c ( rep ( 'wildtype' , 4 ) , rep ( 'ros1-3' , 4 ) , rep ( 'dml2;3' , 4 ) , rep ( 'ros1dml2;3' , 4 ) ) # describe the experiment in terms of genotype and growth condition # we also summarise the gene knockouts and osmotic stress status # or each sample s2c < - data . frame ( sample = sf_dirs , genotype = genotypes , growth = rep ( c ( ':mock' , ':aba' , ':saline' , ':dry' ) , 4 ) , ros1_3 = rep ( c ( rep ( ':on' , 4 ) , rep ( ':off' , 4 ) ) , 2 ) , dml2_3 = c ( rep ( ':on' , 8 ) , rep ( ':off' , 8 ) ) , osmotic_stress = rep ( c ( ':no' , ':no' , ':yes' , ':yes' ) , 4 ) , path = sf _ dirs ) s2c $ path < - as . character ( s2c $ path ) s2c # model the experimental question model < - "~ osmotic_stress" # load data and fit the model so < - sleuth_prep ( s2c , as . formula ( model ) ) % > % sleuth_fit ( ) models ( so ) # test so < - sleuth_wt ( so , which_beta = 'osmotic_stress:yes' ) # inspect the results sleuth_live ( so )

Awesome! If we jump into the browser window that connects to the Shiny app, and look under maps → PCA, we can see that the samples separate nicely by osmotic stress:

If we color by growth instead, and plot PC dimensions 2 vs. 3, we get 4 nice clusters, as we’d expect:
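(Both PCA views can also be generated without the Shiny app, using sleuth’s plot_pca helper; a quick sketch, using the covariate names from our s2c table:)

# PC1 vs PC2, colored by osmotic stress
plot_pca(so, color_by = 'osmotic_stress')
# PC2 vs PC3, colored by growth condition
plot_pca(so, pc_x = 2, pc_y = 3, color_by = 'growth')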

If we look at analyses → test table, we can see the most highly DE transcripts:
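The same table can be pulled directly into the R session with sleuth_results; a quick sketch:

# extract the Wald test results as a data frame, most significant first
wt_results <- sleuth_results(so, 'osmotic_stress:yes')
head(wt_results[order(wt_results$qval), ])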

The top DE transcript is AT1G32900.1 (GBSS1, GRANULE BOUND STARCH SYNTHASE 1), which makes sense given that we know that osmotic stress induces carbohydrate storage. We can scoot over to analyses → transcript view, and look at how this transcript is expressed across samples, colored by osmotic stress:

The transcript is well-expressed in those conditions under osmotic stress (“yes”) and absent or at low expression in those conditions not under osmotic stress (“no”), as we would expect.
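This per-transcript view, with its bootstrap-derived variability, is also available programmatically via sleuth’s plot_bootstrap; a sketch, again using our s2c covariate name:

# expression of the top DE transcript across samples, with bootstrap variation
plot_bootstrap(so, 'AT1G32900.1', units = 'est_counts', color_by = 'osmotic_stress')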