Growth of SRA data (http://www.ncbi.nlm.nih.gov/Traces/sra/i/g.png)

Get the right version of the software and configure it





When downloading, make sure you get the newest version from the NCBI website. Don't install it from GitHub or from the Ubuntu Software Centre (or apt-get), as that will probably be an older version. In the binary directory (it looks like /path/to/sratoolkit.2.4.3-ubuntu64/bin) there will be a file called sratoolkit.jar. On Linux, use "java -jar sratoolkit.jar" to open the graphical interface. In the preferences menu, enable the local repository and select a path for it. By doing this, you can then use sra-toolkit to "stream" fastq data (see below).

EDIT: if you are seeing an error like this one:

/data/app/sratoolkit.2.4.3-ubuntu64/bin/fastq-dump --split-files -A ERR366438

2015-02-15T21:47:01 fastq-dump.2.4.3 err: binary large object corrupt while reading binary large object within virtual database module - failed ERR366438

=============================================================
An error occurred during processing.
A report was generated into the file '/data/home/ncbi_error_report.xml'.
If the problem persists, you may consider sending the file
to 'sra@ncbi.nlm.nih.gov' for assistance.
=============================================================



Then grab the new sra-toolkit version 2.4.4, which seems to fix problems with SRA archives that use reference-based compression (when submitters provide data in aligned BAM format).



Try streaming the data

You can convert SRA to fastq on the fly with either of the following (-O sets the output directory; -Z writes to standard output):





fastq-dump -A SRR1722641 -O SRR1722641.fastq



fastq-dump -A SRR900186 -Z > SRR900186.fastq
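Because -Z writes to standard output, the stream can also be piped straight into another command instead of a file. As a minimal sketch, the helper below counts the reads arriving on standard input. The name count_fastq_reads is hypothetical, and it assumes plain four-line FASTQ records.

```shell
# Hypothetical helper: count FASTQ records arriving on standard input.
# Assumes every record is exactly four lines (no wrapped sequences).
count_fastq_reads() {
    awk 'END { print int(NR / 4) }'
}

# Intended use (needs network access and sra-toolkit, so left commented out):
# fastq-dump -A SRR900186 -Z | count_fastq_reads
```

This is a quick way to check that a stream delivers the expected number of reads before committing to a long download.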





Streaming paired-end data can be problematic, because both mates arrive in a single stream. Use the following to save forward and reverse reads as separate files. Thanks to the folks at Biostars for this idea.





SRR=SRR1041311 ; fastq-dump -X 10 --split-files -I -Z $SRR \
| tee >(grep '@.*\.1\s' -A3 --no-group-separator \
> ${SRR}_1.fastq) >(grep '@.*\.2\s' -A3 --no-group-separator \
> ${SRR}_2.fastq) >/dev/null
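The command above relies on bash process substitution and on GNU grep's --no-group-separator option. Where those are not available, the same split can be sketched with awk alone. This is only an illustration: the function name split_mates is hypothetical, and it assumes four-line records whose first header field ends in .1 or .2, as produced by -I.

```shell
# Hypothetical helper: split a mixed FASTQ stream into <prefix>_1.fastq and
# <prefix>_2.fastq, using the .1 / .2 read suffix that fastq-dump -I adds.
# Assumes four-line records; only header lines are inspected, so quality
# strings that happen to start with '@' are never mistaken for headers.
split_mates() {
    prefix=$1
    awk -v p="$prefix" '
        NR % 4 == 1 {                      # header line of each record
            n = split($1, id, "[.]")       # first field, e.g. @SRR1041311.7.2
            mate = id[n]                   # last dot-separated part
            out = (mate == "1" || mate == "2") ? p "_" mate ".fastq" : ""
        }
        out != "" { print > out }          # write all four lines of the record
    '
}

# Intended use (needs network access and sra-toolkit, so left commented out):
# SRR=SRR1041311 ; fastq-dump -X 10 --split-files -I -Z "$SRR" | split_mates "$SRR"
```

Records whose header does not end in .1 or .2 are silently dropped in this sketch, so it is worth comparing the line counts of the two output files afterwards.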





Use a download accelerator

The SRA team actually recommends using Aspera Connect to speed up the download of SRA files. If streaming isn't working for you, give Aspera a try using this script. If you struggle to get Aspera configured, you can try a download accelerator such as axel or aria2c. Here's an example with axel, using five connections:



axel -n5 ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX709/SRX709649/SRR1585277/SRR1585277.sra
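The axel example uses a ByExp path, which needs the experiment accession (SRX) as well as the run. The same FTP server also has a ByRun layout (see the SRR504687 archive URL elsewhere in this post), which can be derived from the run accession alone. Here is a sketch, assuming that layout holds for other runs; the function name sra_byrun_url is hypothetical.

```shell
# Hypothetical helper: build a ByRun archive URL from a run accession,
# following the pattern .../ByRun/sra/SRR/SRR504/SRR504687/SRR504687.sra
# Assumption: other runs follow the same directory layout.
sra_byrun_url() {
    acc=$1
    prefix=$(printf '%s' "$acc" | cut -c1-3)   # e.g. SRR
    stem=$(printf '%s' "$acc" | cut -c1-6)     # e.g. SRR504
    printf 'ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/%s/%s/%s/%s.sra\n' \
        "$prefix" "$stem" "$acc" "$acc"
}

# Intended use (needs network access and axel, so left commented out):
# axel -n5 "$(sra_byrun_url SRR1585277)"
```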





After downloading the SRA archive, dump the fastq:



fastq-dump -A SRR900186.sra -Z --split-files





Via the browser

Here are two useful approaches suggested by SeqAnswers. You can download each fastq.gz file individually from your web browser (not the command-line interface) by replacing the digits after SRR in this link:





http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=dload&run_list=SRR515925&format=fastq



or batch download with a link like:



http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=dload&run_list=SRR294514,SRR352727,SRR364895&format=fastq
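For longer run lists, typing the comma-separated link by hand gets error-prone. As a small sketch, the helper below joins any number of run accessions into a link of the same shape as the one above. The function name sra_batch_url is hypothetical, and the URL pattern is simply copied from this post.

```shell
# Hypothetical helper: build the batch-download link from run accessions.
sra_batch_url() {
    # Join all arguments with commas to form the run_list value.
    runs=$(IFS=,; printf '%s' "$*")
    printf 'http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=dload&run_list=%s&format=fastq\n' "$runs"
}

# Example: print a link to paste into the browser.
# sra_batch_url SRR294514 SRR352727 SRR364895
```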





Alternatively, find your study accession number (e.g. SRP013698) and go to the SRA Run Selector:





http://trace.ncbi.nlm.nih.gov/Traces/study/?go=home





Search with your SRP number, then click on the "Run" link. Click on the "Reads" tab, then click "Filtered Download", change the format to "FASTQ" and hit "Download".





SRA mirrors

Stream directly into your analysis pipeline

You can send the data straight through your QC and alignment pipeline without saving intermediate files. Here is an example using SRA toolkit for Olego alignment:



fastq-dump -A SRR764858 -Z \
| fastq_quality_trimmer -l 25 -t 20 -Q33 \
| olego -t 8 Athaliana.TAIR10.23.dna_sm.genome.fa - \
| samtools view -uSh - \
| samtools sort - SRR764858_sra.sort





And another using curl from EBI:



curl ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/SRR764858/SRR764858.fastq.gz \
| pigz -d | fastq_quality_trimmer -t 20 -l 25 -Q33 \
| olego -t 8 Athaliana.TAIR10.23.dna_sm.genome.fa - \
| samtools view -uSh - \
| samtools sort - SRR764858_ebi.sort
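The EBI path in the curl command follows a regular pattern: the first six characters of the accession, then the accession itself, with an optional _1 or _2 suffix for paired-end files. Here is a sketch that builds such a URL, assuming that pattern. The function name ena_fastq_url is hypothetical, and it deliberately refuses accessions that are not nine characters long, since the examples in this post only show that case.

```shell
# Hypothetical helper: build an EBI fastq.gz URL from a run accession.
# Usage: ena_fastq_url ACCESSION [MATE]   (MATE is 1 or 2 for paired-end files)
# Assumption: the layout seen in this post, which only shows 9-character accessions.
ena_fastq_url() {
    acc=$1
    mate=${2:-}
    if [ "${#acc}" -ne 9 ]; then
        echo "ena_fastq_url: only 9-character accessions are handled" >&2
        return 1
    fi
    stem=$(printf '%s' "$acc" | cut -c1-6)     # e.g. SRR764
    suffix=${mate:+_$mate}                     # empty for single-end runs
    printf 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/%s/%s/%s%s.fastq.gz\n' \
        "$stem" "$acc" "$acc" "$suffix"
}

# Intended use (needs network access, so left commented out):
# curl "$(ena_fastq_url SRR764858)" | pigz -d | head
```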





Dump color-space sequence

Occasionally you'll come across data in color-space format. After downloading the SRA archive, do the following:

abi-dump -A SRR1657115.sra

That will dump the sequence in fasta format (SRR1657115.sra.csfasta) and the quality string (SRR1657115.sra.qual) in separate files. Then I use solid-trimmer.py to do quality trimming. Here's an example:

solid-trimmer.py -c SRR1657115.sra.csfasta -q SRR1657115.sra.qual --moving-average 7:12 --min-read-length 25 > SRR1657115.fasta

Most of the data on SRA is mirrored at ENA or DNAnexus. You can download the compressed fastq files from ENA for forward and reverse reads:

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR504/SRR504687/SRR504687_1.fastq.gz

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR504/SRR504687/SRR504687_2.fastq.gz

You can download the SRA archive from DNAnexus too.

ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR504/SRR504687/SRR504687.sra

The Short Read Archive (SRA) is the main repository for next-generation sequencing (NGS) raw data. Considering the sheer (and accelerating) rate at which NGS data is generated, the team at NCBI should be congratulated for providing this service to the scientific community. Take a look at the growth of SRA (the chart linked at the top of this post). SRA, however, doesn't directly provide the fastq files that we commonly work with; they prefer the .sra archive format, which requires specialised software (sra-toolkit) to extract. Sra-toolkit has been described as buggy and painful, and I've had my frustrations with it. This post collects the best sra-toolkit tips that I've found.

Happy data mining!