Over 30 years ago, subsections of NCBI, EMBL-EBI, and DDBJ came together to form the International Nucleotide Sequence Database Collaboration (INSDC). (source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3013722/) The INSDC joined NCBI’s GenBank, EMBL-EBI’s ENA, and DDBJ’s DRA together into a unified group “to ensure that all public domain nucleotide sequence data deposited in the archives is preserved as part of the scientific record and is accessible in standardized formats across the three sites through daily data exchange.” (source: https://academic.oup.com/nar/article/46/D1/D48/4668651). The INSDC has two primary offerings: “Raw data archives under the collaboration are known as the Trace Archive for raw data from capillary electrophoresis platforms and the Sequence Read Archive for raw and read alignment data from next-generation platforms.” (source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3013722/).

You may have noticed that NCBI and INSDC both have an offering called SRA, and you may be wondering whether these are the same thing. As far as I can tell, the answer is both yes and no. The INSDC’s SRA is a special database that is co-managed by NCBI, EBI, and DDBJ. Each of those organizations hosts a full copy of the database. EBI’s copy is called ENA, DDBJ’s copy is called DRA, and NCBI’s copy is, confusingly, called SRA. Each of these organizations can also accept new submissions to the database.

If an experiment is first submitted to ENA, it and its associated data objects will be prefixed with `ER`, such as ERP008771, ERX1762259, or ERR1692631. If an experiment is first submitted to DRA, they will instead be prefixed with `DR`, such as DRP000425, DRX000772, or DRR001175. If an experiment is first submitted to NCBI’s SRA, they will be prefixed with `SR`, such as SRP060416, SRX1082691, or SRR2088722. (I’d like to have links for each of these objects so as to show them off.) If you’re asking what the differences between ERP/DRP/SRP, ERX/DRX/SRX, and ERR/DRR/SRR data objects are, then stay tuned. I’ll cover SRA’s data model in the next section.
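To make the prefix rule concrete, here is a minimal Python sketch that guesses which archive an accession was first submitted to. The function name `originating_archive`, the mapping table, and the regular expression are all my own illustrative assumptions, not part of any official INSDC tooling, and the pattern only covers the P/X/R style accessions shown above.

```python
import re

# Prefix-to-archive mapping, taken from the examples in this post.
ARCHIVE_BY_PREFIX = {
    "ER": "ENA",
    "DR": "DRA",
    "SR": "NCBI SRA",
}

# Assumed shape: two-letter archive prefix, one object-type letter
# (P, X, or R), then digits. Other accession kinds are not handled here.
ACCESSION_RE = re.compile(r"^(ER|DR|SR)([PXR])(\d+)$")


def originating_archive(accession):
    """Return the archive an accession was first submitted to, or None
    if the string does not look like one of the accessions above."""
    match = ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        return None
    return ARCHIVE_BY_PREFIX[match.group(1)]


if __name__ == "__main__":
    for acc in ["ERP008771", "DRX000772", "SRR2088722"]:
        print(acc, "->", originating_archive(acc))
```

Because all three archives mirror each other, the prefix only tells you where the data was first deposited, not where you have to download it from.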

Fun Fact: According to the SRA Wikipedia page, SRA used to stand for Short Read Archive, so if you ever see that name, it’s not entirely wrong.

Bonus: While researching this post, I came across China’s Genome Sequence Archive (GSA), which complies “with data standards and structures of the INSDC” (source: https://www.sciencedirect.com/science/article/pii/S1672022917300025). However, it does not appear to replicate or contribute to the shared SRA repository and instead maintains its own collection. Downloading, processing, and serving data from GSA has already been added to our future plans, especially given the growth in Chinese investment in basic science.

Microarray Repositories