A collection of open datasets from fields such as bioinformatics and molecular biology.

We will keep adding to this list as we come across more datasets. Feel free to suggest datasets in the comments and we will include them.

The 1000 Genomes Project is the first project to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation. The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied.

The project provides a deep catalog of human genetic variation. As a community resource project, the 1000 Genomes Project publicly releases data on a regular basis. Data formats and analysis software developed by the project are also made publicly available.
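To make the 1% frequency threshold concrete, here is a minimal sketch in Python. The variant names and allele counts are made up for illustration and are not taken from the project's data; the function names are hypothetical as well.

```python
def allele_frequency(alt_count, total_alleles):
    """Fraction of observed alleles that carry the variant."""
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    return alt_count / total_alleles


def common_variants(variants, threshold=0.01):
    """Return the names of variants whose frequency is at least `threshold`.

    `variants` is an iterable of (name, alt_count, total_alleles) tuples.
    """
    return [
        name
        for name, alt_count, total in variants
        if allele_frequency(alt_count, total) >= threshold
    ]


# Hypothetical counts, for illustration only.
example_variants = [
    ("variant_a", 5, 1000),    # 0.5%, below the threshold
    ("variant_b", 20, 1000),   # 2%, kept
    ("variant_c", 300, 1000),  # 30%, kept
]
print(common_variants(example_variants))
```

Real project releases come in their own file formats, so in practice the counts would be read from those files rather than typed in by hand.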

These are the gene association files submitted by Gene Ontology (GO) Consortium members. Files are in the GO annotation file format and are compressed using the UNIX gzip utility. Please see the appropriate README file for further details on each annotation set.
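Because the files are gzip-compressed, tab-separated text, they can be read with the Python standard library alone. The sketch below assumes the usual GAF layout, in which comment lines start with `!`, the second column holds the object ID, the third the symbol, and the fifth the GO ID. The function names and the file name in the usage comment are placeholders, and the README for a given annotation set remains the authority on its exact columns.

```python
import gzip


def read_gaf(lines):
    """Yield (object_id, symbol, go_id) tuples from an iterable of GAF lines."""
    for line in lines:
        # Header and comment lines in GAF files start with "!".
        if line.startswith("!") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip malformed rows instead of failing
        yield fields[1], fields[2], fields[4]


def read_gaf_gz(path):
    """Open a gzip-compressed GAF file in text mode and parse it."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from read_gaf(handle)


# Usage (the file name is a placeholder):
# for object_id, symbol, go_id in read_gaf_gz("annotations.gaf.gz"):
#     print(object_id, symbol, go_id)
```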

The UCI Machine Learning molecular biology data set was used by Ning Qian and Terry Sejnowski in their study using a neural network to predict the secondary structure of certain globular proteins. The idea is to take a linear sequence of amino acids and to predict, for each amino acid, which secondary structure it is part of within the protein. The data set contains both a large set of training data and a distinct set of data that can be used for testing the resulting network.
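One common way to frame this per-residue prediction is to give the network a fixed-size window of neighbouring residues centred on the amino acid being classified. The sketch below shows that encoding step only. The window size of 13, the padding symbol, and the label characters are illustrative assumptions and may not match the files in the data set.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes
PAD = "-"                             # assumed spacer for positions past either end
ALPHABET = AMINO_ACIDS + PAD


def one_hot(residue):
    """Encode one residue as a 21-element indicator vector."""
    vector = [0] * len(ALPHABET)
    vector[ALPHABET.index(residue)] = 1  # raises ValueError for unknown symbols
    return vector


def windows(sequence, size=13):
    """Yield one window of `size` residues centred on each position."""
    if size % 2 == 0:
        raise ValueError("window size must be odd")
    half = size // 2
    padded = PAD * half + sequence + PAD * half
    for i in range(len(sequence)):
        yield padded[i:i + size]


def encode(sequence, labels, size=13):
    """Pair each residue's flattened window encoding with its structure label."""
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    examples = []
    for window, label in zip(windows(sequence, size), labels):
        features = [bit for residue in window for bit in one_hot(residue)]
        examples.append((features, label))
    return examples


# Toy input: a made-up sequence with made-up labels (h = helix, e = sheet, _ = coil).
examples = encode("ACDEF", "hhe__", size=3)
print(len(examples), len(examples[0][0]))
```

Each example has `size * 21` input features, so a classifier trained on these pairs predicts the label of the central residue from its local context.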

The Catalogue of Life is a comprehensive and authoritative global index of species. It consists of a single integrated species checklist and taxonomic hierarchy. The Catalogue holds essential information on the names, relationships and distributions of over 1.6 million species. This figure continues to rise as information is compiled from diverse sources around the world.

The ENCODE (Encyclopedia of DNA Elements) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active.

The site hosts experimental data sets that will be valuable for testing computational models of the brain and new analysis methods. The data include physiological recordings from sensory and memory systems, as well as eye movement data.

GEO is a public functional genomics data repository supporting MIAME-compliant data submissions. Array- and sequence-based data are accepted. Tools are provided to help users query and download experiments and curated gene expression profiles.

The Broad Institute offers a rich set of datasets on bioinformatics and computational biology.