Top 5 in Bioinformatics

I recently applied for a Moore Foundation grant in Data Science for the biological sciences. As part of the pre-application, I was asked to choose the top 5 works in data science in my field. Not so sure about data science, so I picked what I think are the most influential works in Bioinformatics, which is what my proposal was about. Anyhow, the choice was tough, and I came up with the following. The order in which I list the works is chronological, as I make no attempt to rank them. If you ask me in the comments “How could you choose X over Y?” my reply would probably be: “I didn’t”.

Dayhoff, M.O., Eck RV, and Park CM. 1972. A model of evolutionary change in proteins. Pp. 89-99 in Atlas of protein sequence and structure, vol. 5, National Biomedical Research Foundation, Washington D.C.

Summary: this is the introduction of the PAM matrix, the paper that set the stage for our understanding of molecular evolution at the protein level, sequence alignment, and the BLASTing we all do. The questions she asked: how can we quantify the changes between protein sequences? How can we develop a system that tells us how proteins evolve over time? Dayhoff developed an elegant statistical method to do so, which she named PAM, for “Point Accepted Mutation”. She aligned hundreds of proteins and derived the frequency with which the different amino acids substitute for each other. Dayhoff introduced a more robust version in 1978, once enough protein sequences were available for her to count a much larger number of substitutions.
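To make the idea of deriving substitution frequencies from alignments concrete, here is a minimal Python sketch. It is not Dayhoff's actual procedure (which works from phylogenetic trees of closely related sequences, estimates per-residue mutabilities, and extrapolates to longer evolutionary distances by matrix multiplication). It only shows the core log-odds idea: count how often two residues are aligned, and compare that to what their background frequencies would predict. The function name `log_odds_scores` and the toy input are my own, for illustration only.

```python
import math
from collections import Counter

def log_odds_scores(aligned_pairs):
    """Toy log-odds substitution scores from pairwise-aligned sequences.

    aligned_pairs: iterable of (seq_a, seq_b) strings of equal length,
    where '-' marks a gap. Columns containing a gap are skipped.
    """
    pair_counts = Counter()     # ordered residue pairs, counted symmetrically
    residue_counts = Counter()  # background residue counts
    for seq_a, seq_b in aligned_pairs:
        for a, b in zip(seq_a, seq_b):
            if a == "-" or b == "-":
                continue
            # Count each aligned column in both directions so that
            # score(a, b) == score(b, a).
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
            residue_counts[a] += 1
            residue_counts[b] += 1

    total_pairs = sum(pair_counts.values())
    total_residues = sum(residue_counts.values())
    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total_pairs
        expected = (residue_counts[a] / total_residues) * (
            residue_counts[b] / total_residues
        )
        # Positive: the pair is aligned more often than chance predicts.
        scores[(a, b)] = math.log2(observed / expected)
    return scores

# Hypothetical toy alignment, far too small for real statistics.
toy_scores = log_odds_scores([("AALL", "AALV")])
```

Pairs that were never observed get no score here; a real matrix needs far more data, plus some handling of unseen substitutions.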

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.

BLAST, the Basic Local Alignment Search Tool, is the go-to computational workhorse in molecular biology. It is among the most cited papers in the life sciences, and probably the most influential paper in biology today. For the uninitiated: BLAST allows you to take a protein or DNA sequence and quickly search for similar sequences in a database containing millions. A search with one sequence takes seconds, or a few minutes at most. BLAST was actually introduced in another paper in 1990. However, the heuristics developed here allowed for the gapped alignment of sequences, and for searching for less similar sequences, with statistical robustness. BLAST changed everything in molecular biology, and moved biology into the data-rich sciences. If ever there was a case for giving the Nobel in Physiology or Medicine to a computational person, BLAST is it.

Durbin R., Eddy S., Krogh A., and Mitchison G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.

The Moore Foundation solicitation asked for “works” rather than just “research papers”. If there is anything common to all bioinformatics labs, it’s this book, an overview of the basic sequence analysis methods. This book summarizes the pre-2000 foundation upon which almost all our knowledge is currently built: pairwise alignment, Markov models, multiple sequence alignment, profiles, PSSMs, and phylogenetics.

Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium (2000) Nature Genetics 25: 25-29

Not a research paper, and not a book, but a “commentary”. This work popularized the use of ontologies in bioinformatics and cemented GO as the main ontology we use.

Pevzner PA, Tang H, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA. 2001 Aug 14;98(17):9748-53.

Sequence assembly using de Bruijn graphs, making assembly tractable for a large number of sequences. At the time, shotgun reads produced by Sanger sequencing could still be assembled in reasonable time by solving for a Hamiltonian path. Once next-generation sequencing data started pouring in, the use of de Bruijn graphs and an Eulerian path became essential. For a great explanation of the methodological transition, see this article in Nature Biotechnology.
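For readers who have not seen the trick, here is a minimal Python sketch of the idea, under strong simplifying assumptions: error-free reads, a single strand, and no repeats longer than k-1, so that the Eulerian path is unambiguous. It is not the algorithm from the paper, which also deals with sequencing errors and repeats. Every distinct k-mer becomes an edge between its (k-1)-mer prefix and suffix, and the assembly is a walk that uses every edge exactly once, found here with Hierholzer's algorithm. The function names are mine.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Each distinct k-mer is one edge: (k-1)-mer prefix -> (k-1)-mer suffix."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; assumes a non-empty graph that has an Eulerian path."""
    in_degree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    nodes = set(graph) | set(in_degree)
    # Start at the node with one more outgoing than incoming edge, if any;
    # otherwise the graph has an Eulerian cycle and any node will do.
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - in_degree[n] == 1),
        min(nodes),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())             # dead end: emit the node
    return path[::-1]

def assemble(reads, k):
    """Spell the sequence along the Eulerian path of the de Bruijn graph."""
    path = eulerian_path(de_bruijn_graph(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])

# Hypothetical toy reads covering the sequence ATGGCGTGCA.
contig = assemble(["ATGGCGT", "GGCGTGC", "GTGCA"], k=4)
```

With these toy reads and k=4, `contig` comes out as `ATGGCGTGCA`. The point is the cost: finding an Eulerian path takes time linear in the number of edges, whereas the Hamiltonian path formulation is NP-complete in general.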

Yes, I know there are many deserving works not in here. When boiling down to five, the choice is almost arbitrary. If you feel offended that a work you like is not here, then I’m sorry.