Copyright: © 2010 Russell F. Doolittle. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Many other authors have already recorded their thoughts on the evolutionary roots of bioinformatics in accounts that are doubtless more thorough and balanced than can be recorded in this brief personal reflection ([1], [2], inter alia). All are in agreement about certain pivotal events that were true milestones: the double-helix model of DNA, the first determination of the amino acid sequence of a protein, and the conceptual linking of DNA sequences and protein sequences. My plan is to expand on some related matters with the hope of providing some additional background on those early scenes.

Bioinformatics as a formal discipline came of age in the late 1980s, greatly stimulated by the 1989 Human Genome Initiative. The roots of the field go back several decades earlier, however, to an era when computers were not needed to manage the data. In this personal reflection, I review the confluence of events beginning in the 1950s that brought a number of fields together in a common pursuit. Particularly, I offer some comments about early amino acid sequence comparisons, the results of which revealed so much about evolution, and how the computer became necessary only when the number of known sequences began to grow exponentially.

Sequences

Sequences, the simple order of individual units in biological polymers, are at the heart of bioinformatics, and the search for relationships among them and the reconstruction of their histories have arguably proved the most informative of biological inquiries. Today dozens of giant data banks store what seem to be countless numbers of nucleic acid and protein sequences. But there was a time, only 50 or 60 years ago, when hardly any sequences were known at all. Nonetheless, there were those who already appreciated that the web of all life would eventually be reconstructed on the basis of sequence data alone. There was an obligatory progression of events, beginning with chemistry, then biology, and, finally, the need for computers.

Among the technological advances that made sequence determinations possible, two are extremely notable: the introduction in the 1940s of paper chromatography as a simple tool for identifying amino acids and their derivatives [3], for one, and the use of suitable chemical reagents that reacted (more or less) exclusively with amino groups, for another—particularly an amino-tagging reagent by Sanger [4] and an amino acid-labilizing reagent by Edman [5]. Some important details of their seminal and unique contributions need to be described here, however briefly.

Chemistry

It must be difficult for a young scientist today to imagine how primitive circumstances were in the mid-20th century. The effort needed to determine even a short amino acid sequence was more than considerable; it was daunting (some of that tedium may carry through in the following description). Typically, the first step in determining the sequence of a peptide or protein was to establish its amino acid composition. It was well known that heating a protein or peptide with strong aqueous acid broke the bonds between the constituent amino acids (unhappily, glutamines and asparagines were changed into glutamic and aspartic acids in the process, and a few other amino acids like tryptophan were damaged). The resulting hydrolysate could be spotted on a large piece of filter paper and the various amino acids separated by letting an organic solvent creep over the paper, partitioning the amino acids according to their relative solubilities in one phase or the other. The locations of the amino acids could be found by staining the dried paper with ninhydrin, a compound that gave a blue color with amino groups.

After a preliminary amino acid composition was in hand, the next step was to break the protein or peptide into smaller pieces (the “divide and conquer” strategy). The simplest method was to use partial acid hydrolysis, taking advantage of the fact that bonds next to some amino acids break more easily than others. The other popular option was to use proteolytic enzymes like trypsin or chymotrypsin. In either case, the peptide fragments were purified, often by paper chromatography, and their individual amino acid compositions determined. Indeed, one reason that protein sequences were attacked first, rather than RNA or DNA, was that there were 20 different amino acids, and a random, partial hydrolysis of a polypeptide chain could give rise to smaller peptides with unique compositions. The logistics of the same approach for a polymer made of only four different things were impossible to contemplate.
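The contrast between 20 building blocks and four can be made concrete with a little counting. The sketch below is an illustration only, not anything taken from the work described: it counts how many distinct compositions (unordered collections of units) a fragment of a given length can have. The function name `distinct_compositions` is hypothetical, and the fragment lengths are chosen arbitrarily.

```python
from math import comb


def distinct_compositions(alphabet_size: int, fragment_length: int) -> int:
    """Count unordered compositions of a fragment.

    A composition ignores order, so this is the number of multisets of
    size `fragment_length` drawn from `alphabet_size` kinds of unit.
    """
    return comb(fragment_length + alphabet_size - 1, alphabet_size - 1)


if __name__ == "__main__":
    # Illustrative fragment lengths only; real fragments varied widely.
    for length in (2, 3, 4, 5):
        amino = distinct_compositions(20, length)
        nucleo = distinct_compositions(4, length)
        print(f"length {length}: {amino} amino acid vs {nucleo} nucleotide compositions")
```

For a fragment of four units there are 8855 possible amino acid compositions but only 35 possible nucleotide compositions, so composition alone will seldom identify a fragment of a four-letter polymer.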

More Chemistry

The Sanger reagent, fluorodinitrobenzene (FDNB), had several important features. First, the bond between it and the tagged amino acid was resistant to acid hydrolysis; second, the derivatized amino acid was sufficiently non-polar that it could be extracted from the acid hydrolysate with an organic solvent like ether; and finally, the derivatives were bright yellow and could be readily identified by paper chromatography. The operation could be conducted on the starting peptide or protein, as well as on the fragments generated by various means. It was a slow and arduous process and very much limited to smallish proteins.

The Edman reagent, phenylisothiocyanate (PheNCS), utilized a completely different strategy. A related compound, phenylisocyanate (PheNCO), had previously been shown to be a labilizing agent that could tag and release an amino acid from the amino-terminus of a peptide and had been used successfully on a tripeptide as long ago as 1930 [6]. Edman's PheNCS was much superior, however, as the sulfur atom was much more favorably disposed to the second step of the operation, a rearrangement that led to the separation of the terminal amino acid from the parent peptide or protein in anhydrous acid, conditions that left the remaining peptide bonds intact. As a result, the operation could be repeated over and over again, alternating between a coupling reaction at high pH and cleavage in dry acid, liberating one amino acid at a time from the amino-terminal end. The cleaved residues, labeled as they were with the PheNCS, could be extracted with an organic solvent and, once again, identified by paper chromatography. Because the coupling and cleavage at each step were never 100 percent complete, the operation tended to get out of phase and was no longer informative after several cycles. Additionally, the parent peptide tended to wash away during the repeated extractions, imposing a further limit on how many cycles could be conducted successfully.

Nonetheless, it was an elegant method, even allowing for the typical procedure being limited to one amino acid cycle per day. Moreover, procedures that combined the Sanger and Edman approaches were devised, and these speeded up determinations significantly. Column partition methods for separating peptides and amino acids were also being developed during this period, and the introduction of an automatic amino acid analyzer in the late 1950s was much heralded by the claim that a single analysis could now be performed in as little as 24 hours [7]!
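The limit imposed by incomplete reactions and washout can be illustrated with a deliberately simplified model. The sketch assumes that each cycle succeeds on a fixed fraction of the chains (`step_yield`) and retains a fixed fraction of the peptide (`retention`), so that the in-phase signal decays geometrically. The numerical values, the 50 percent cutoff, and both function names are hypothetical and are not taken from the source.

```python
def in_phase_signal(cycles: int, step_yield: float, retention: float = 1.0) -> float:
    """Fraction of the starting peptide still in phase after `cycles` rounds.

    Simplified model: every cycle multiplies the usable signal by
    step_yield (coupling and cleavage both succeed) times retention
    (peptide not washed away during extraction).
    """
    return (step_yield * retention) ** cycles


def informative_cycles(step_yield: float, retention: float = 1.0,
                       threshold: float = 0.5) -> int:
    """Number of cycles completed before the in-phase signal drops below threshold."""
    per_cycle = step_yield * retention
    if not 0.0 < per_cycle < 1.0:
        raise ValueError("step_yield * retention must lie strictly between 0 and 1")
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must lie in (0, 1]")
    cycles = 0
    while in_phase_signal(cycles + 1, step_yield, retention) >= threshold:
        cycles += 1
    return cycles


if __name__ == "__main__":
    # Assumed values for illustration: 90% step yield, 5% washout per cycle.
    print(informative_cycles(0.90))
    print(informative_cycles(0.90, retention=0.95))
```

Under these assumed numbers the signal stays above the cutoff for six cycles on yield alone and for only four once washout is included, which is the sense in which the two losses compound.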

Biology

From the sequence perspective, the 1950s was largely a decade of polypeptide hormones, several of which exhibited distinct similarities to each other. The amino acid sequence of bovine (cattle) insulin, which is composed of two chains totaling 51 residues, was completed in 1955 [8], as was that of pig corticotropin, a single polypeptide chain of 39 residues [9]. As was the case for insulin, corticotropins from several species revealed a variability restricted to one small region.

Once the technology was developed for determining protein sequences, choices had to be made about which proteins to study. The necessary restrictions were that they be abundant, small, and easy to purify (and fundable). As it happened, a relatively small group of such proteins was able to provide insights into the two subjects of most interest to evolutionists, which were intra- and inter-species sequence variability, for one, and gene duplications and the evolution of new proteins, for the other. The most popular proteins for study in the 1950s were, in order of increasing size, cytochrome c, ribonuclease, hemoglobin, and the serine proteases. The first of these to be completed, and the first of more than a hundred residues, was cytochrome c [10]. It may seem a meager list today, but this small cast set the stage for all that was to follow.

Hemoglobin was probably the most illuminating, providing the most useful data on several fronts. By this time it was known that most vertebrate hemoglobins were composed of two pairs of subunits the size of myoglobin, and these were genetically endowed in the fashion of one gene, one polypeptide chain. The discovery in 1949 that an apparent single amino acid replacement in hemoglobin could lead to a disease in which red blood cells became sickle shaped was a blockbuster [11]. The impact was almost as great 9 years later when Vernon Ingram showed that the particular replacement was a valine for a glutamic acid [12]. In line with the techniques of the day, Ingram had first digested normal and sickle cell hemoglobins with trypsin and then used a combination of paper chromatography and electrophoresis to make a two-dimensional map of the resulting peptides. A comparison of maps made from normal and sickle cell hemoglobins showed that only one of the spots had shifted its position, and the amino acid composition of that peptide showed that the change was from a glutamic acid in the normal hemoglobin to a valine in hemoglobin S.

The combination method of paper electrophoresis and chromatography, which Ingram called “fingerprinting,” was quickly taken up by other labs for identifying changes in other variant hemoglobins, a large number of which had been identified clinically. Fingerprinting was also a simple way of comparing hemoglobins and other proteins from different species, and several other laboratories promptly undertook such studies. Emile Zuckerkandl and Dick Jones, working in Linus Pauling's laboratory, began a study of hemoglobins from different species by this method [13], and workers in Chris Anfinsen's group began charting differences in various animal ribonucleases [14]. By this time, also, full determinations of cytochrome c sequences from several sources were under way in several laboratories.

In 1959, Anfinsen's book The Molecular Basis of Evolution appeared [15]. This slender volume provided some basic paleontology and genetics as background, as well as the rudiments of DNA and protein structure, including a few simple sequence comparisons of hormones and partially sequenced proteins. Anfinsen coupled his discussions with some bold pronouncements about how DNA sequences must be correlated with amino acid sequences. Even though the genetic code was yet to be deciphered, he conjured up a fictitious set of base triplets and showed how single base substitutions in the gene for the human hemoglobin β chain could change the wild type glutamic acid into the valine found in hemoglobin S and how another single base change at the same position could yield the lysine found in hemoglobin C.
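Anfinsen's argument can be replayed with the genetic code as it is now known. The sketch below does not use his fictitious triplets. It assumes the standard genetic code and takes GAG, one of the two glutamic acid codons, as the starting point, then lists every amino acid reachable by a single base substitution. The function name `single_base_changes` is hypothetical, and codons are written in the DNA alphabet (T rather than U).

```python
BASES = "TCAG"

# Standard genetic code, listed in TCAG order for the first, second, and third base.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def single_base_changes(codon: str) -> dict[str, list[str]]:
    """Map each amino acid reachable by one base substitution to its mutant codons.

    Synonymous changes are included; '*' marks a stop codon.
    """
    reachable: dict[str, list[str]] = {}
    for position, original in enumerate(codon):
        for base in BASES:
            if base == original:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            reachable.setdefault(CODON_TABLE[mutant], []).append(mutant)
    return reachable


if __name__ == "__main__":
    start = "GAG"  # assumed glutamic acid codon (E)
    for amino_acid, codons in sorted(single_base_changes(start).items()):
        print(amino_acid, codons)
```

With GAG as the assumed starting codon, valine is reached through GTG and lysine through AAG, each by one substitution at a different base of the same triplet, which is the pattern Anfinsen anticipated for hemoglobins S and C.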