It would be no exaggeration to say that biology has entered the era of the -ome. It started with the study of the genome, which identified all the DNA contained in the cells of a given species. From there, the availability of DNA chips allowed researchers to study the transcriptome, the portions of the genome that are copied into RNA. The obvious next step was to look at how much of the transcriptome was made into proteins, making proteomics an inevitability. In the view of its advocates, proteomics should provide a far more accurate picture of what's happening inside a cell.

But proteins don't lend themselves to any sort of convenient sequencing, and the proteome often contains sets of closely related proteins produced by alternate RNA transcripts and by chemical modifications of the proteins themselves. As a result, proteomics has had a bit of an awkward start. To explore the reasons why, a consortium called the Human Proteome Organization sent a sample of 20 known proteins to 27 different proteomics labs; only seven identified all 20, and only one reported everything it should have. A paper describing the problems, which ranged from basic chemistry to analysis algorithms, was published in Nature Methods yesterday.

All 27 labs shared a common general approach to proteomics. The first step is to partially separate the mixture of proteins by size and charge. This is done by using an electric current to drive them through the three-dimensional polymer mesh of an acrylamide gel, which slows larger proteins more than smaller ones. This essentially splits a single sample into many smaller ones, each containing a small subset of the original mix of proteins. The next trick is identifying the proteins in each of these samples.
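
Conceptually, the gel step is just fractionation: one complex mixture becomes a series of simpler ones, each covering a narrow size range. A toy sketch of the idea in Python (the protein names and masses are invented for illustration, not taken from the study):

```python
# Toy illustration of gel-based fractionation: bin proteins by molecular
# weight so each "slice" holds only a narrow size range of the original mix.
# Protein names and masses (in kilodaltons) are invented for this example.
sample = {
    "protein_A": 12.4, "protein_B": 15.1, "protein_C": 33.8,
    "protein_D": 36.0, "protein_E": 67.5, "protein_F": 70.2,
}

def fractionate(proteins, slice_width_kda=10.0):
    """Group proteins into size bins, mimicking cutting a gel lane into slices."""
    slices = {}
    for name, mass_kda in proteins.items():
        bin_index = int(mass_kda // slice_width_kda)
        slices.setdefault(bin_index, []).append(name)
    return slices

for bin_index, members in sorted(fractionate(sample).items()):
    low, high = bin_index * 10, (bin_index + 1) * 10
    print(f"{low}-{high} kDa slice: {members}")
```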

The method of choice is mass spectrometry. The proteins are chopped into smaller fragments using an enzyme called trypsin; these fragments are then vaporized, ionized, and accelerated through an electric field. This separates the fragments very precisely according to their mass, which depends on the fragment's sequence. It's possible to calculate which known proteins contain a fragment of that mass (often, there's only one), allowing the identity of the protein to be determined. In this case, the gel step should have largely separated the 20 proteins, and trypsin digestion should then have created a total of 22 fragments in the ideal range for mass spec identification. But, in all but one of the 27 labs, various things went wrong, and the authors of the paper did their best to figure out why.
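
The identification step can be sketched in a few lines of code, although real search engines are far more elaborate. The following is a minimal illustration of the idea, not the software any of these labs used: it performs an in-silico trypsin digest (cutting after lysine or arginine, except when the next residue is proline), predicts each peptide's monoisotopic mass, and checks observed masses against those predictions within a tolerance. The protein sequence and observed masses are invented for the example.

```python
# Minimal sketch of peptide-mass matching; not any lab's actual pipeline.
# Monoisotopic residue masses in daltons; a peptide's mass adds one water.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def tryptic_peptides(sequence):
    """Cut after K or R, but not when the next residue is P (trypsin's rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in "KR" and (i + 1 == len(sequence) or sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides

def peptide_mass(peptide):
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

# Invented example: match observed masses to predicted tryptic peptides.
protein = "MKWVTFISLLLLFSSAYSRGVFRRDTHK"   # made-up sequence fragment
predicted = {pep: peptide_mass(pep) for pep in tryptic_peptides(protein)}

def match(observed_mass, predictions, tolerance_ppm=20.0):
    """Return the peptides whose predicted mass falls within the tolerance."""
    return [pep for pep, m in predictions.items()
            if abs(observed_mass - m) / m * 1e6 <= tolerance_ppm]

for observed in (477.27, 1234.56):
    print(observed, "->", match(observed, predicted))
```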

Part of the issue was that, despite the shared general approach, there were lots of devils in the details, from the precise chemistry of individual steps to the equipment used and the analysis performed. A number of the labs, for example, had trouble with protein fragments that contained cysteine because of sample-processing issues. Others used procedures that led to contamination by a protein called BSA, which is often used to calibrate the equipment; several picked up keratins from human skin, or signal from the trypsin used in sample processing. Choice of equipment also played a role; labs that used a Fourier transform ion cyclotron resonance instrument tended to have better data to work with.
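
The contamination problems, at least, are the kind of thing software can flag once you know to look for them; a typical pipeline screens hits against a list of the usual suspects before reporting. A minimal, purely illustrative sketch (the hit list and contaminant keywords below are made up, not taken from the paper):

```python
# Illustrative screen for common contaminants (BSA, keratins, trypsin).
# The identified-protein list here is invented for the example.
CONTAMINANT_KEYWORDS = ("serum albumin", "keratin", "trypsin")

identifications = [
    "Carbonic anhydrase 2",
    "Bovine serum albumin",
    "Keratin, type I cytoskeletal 10",
    "Trypsin precursor",
    "Beta-galactosidase",
]

def screen(hits):
    """Split identifications into likely sample proteins and likely contaminants."""
    genuine, contaminants = [], []
    for name in hits:
        is_contaminant = any(k in name.lower() for k in CONTAMINANT_KEYWORDS)
        (contaminants if is_contaminant else genuine).append(name)
    return genuine, contaminants

genuine, contaminants = screen(identifications)
print("Reported:", genuine)
print("Flagged as contaminants:", contaminants)
```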

But the big problem wasn't the processing. To pin down where things went wrong, the authors obtained the raw data from the labs that received samples so they could analyze it themselves. That's where the fun started, as "the initially deposited data had several problems, including incomplete files, proprietary software formats, and screenshots of data displays in software rather than actual data files." Once they plowed through these problems, they discovered that most of the labs had actually generated data sufficient to identify the proteins in the sample, but failed to do so for various reasons.

These included analyzing the mass spec data against different protein databases, using different search methods, and relying on different algorithms to predict the masses of the proteins in the database used. Differences in protein terminology produced naming errors and false redundancies. Different sensitivity cutoffs eliminated genuine sample proteins in some cases and pulled in contaminants in others. In short, even when a lab was in a position to get things right, the software pipeline involved in the analysis often failed it.
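
The effect of a sensitivity cutoff is easy to see in miniature. In the toy example below (the scores and protein labels are invented), the same search output yields different answers depending on where the threshold sits: a strict cutoff drops a genuine sample protein, while a loose one lets a contaminant through.

```python
# Toy demonstration of how a score threshold changes the reported protein list.
# Scores and labels are invented; real search engines use statistical cutoffs.
search_hits = [
    ("sample protein 1", 95.0),
    ("sample protein 2", 62.0),   # genuine, but weakly scored
    ("keratin (contaminant)", 48.0),
]

def report(hits, cutoff):
    """Keep only hits scoring at or above the chosen threshold."""
    return [name for name, score in hits if score >= cutoff]

print("strict cutoff (80):", report(search_hits, 80))   # misses a sample protein
print("loose cutoff (40): ", report(search_hits, 40))   # picks up the contaminant
```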

The authors conclude with what's essentially a plea for standardization, so that the same data would yield the same results regardless of who performs the analysis. That would include a standardized set of protein databases, analysis algorithms, and calibration procedures. Something of the sort would seem essential if the wider scientific community is to have any confidence that a proteomic analysis is actually informative.

Beyond that, the paper serves as a welcome caution. New approaches tend to arrive with a surplus of fanfare and hype, but developing them into something consistent and reliable is often a long, hard slog. But, until that's done, it's really difficult to know what the approach is actually good for.

Nature Methods, 2009. DOI: 10.1038/NMETH.1333