In part one of our series on scientific data preservation, we spent some time discussing the challenges of making sure the samples used to generate scientific data get kept around. It might seem that there's an obvious solution to that issue: document things, digitize them, and take advantage of the rapid increases in hard drive capacity. After all, that's what we do with data from one-time events, like earthquakes and astronomical events. It's a nice thought, but two recent developments point out that it's little more than wishful thinking.

The first is that, as the LHC has ramped up the pace of its collisions, software filters have kicked in that are starting to determine which events actually get archived. Instead of a "preserve everything" approach to scientific data, the people running the LHC are now taking a "preserve the interesting stuff and a random sample of the rest" approach. As collision intensities continue to ramp up, that random sample will be an ever-shrinking slice of the full complement of events taking place. At full beam intensity, three levels of filtering will take place, each of which will discard all but one of every 10,000 collisions recorded.

It's possible to write this off as an exception, as the LHC is a one-of-a-kind, multibillion dollar machine, and we won't build another one like it. But physics isn't the only field that's drowning in a flood of data, as a chart from a paper in Genome Biology reveals. For decades, hard drive capacity per dollar has doubled roughly every 14 months, comfortably outstripping the growth in DNA sequencing productivity during sequencing's first decade. But since second-generation DNA sequencing machines appeared on the market, the base-pairs-per-dollar figure has been doubling every five months. Genome sequencing centers are now struggling to cope with the flood.

The reality is that we simply can't save everything. And, as a result, scientists have to fall back on judgment calls, both professional and otherwise, in determining what to keep and how to keep it. We'll consider the what now, and deal with the how separately.

What's really "raw," anyway?

Digital cameras actually provide an illustrative example of the challenges of digitizing scientific data. Ostensibly, the images are simply a recording of the light that hits a sensor at a specific time and place. Except nobody ever actually gets that. Low-end cameras don't output raw files; high-end cameras correct for bad pixels and the like in hardware; things are date stamped only if you set the camera's clock properly, and you're on your own when it comes to location data. Even if all those hurdles can be cleared, the end results need to be in a format that's easy to interpret and compact, with extraneous data culled and random variability compensated for. So, even though digitization could help provide an exact record, it generally doesn't.

Equivalents to all of this exist for scientific data. It's easy to argue that science should be focused on preserving the raw data, but lots of hardware doesn't even provide raw data anymore, and a lot of the processing is there simply to compensate for defective hardware.

On the low end, the purity and concentration of DNA solutions can be determined by using a spectrophotometer to measure the absorption of specific wavelengths of light and performing some minor calculations. Most companies ultimately recognized that this is what their machines were being used for, and put together a one-button program that did all the work and spat out the concentration. Nobody ever sees the raw data anymore. And the number never gets used directly; instead, the concentration used for a given experiment (which is derived from the figure the machine spits out) is what ends up being recorded.
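Conceptually, those "minor calculations" are trivial. Here's a minimal sketch in Python of the arithmetic such a one-button program performs, assuming the standard conversion factor of 50 µg/mL per absorbance unit for double-stranded DNA (the function name and sample numbers are illustrative only):

```python
def dsdna_concentration(a260: float, a280: float, dilution: float = 1.0):
    """Estimate dsDNA concentration and purity from absorbance readings.

    Uses the standard conversion of 50 ug/mL per A260 unit for
    double-stranded DNA; an A260/A280 ratio near 1.8 suggests a
    reasonably pure sample.
    """
    concentration = a260 * 50.0 * dilution  # ug/mL
    purity = a260 / a280                    # ~1.8 for clean dsDNA
    return concentration, purity

conc, ratio = dsdna_concentration(a260=0.25, a280=0.14, dilution=10.0)
print(f"{conc:.0f} ug/mL, A260/A280 = {ratio:.2f}")  # 125 ug/mL, 1.79
```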

DNA sequencing machines are somewhere in the middle. For most machines, the raw data is a curve that represents light emissions. But, obviously, light emissions on their own don't actually say anything about a DNA base. Instead, the raw signal, along with its associated metadata, gets converted into a more informative format; for Sanger sequencing, the result is a trace file, a plot of four colored curves, one per base, along the length of the read.

This lets a trained eye determine the identity of a base, and get a sense of the confidence in that call, based on the height and shape of the curve. But the human genome would never have been completed if people had to make judgments about every single base, or manually track which sequencing project every trace file belonged to. Instead, algorithms were developed that make base calls and assign each base a quality score reflecting the confidence in the call, and an analysis and storage pipeline was built to ensure that samples were properly identified.
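The convention those quality scores follow is logarithmic: a score of Q means the base caller estimates the call has roughly a 1-in-10^(Q/10) chance of being wrong. A minimal sketch of the conversion (function names are just for illustration):

```python
import math

def quality_score(error_probability: float) -> int:
    """Phred-style quality score: Q = -10 * log10(P_error)."""
    return round(-10 * math.log10(error_probability))

def error_probability(quality: int) -> float:
    """Invert a quality score back into an estimated error probability."""
    return 10 ** (-quality / 10)

print(quality_score(0.001))    # 30: a 1-in-1,000 chance the call is wrong
print(error_probability(20))   # 0.01
```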

By this point, we're well removed from the raw data, which was actually the amount of light registered by a digital sensor. But the National Center for Biotechnology Information, the organization tasked with storing genome data in the US, stopped requiring trace data with the completion of the human genome. And, for many of the nongenomic sequences it stores, it relies on the investigators who submit data to retain the original records used to derive the sequences. Some of those records date back to the era of radioactive sulfur and X-ray films, produced in labs that have since shut down; you can safely bet they're no longer available.

At the high end, things like bad pixels affect even the most sophisticated instruments, but NASA calibrates its instruments before scientific use and releases the adjusted data to the scientific community. So, you can safely assume that recent dumps of Kepler data don't include any artifacts caused by bad hardware.
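What that kind of correction looks like in practice obviously depends on the instrument, but the general idea of masking known-bad pixels and interpolating over them can be sketched in a few lines (a toy illustration, not anything resembling NASA's actual pipeline):

```python
import numpy as np

def patch_bad_pixels(frame: np.ndarray, bad: np.ndarray) -> np.ndarray:
    """Replace pixels flagged in the boolean mask 'bad' with the median
    of their good neighbors -- a crude stand-in for the calibration a
    real pipeline applies before data release."""
    fixed = frame.astype(float).copy()
    for r, c in zip(*np.nonzero(bad)):
        rs, cs = slice(max(r - 1, 0), r + 2), slice(max(c - 1, 0), c + 2)
        good_neighbors = fixed[rs, cs][~bad[rs, cs]]
        if good_neighbors.size:
            fixed[r, c] = np.median(good_neighbors)
    return fixed
```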

Similar things happen for other hardware. For example, the satellite NASA uses to track ocean levels (currently, Jason-1) suffers from a gradual orbital decay that slowly brings it closer to the ocean. Its raw data is essentially meaningless—anyone wanting to actually analyze ocean levels needs to use the adjusted data provided by NASA.
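The underlying relation is simple; what's hard is the precise orbit solution that goes into it. A toy sketch, with made-up numbers, of why the raw range is useless on its own:

```python
def sea_surface_height(orbit_altitude_m: float, corrected_range_m: float) -> float:
    """Sea surface height is the satellite's precisely tracked altitude
    (which is what absorbs the orbital decay) minus the corrected radar
    range to the ocean surface; the raw range means nothing without
    that orbit solution."""
    return orbit_altitude_m - corrected_range_m

# Illustrative numbers only: a centimeter-scale drift in the orbit solution
# shows up directly as a centimeter-scale error in "sea level."
print(sea_surface_height(1_336_000.00, 1_335_974.12))  # 25.88
```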

The decision on how close to the original instrument output to go when saving raw data is, again, a case-by-case call based on the instruments and the scientific needs, so it's not possible to set a one-size-fits-all preservation policy. Scientifically, it can make all the difference, as two cases illustrate.

One of the early controversies in the area of climate science arose over discrepancies between the measurement of surface temperatures and satellite-derived measurements of the lower atmosphere. Eventually, however, various sources of instrument error in the satellite record were identified; once they were corrected for, the two records came into rough agreement (you can get a sense of some of the issues from this paper).

Earlier in June came another calibration controversy, this one about data from NASA's WMAP satellite, which images the microwave background that resulted from the Big Bang. Although the satellite's original calibration seems to have been widely accepted by the cosmology community, a separate group is apparently claiming that it can perform a different calibration and get results that do away with dark matter and dark energy.

None of this would be possible without everyone having access to the original, uncalibrated data. But, despite the potential importance of that data, scientists in other fields are making justifiable decisions to pitch everything but heavily processed data.

All of this, of course, assumes that we can manage to keep any of the data around long enough to argue about it in the first place, an assumption that, as we'll see, is often sorely tested.