I am planning on storing scans from a mass spectrometer in a MySQL database and would like to know whether storing and analyzing this amount of data is remotely feasible. I know performance varies wildly depending on the environment, but I'm looking for the rough order of magnitude: will queries take 5 days or 5 milliseconds?

Input format

Each input file contains a single run of the spectrometer; each run is comprised of a set of scans, and each scan has an ordered array of datapoints. There is a bit of metadata, but the majority of the file is comprised of arrays 32- or 64-bit ints or floats.

Host system

|----------------+-------------------------------| | OS | Windows 2008 64-bit | | MySQL version | 5.5.24 (x86_64) | | CPU | 2x Xeon E5420 (8 cores total) | | RAM | 8GB | | SSD filesystem | 500 GiB | | HDD RAID | 12 TiB | |----------------+-------------------------------|

There are some other services running on the server using negligible processor time.

File statistics

|------------------+--------------| | number of files | ~16,000 | | total size | 1.3 TiB | | min size | 0 bytes | | max size | 12 GiB | | mean | 800 MiB | | median | 500 MiB | | total datapoints | ~200 billion | |------------------+--------------|

The total number of datapoints is a very rough estimate.

Proposed schema

I'm planning on doing things "right" (i.e. normalizing the data like crazy) and so would have a runs table, a spectra table with a foreign key to runs , and a datapoints table with a foreign key to spectra .

The 200 Billion datapoint question

I am going to be analyzing across multiple spectra and possibly even multiple runs, resulting in queries which could touch millions of rows. Assuming I index everything properly (which is a topic for another question) and am not trying to shuffle hundreds of MiB across the network, is it remotely plausible for MySQL to handle this?

Additional info

The scan data will be coming from files in the XML-based mzML format. The meat of this format is in the <binaryDataArrayList> elements where the data is stored. Each scan produces >= 2 <binaryDataArray> elements which, taken together, form a 2-dimensional (or more) array of the form [[123.456, 234.567, ...], ...] .

These data are write-once, so update performance and transaction safety are not concerns.

My naïve plan for a database schema is:

runs table

| column name | type | |-------------+-------------| | id | PRIMARY KEY | | start_time | TIMESTAMP | | name | VARCHAR | |-------------+-------------|

spectra table

| column name | type | |----------------+-------------| | id | PRIMARY KEY | | name | VARCHAR | | index | INT | | spectrum_type | INT | | representation | INT | | run_id | FOREIGN KEY | |----------------+-------------|

datapoints table

| column name | type | |-------------+-------------| | id | PRIMARY KEY | | spectrum_id | FOREIGN KEY | | mz | DOUBLE | | num_counts | DOUBLE | | index | INT | |-------------+-------------|

Is this reasonable?

So, as you may have been able to infer, I am the programmer, not the biologist in the lab, so I don't know the science nearly as well as the actual scientists.

Here's a plot of a single spectrum (scan) of the kind of data with which I'll be dealing:

The goal of the software is to figure out where and how significant the peaks are. We use a proprietary software package to figure this out now, but we want to write our own analysis program (in R) so we know what the heck is going on under the sheets. As you can see, the vast majority of the data are uninteresting, but we don't want to throw out potentially-useful data which our algorithm missed. Once we have a list of probable peaks with which we're satisfied, the rest of the pipeline will use that peak list rather than the raw list of datapoints. I suppose that it would be sufficient to store the raw datapoints as a big blob, so they can be reanalyzed if need be, but keep only the peaks as distinct database entries. In that case, there would be only a couple dozen peaks per spectrum, so the crazy scaling stuff shouldn't be as much of an issue.