I am planning on storing scans from a mass spectrometer in a MySQL database and
would like to know whether storing and analyzing this amount of data is remotely
feasible. I know performance varies wildly depending on the environment, but I'm
looking for the rough order of magnitude: will queries take 5 days or 5
milliseconds?
Input format
Each input file contains a single run of the spectrometer; each run comprises
a set of scans, and each scan has an ordered array of datapoints. There is a
bit of metadata, but the bulk of each file is arrays of 32- or 64-bit ints or
floats.
Host system
|----------------+-------------------------------|
| OS             | Windows 2008 64-bit           |
| MySQL version  | 5.5.24 (x86_64)               |
| CPU            | 2x Xeon E5420 (8 cores total) |
| RAM            | 8GB                           |
| SSD filesystem | 500 GiB                       |
| HDD RAID       | 12 TiB                        |
|----------------+-------------------------------|
There are some other services running on the server using negligible processor
time.
File statistics
|------------------+--------------|
| number of files  | ~16,000      |
| total size       | 1.3 TiB      |
| min size         | 0 bytes      |
| max size         | 12 GiB       |
| mean             | 800 MiB      |
| median           | 500 MiB      |
| total datapoints | ~200 billion |
|------------------+--------------|
The total number of datapoints is a very rough estimate.
Proposed schema
I'm planning on doing things "right" (i.e. normalizing the data like crazy) and
so would have a runs table, a spectra table with a foreign key to runs,
and a datapoints table with a foreign key to spectra.
The 200 billion datapoint question
I am going to be analyzing across multiple spectra and possibly even multiple
runs, resulting in queries which could touch millions of rows. Assuming I index
everything properly (which is a topic for another question) and am not trying to
shuffle hundreds of MiB across the network, is it remotely plausible for MySQL
to handle this?
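To make that concrete, a typical analysis might aggregate datapoints within an
m/z window across every spectrum in a handful of runs. Against the schema
sketched below, such a query would look roughly like this (the m/z range and
run ids are made up for illustration):

    SELECT s.run_id,
           COUNT(*)          AS n_points,
           AVG(d.num_counts) AS avg_counts
    FROM   datapoints d
           JOIN spectra s ON s.id = d.spectrum_id
    WHERE  s.run_id IN (1, 2, 3)               -- a few runs of interest
           AND d.mz BETWEEN 500.0 AND 510.0    -- a narrow m/z window
    GROUP  BY s.run_id;

Even a narrow window like that could easily touch millions of rows in the
datapoints table.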
UPDATE: additional info
The scan data will be coming from files in the XML-based
mzML format. The meat of this format is in the
<binaryDataArrayList> elements, where the data is stored. Each scan produces
>= 2 <binaryDataArray> elements which, taken together, form a 2-dimensional (or
more) array of the form [[123.456, 234.567, ...], ...].
These data are write-once, so update performance and transaction safety are not
concerns.
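Because the data only needs to be written once, my plan is to bulk-load each
parsed file rather than issue row-by-row INSERTs; something along these lines,
where the CSV path is just a placeholder for whatever the mzML parser emits:

    LOAD DATA LOCAL INFILE '/tmp/run_0001_datapoints.csv'
    INTO TABLE datapoints
    FIELDS TERMINATED BY ','
    (spectrum_id, mz, num_counts, `index`);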
My naïve plan for a database schema is:
runs table
| column name | type        |
|-------------+-------------|
| id          | PRIMARY KEY |
| start_time  | TIMESTAMP   |
| name        | VARCHAR     |
|-------------+-------------|
spectra table
| column name    | type        |
|----------------+-------------|
| id             | PRIMARY KEY |
| name           | VARCHAR     |
| index          | INT         |
| spectrum_type  | INT         |
| representation | INT         |
| run_id         | FOREIGN KEY |
|----------------+-------------|
datapoints table
| column name | type        |
|-------------+-------------|
| id          | PRIMARY KEY |
| spectrum_id | FOREIGN KEY |
| mz          | DOUBLE      |
| num_counts  | DOUBLE      |
| index       | INT         |
|-------------+-------------|
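In MySQL DDL the naïve schema would be roughly as follows (exact integer widths
and VARCHAR lengths are guesses on my part, and `index` needs backticks because
it is a reserved word):

    CREATE TABLE runs (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        start_time TIMESTAMP,
        name       VARCHAR(255)
    ) ENGINE=InnoDB;

    CREATE TABLE spectra (
        id             BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name           VARCHAR(255),
        `index`        INT,
        spectrum_type  INT,
        representation INT,
        run_id         BIGINT UNSIGNED NOT NULL,
        FOREIGN KEY (run_id) REFERENCES runs (id)
    ) ENGINE=InnoDB;

    CREATE TABLE datapoints (
        id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        spectrum_id BIGINT UNSIGNED NOT NULL,
        mz          DOUBLE,
        num_counts  DOUBLE,
        `index`     INT,
        FOREIGN KEY (spectrum_id) REFERENCES spectra (id)
    ) ENGINE=InnoDB;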
Is this reasonable?