Can MySQL reasonably perform queries on billions of rows?

Posted by haxney on Stack Overflow, 2012-07-02.

I am planning on storing scans from a mass spectrometer in a MySQL database and would like to know whether storing and analyzing this amount of data is remotely feasible. I know performance varies wildly depending on the environment, but I'm looking for the rough order of magnitude: will queries take 5 days or 5 milliseconds?

Input format

Each input file contains a single run of the spectrometer; each run consists of a set of scans, and each scan has an ordered array of datapoints. There is a bit of metadata, but the majority of the file consists of arrays of 32- or 64-bit ints or floats.

Host system

|----------------+-------------------------------|
| OS             | Windows 2008 64-bit           |
| MySQL version  | 5.5.24 (x86_64)               |
| CPU            | 2x Xeon E5420 (8 cores total) |
| RAM            | 8GB                           |
| SSD filesystem | 500 GiB                       |
| HDD RAID       | 12 TiB                        |
|----------------+-------------------------------|

There are some other services running on the server using negligible processor time.

File statistics

|------------------+--------------|
| number of files  | ~16,000      |
| total size       | 1.3 TiB      |
| min file size    | 0 bytes      |
| max file size    | 12 GiB       |
| mean file size   | 800 MiB      |
| median file size | 500 MiB      |
| total datapoints | ~200 billion |
|------------------+--------------|

The total number of datapoints is a very rough estimate.

Proposed schema

I'm planning on doing things "right" (i.e. normalizing the data like crazy) and so would have a runs table, a spectra table with a foreign key to runs, and a datapoints table with a foreign key to spectra.

The 200 Billion datapoint question

I am going to be analyzing across multiple spectra and possibly even multiple runs, resulting in queries which could touch millions of rows. Assuming I index everything properly (which is a topic for another question) and am not trying to shuffle hundreds of MiB across the network, is it remotely plausible for MySQL to handle this?
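
For concreteness, a representative query might look something like the sketch below (the run IDs and m/z window are placeholders, and the column names come from the schema described further down):

    -- Hypothetical example: summed counts in a narrow m/z window,
    -- per spectrum, across a few selected runs.
    SELECT s.run_id,
           s.id AS spectrum_id,
           SUM(d.num_counts) AS total_counts
    FROM   datapoints d
    JOIN   spectra s ON s.id = d.spectrum_id
    WHERE  s.run_id IN (1, 2, 3)          -- placeholder run IDs
      AND  d.mz BETWEEN 400.0 AND 401.0   -- placeholder m/z window
    GROUP  BY s.run_id, s.id;

A query along these lines would have to touch every datapoints row whose mz falls in the window for the selected runs, which is exactly the access pattern I'm worried about.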

UPDATE: additional info

The scan data will be coming from files in the XML-based mzML format. The meat of this format is in the <binaryDataArrayList> elements, where the data is stored. Each scan produces two or more <binaryDataArray> elements which, taken together, form a two-dimensional (or higher) array of the form [[123.456, 234.567, ...], ...].

These data are write-once, so update performance and transaction safety are not concerns.

My naïve plan for a database schema is:

runs table

| column name | type        |
|-------------+-------------|
| id          | PRIMARY KEY |
| start_time  | TIMESTAMP   |
| name        | VARCHAR     |
|-------------+-------------|

spectra table

| column name    | type        |
|----------------+-------------|
| id             | PRIMARY KEY |
| name           | VARCHAR     |
| index          | INT         |
| spectrum_type  | INT         |
| representation | INT         |
| run_id         | FOREIGN KEY |
|----------------+-------------|

datapoints table

| column name | type        |
|-------------+-------------|
| id          | PRIMARY KEY |
| spectrum_id | FOREIGN KEY |
| mz          | DOUBLE      |
| num_counts  | DOUBLE      |
| index       | INT         |
|-------------+-------------|
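
Spelled out as MySQL DDL, that plan would be roughly the following (a sketch only: InnoDB, BIGINT surrogate keys, and VARCHAR(255) are assumptions on my part, and the index columns need backticks because INDEX is a reserved word in MySQL):

    CREATE TABLE runs (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        start_time TIMESTAMP       NOT NULL,
        name       VARCHAR(255)    NOT NULL,
        PRIMARY KEY (id)
    ) ENGINE=InnoDB;

    CREATE TABLE spectra (
        id             BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        name           VARCHAR(255)    NOT NULL,
        `index`        INT             NOT NULL,  -- position of the scan within its run
        spectrum_type  INT             NOT NULL,
        representation INT             NOT NULL,
        run_id         BIGINT UNSIGNED NOT NULL,
        PRIMARY KEY (id),
        FOREIGN KEY (run_id) REFERENCES runs (id)
    ) ENGINE=InnoDB;

    CREATE TABLE datapoints (
        id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        spectrum_id BIGINT UNSIGNED NOT NULL,
        mz          DOUBLE          NOT NULL,
        num_counts  DOUBLE          NOT NULL,
        `index`     INT             NOT NULL,  -- position of the point within its scan
        PRIMARY KEY (id),
        FOREIGN KEY (spectrum_id) REFERENCES spectra (id)
    ) ENGINE=InnoDB;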

Is this reasonable?
