What is mass spectrometry?

“Mass spectrometry is an analytical tool useful for measuring the mass-to-charge ratio (m/z) of one or more molecules present in a sample. These measurements can often be used to calculate the exact molecular weight of the sample components as well. Typically, mass spectrometers can be used to identify unknown compounds via molecular weight determination, to quantify known compounds, and to determine structure and chemical properties of molecules.”
(Source: Broad Institute)

Besides measuring real world substances with a mass spectrometer, theoretical mass spectra can also be calculated. A simple calculator using the mercury7 algorithm that runs in your browser can be found here. In any case, the output is a mass spectrum.

What does a mass spectrum look like?

(Image source: chem.libretexts.org)

As you can see, the spectrum consists of multiple peaks that show the intensity of each m/z value detected. In a relational database this data can be stored a table rows with columns m/z and intensity, however this requires reconstruction of the spectrum from this table every time the full spectrum is required.

Alternatively, the spectrum can be stored as m/z and intensity vectors. In vector notation, the above spectrum would look like this:

intensity: [10.0, 10.0, 50.0, 100.0, 48.0]

m/z: [79.0, 81.0, 158.0, 160.0, 162.0]

Modern high precision mass spectrometers can measure m/z down to four fractional digits, so a floating point data type must be used for storage.

What is a mass spectrum database good for anyway?

As said in the introduction, mass spectrometry is one, if not the tool to identify unknown compounds, to quantify known compounds, and to determine the structure of molecules. But it is a lot of work, and you need reference spectra to compare against.

So, there are curated databases of validated spectra available, like MassBank Japan, MassBank Europe and the NIST mass spectral libraries. Laboratories might also want to store their own libraries for future use.

However, such databases often come in their own formats and with their own retrieval software. If you need to efficiently connect spectra to other data, e.g. chemical structures or genomic data, this calls for central management and a common API.

And this is where PostgreSQL and its enormous extension ecosystem shine.

ARRAY vs. pgvector

The native way to handle vectors in PostgreSQL are ARRAYs. They are first class citizens of PostgreSQL and come with lots of functionality out of the box. But now there is pgvector, an extension for vector similarity search, originally designed to handle embeddings from ML models. Let’s see how pgvector compares to ARRAY.

Test Setup

A total of 109143 spectra were loaded from here in msp format, converted into SQLite3 by msp2db, and then loaded into the following TABLE:

CREATE TABLE massbase.spectra_index (
   library_spectra_meta_id int4 NOT NULL
   num_peaks int2 NULL,
   mz_range numrange NULL,
   mz float4[] NOT NULL,
   intensity float4[] NOT NULL,
   mz_vector public.vector NULL,
   intensity_vector public.vector NULL);

CREATE INDEX spectra_index_num_peaks_idx ON massbase.spectra_index USING btree (num_peaks);

The source TABLE generated by msp2db looks like this:

CREATE TABLE massbase.library_spectra (
   id int4 NULL,
   mz float4 NOT NULL,
   i float4 NOT NULL,
   other text NULL,
   library_spectra_meta_id int4 NOT NULL);

Since this uses the one table row per m/z-intensity pair format, some conversions had to be made.

Conversion into ARRAYs:

INSERT INTO massbase.spectra_index SELECT
   library_spectra_meta_id,
   array_agg(mz::numeric
ORDER BY
   mz asc) as mz,
   array_agg(i
ORDER BY
   mz ASC)
FROM
   massbase.library_spectra
GROUP BY
   library_spectra_meta_id
ORDER BY
   library_spectra_meta_id ASC;

Building the num_peaks value:

UPDATE massbase.spectra_index SET num_peaks = array_length(mz, 1);

Building the m/z range values:

UPDATE massbase.spectra_index SET mz_range = numrange(mz[1]::numeric, mz[num_peaks]::numeric, '[]');

And finally, building the vectors from the already existing ARRAYs:

UPDATE massbase.spectra_index SET mz_vector = mz;
UPDATE massbase.spectra_index SET intensity_vector = intensity;

Testing and Results

Testing consisted of running a cosine similarity / cosine distance query 10x against various implementations (SQL, pl/pgSQL, ARRAY in C, pgvector) and different PostgreSQL configurations (out-of-the-box install, tuned buffercache 2GB, INDEX on num_peaks, parallel query, and parallel query+INDEX).

The results were measured with EXPLAIN ANALYZE with milliseconds resolution, and the median and the average were taken.

The ARRAY query:

EXPLAIN ANALYZE SELECT
   library_spectra_meta_id
FROM
   massbase.spectra_index
WHERE
   num_peaks = 23
AND 
   1.0::float8 - massbase.cosine_similarity(mz, (
SELECT
   mz
FROM
   massbase.spectra_index
WHERE 
   library_spectra_meta_id = 3025
   and num_peaks = 23)) < 0.1;

The pgvector query:

EXPLAIN ANALYZE SELECT
   library_spectra_meta_id
FROM
   massbase.spectra_index
WHERE
   num_peaks = 23
AND 
   mz_vector <=> (
SELECT
   mz_vector
from
   massbase.spectra_index
WHERE 
   num_peaks = 23
AND 
   library_spectra_meta_id = 3025) < 0.1;

Out-of-the-box install, no tuning:

Run #	ARRAY pl/pgSQL	ARRAY SQL	pgvector	ARRAY C function
1	257	270	190	186
2	249	266	191	183
3	248	258	192	180
4	259	263	199	162
5	250	257	198	163
6	256	259	205	176
7	241	274	206	174
8	248	266	200	174
9	256	271	197	174
10	260	263	195	183

Median	253	264.5	197.5	175
Arithmetic Mean	252.4	264.5	197.3	175.5

Tuned buffer cache to 2 GB RAM:

Run #	ARRAY pl/pgSQL	ARRAY SQL	pgvector	ARRAY C function
1	102	112	53	105
2	113	114	62	102
3	104	119	56	94
4	103	115	54	103
5	102	113	55	96
6	109	120	54	93
7	110	117	52	94
8	107	114	57	95
9	109	117	59	101
10	102	123	56	95

Median	105.5	116	55.5	95.5
Arithmetic Mean	106.1	116	55.8	97.8

Tuned buffer cache to 2 GB RAM and btree index on num_peaks column:

Run #	ARRAY pl/pgSQL	ARRAY SQL	pgvector	ARRAY C function
1	53	55	4	5
2	61	64	4	5
3	46	56	4	5
4	49	55	6	4
5	46	57	4	4
6	49	58	4	4
7	49	64	4	6
8	50	60	3	7
9	55	60	5	4
10	47	63	4	4

Median	49	59	4	4.5
Arithmetic Mean	50.5	59	4.2	4.8

Tuned buffer cache to 2 GB RAM and parallel query:

Run #	ARRAY pl/pgSQL	ARRAY SQL	pgvector*	ARRAY C function*
1	108	108	56	89
2	107	104	53	93
3	111	107	63	97
4	106	109	56	96
5	103	109	51	89
6	102	103	57	95
7	106	106	57	91
8	107	123	57	94
9	110	112	59	91
10	107	105	60	90

Median	107	107.5	57	92
Arithmetic Mean	106.7	107.5	56.9	92.5
* never parallelized

Tuned buffer cache to 2 GB RAM and parallel query and btree index on num_peaks column (Spoiler: It never parallelized):

Run #	ARRAY pl/pgSQL	ARRAY SQL	pgvector	ARRAY C function
1	52	53	4	4
2	52	57	4	5
3	47	60	3	4
4	54	55	4	5
5	47	70	5	5
6	47	58	4	5
7	45	59	4	5
8	56	64	5	5
9	55	59	5	4
10	48	57	4	4

Median	50	58,5	4	5
Arithmetic Mean	50.3	58,5	4.2	4.6

Test system:

CPU	Intel Core-i5-10210U (4 cores / 8 threads)
RAM	8 GB
SSD	WDC PC SN730 SDBPNTY-512G-1036 (NVM Express, PCIe 3.0 x4)

Findings

With ARRAY, C functions outperform pl/pgSQL and SQL functions in all cases, ~1.44x at worst (253 ms : 175 ms), and ~10x at best (50ms : 5ms).
On an unoptimized instance, full table scans dominate query runtime
A tuned PostgreSQL server and indexing significantly accelerate queries in all cases
Parallel queries accelerate pl/pgSQL and SQL functions.
C function queries were never parallelized
With a suitable INDEX present, no query was parallelized

Conclusion

The C function using ARRAY is almost as fast as pgvector, but, because of the PostgreSQL ARRAY API, it is difficult to vectorize on CPU (SIMD: MMX, SSE, AVX, NEON).

The pgvector vector type is much better suited for SIMD processing, and it is quite easy to use it in your own C functions, only #include “vector.h“ is needed. No linking against the pgvector library itself is required. So, pgvector is also a good foundation to build your own vector similarity metrics, like spectral contrast angle or hamming distance.

Repository

All code used can be found here.

Peak Performance – pgvector vs. PostgreSQL ARRAY for mass spectrometry databases

What is mass spectrometry?

What does a mass spectrum look like?

What is a mass spectrum database good for anyway?

ARRAY vs. pgvector

Test Setup

Testing and Results

The ARRAY query:

The pgvector query:

Out-of-the-box install, no tuning:

Tuned buffer cache to 2 GB RAM:

Tuned buffer cache to 2 GB RAM and btree index on num_peaks column:

Tuned buffer cache to 2 GB RAM and parallel query:

Tuned buffer cache to 2 GB RAM and parallel query and btree index on num_peaks column (Spoiler: It never parallelized):

Test system:

Findings

Conclusion

Repository

Leave A Comment Cancel Comment

Our Projects

Quick Links

Contact Info

Follow us at

Archives

Categories