Stormatics
(678) 345-3456
380 Albert St, Melbourne, Australia

Blog Details

  • Stormatics
  • Blog
  • Peak Performance – pgvector vs. PostgreSQL ARRAY for mass spectrometry databases

Peak Performance – pgvector vs. PostgreSQL ARRAY for mass spectrometry databases

What is mass spectrometry?

“Mass spectrometry is an analytical tool useful for measuring the mass-to-charge ratio (m/z) of one or more molecules present in a sample. These measurements can often be used to calculate the exact molecular weight of the sample components as well. Typically, mass spectrometers can be used to identify unknown compounds via molecular weight determination, to quantify known compounds, and to determine structure and chemical properties of molecules.”
(Source: Broad Institute)

Besides measuring real world substances with a mass spectrometer, theoretical mass spectra can also be calculated. A simple calculator using the mercury7 algorithm that runs in your browser can be found here. In any case, the output is a mass spectrum.

What does a mass spectrum look like?

(Image source: chem.libretexts.org)
 

As you can see, the spectrum consists of multiple peaks that show the intensity of each m/z value detected. In a relational database this data can be stored a table rows with columns m/z and intensity, however this requires reconstruction of the spectrum from this table every time the full spectrum is required.

Alternatively, the spectrum can be stored as m/z and intensity vectors. In vector notation, the above spectrum would look like this:

intensity: [10.0, 10.0, 50.0, 100.0, 48.0]
m/z: [79.0, 81.0, 158.0, 160.0, 162.0]

Modern high precision mass spectrometers can measure m/z down to four fractional digits, so a floating point data type must be used for storage.

What is a mass spectrum database good for anyway?

As said in the introduction, mass spectrometry is one, if not the tool to identify unknown compounds, to quantify known compounds, and to determine the structure of molecules. But it is a lot of work, and you need reference spectra to compare against.

So, there are curated databases of validated spectra available, like MassBank Japan, MassBank Europe and the NIST mass spectral libraries. Laboratories might also want to store their own libraries for future use.

However, such databases often come in their own formats and with their own retrieval software. If you need to efficiently connect spectra to other data, e.g. chemical structures or genomic data, this calls for central management and a common API.

And this is where PostgreSQL and its enormous extension ecosystem shine.

ARRAY vs. pgvector

The native way to handle vectors in PostgreSQL are ARRAYs. They are first class citizens of PostgreSQL and come with lots of functionality out of the box. But now there is pgvector, an extension for vector similarity search, originally designed to handle embeddings from ML models. Let’s see how pgvector compares to ARRAY.

Test Setup

A total of 109143 spectra were loaded from here in msp format, converted into SQLite3 by msp2db, and then loaded into the following TABLE:

CREATE TABLE massbase.spectra_index (
library_spectra_meta_id int4 NOT NULL
num_peaks int2 NULL,
mz_range numrange NULL,
mz float4[] NOT NULL,
intensity float4[] NOT NULL,
mz_vector public.vector NULL,
intensity_vector public.vector NULL);

CREATE INDEX spectra_index_num_peaks_idx ON massbase.spectra_index USING btree (num_peaks);

The source TABLE generated by msp2db looks like this:

CREATE TABLE massbase.library_spectra (
id int4 NULL,
mz float4 NOT NULL,
i float4 NOT NULL,
other text NULL,
library_spectra_meta_id int4 NOT NULL);

Since this uses the one table row per m/z-intensity pair format, some conversions had to be made.

Conversion into ARRAYs:

INSERT INTO massbase.spectra_index SELECT
library_spectra_meta_id,
array_agg(mz::numeric
ORDER BY
mz asc) as mz,
array_agg(i
ORDER BY
mz ASC)
FROM
massbase.library_spectra
GROUP BY
library_spectra_meta_id
ORDER BY
library_spectra_meta_id ASC;

Building the num_peaks value:

UPDATE massbase.spectra_index SET num_peaks = array_length(mz, 1);

Building the m/z range values:

UPDATE massbase.spectra_index SET mz_range = numrange(mz[1]::numeric, mz[num_peaks]::numeric, '[]');

And finally, building the vectors from the already existing ARRAYs:

UPDATE massbase.spectra_index SET mz_vector = mz;
UPDATE massbase.spectra_index SET intensity_vector = intensity;

Testing and Results

Testing consisted of running a cosine similarity / cosine distance query 10x against various implementations (SQL, pl/pgSQL, ARRAY in C, pgvector) and different PostgreSQL configurations (out-of-the-box install, tuned buffercache 2GB, INDEX on num_peaks, parallel query, and parallel query+INDEX).

The results were measured with EXPLAIN ANALYZE with milliseconds resolution, and the median and the average were taken.

The ARRAY query:

EXPLAIN ANALYZE SELECT
library_spectra_meta_id
FROM
massbase.spectra_index
WHERE
num_peaks = 23
AND
1.0::float8 - massbase.cosine_similarity(mz, (
SELECT
mz
FROM
massbase.spectra_index
WHERE
library_spectra_meta_id = 3025
and num_peaks = 23)) < 0.1;

The pgvector query:

EXPLAIN ANALYZE SELECT
library_spectra_meta_id
FROM
massbase.spectra_index
WHERE
num_peaks = 23
AND
mz_vector <=> (
SELECT
mz_vector
from
massbase.spectra_index
WHERE
num_peaks = 23
AND
library_spectra_meta_id = 3025) < 0.1;

Out-of-the-box install, no tuning:

Run #

ARRAY pl/pgSQL

ARRAY SQL

pgvector

ARRAY C function

1

257

270

190

186

2

249

266

191

183

3

248

258

192

180

4

259

263

199

162

5

250

257

198

163

6

256

259

205

176

7

241

274

206

174

8

248

266

200

174

9

256

271

197

174

10

260

263

195

183

     

Median

253

264.5

197.5

175

Arithmetic Mean

252.4

264.5

197.3

175.5

Tuned buffer cache to 2 GB RAM:

Run #

ARRAY pl/pgSQL

ARRAY SQL

pgvector

ARRAY C function

1

102

112

53

105

2

113

114

62

102

3

104

119

56

94

4

103

115

54

103

5

102

113

55

96

6

109

120

54

93

7

110

117

52

94

8

107

114

57

95

9

109

117

59

101

10

102

123

56

95

     

Median

105.5

116

55.5

95.5

Arithmetic Mean

106.1

116

55.8

97.8

Tuned buffer cache to 2 GB RAM and btree index on num_peaks column:

Run #

ARRAY pl/pgSQL

ARRAY SQL

pgvector

ARRAY C function

1

53

55

4

5

2

61

64

4

5

3

46

56

4

5

4

49

55

6

4

5

46

57

4

4

6

49

58

4

4

7

49

64

4

6

8

50

60

3

7

9

55

60

5

4

10

47

63

4

4

     

Median

49

59

4

4.5

Arithmetic Mean

50.5

59

4.2

4.8

Tuned buffer cache to 2 GB RAM and parallel query:

Run #

ARRAY pl/pgSQL

ARRAY SQL

pgvector*

ARRAY C function*

1

108

108

56

89

2

107

104

53

93

3

111

107

63

97

4

106

109

56

96

5

103

109

51

89

6

102

103

57

95

7

106

106

57

91

8

107

123

57

94

9

110

112

59

91

10

107

105

60

90

     

Median

107

107.5

57

92

Arithmetic Mean

106.7

107.5

56.9

92.5

* never parallelized

    

Tuned buffer cache to 2 GB RAM and parallel query and btree index on num_peaks column (Spoiler: It never parallelized):

Run #

ARRAY pl/pgSQL

ARRAY SQL

pgvector

ARRAY C function

1

52

53

4

4

2

52

57

4

5

3

47

60

3

4

4

54

55

4

5

5

47

70

5

5

6

47

58

4

5

7

45

59

4

5

8

56

64

5

5

9

55

59

5

4

10

48

57

4

4

     

Median

50

58,5

4

5

Arithmetic Mean

50.3

58,5

4.2

4.6

Test system:

CPU

Intel Core-i5-10210U (4 cores / 8 threads)

RAM

8 GB

SSD

WDC PC SN730 SDBPNTY-512G-1036 (NVM Express, PCIe 3.0 x4)

Findings

  1. With ARRAY, C functions outperform pl/pgSQL and SQL functions in all cases, ~1.44x at worst (253 ms : 175 ms), and ~10x at best (50ms : 5ms).
  2. On an unoptimized instance, full table scans dominate query runtime
  3. A tuned PostgreSQL server and indexing significantly accelerate queries in all cases
  4. Parallel queries accelerate pl/pgSQL and SQL functions.
  5. C function queries were never parallelized
  6. With a suitable INDEX present, no query was parallelized

Conclusion

The C function using ARRAY is almost as fast as pgvector, but, because of the PostgreSQL ARRAY API, it is difficult to vectorize on CPU (SIMD: MMX, SSE, AVX, NEON).

The pgvector vector type is much better suited for SIMD processing, and it is quite easy to use it in your own C functions, only #include “vector.h“ is needed. No linking against the pgvector library itself is required. So, pgvector is also a good foundation to build your own vector similarity metrics, like spectral contrast angle or hamming distance.

Repository

All code used can be found here.

Leave A Comment