SPIKE: vector store benchmarking
Benchmarking and results
The benchmark runs three queries:
- get 5 nearest vectors for a project that has the same number of records as gitlab-org/gitlab, simulating a project semantic search for gitlab
- get 5 nearest vectors for a group that has the same number of records as gitlab-org, simulating a group semantic search for gitlab-org
- get 5 nearest vectors without any filters
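The three queries above can be sketched as Elasticsearch-style kNN request bodies. This is a hedged sketch: the field name (`embedding`), filter fields (`project_id`, `namespace_id`), and the placeholder IDs are hypothetical, not taken from the spike.

```python
# Sketch of the three benchmark queries as Elasticsearch-style kNN request
# bodies. Field and filter names are hypothetical; the spike's real names
# may differ.

def knn_query(query_vector, filters=None):
    """Build a kNN search body returning the 5 nearest vectors."""
    knn = {
        "field": "embedding",
        "query_vector": query_vector,
        "k": 5,
        "num_candidates": 100,  # assumed candidate pool size
    }
    if filters:
        knn["filter"] = {"term": filters}
    return {"knn": knn}

vec = [0.1] * 768  # the same query embedding is reused across all runs

# placeholder IDs, standing in for a gitlab-org/gitlab-sized project
# and a gitlab-org-sized group
project_query = knn_query(vec, {"project_id": 123})
group_query = knn_query(vec, {"namespace_id": 456})
unfiltered_query = knn_query(vec)
```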
The same embeddings are used for the query vector across all benchmark runs.
In the benchmark we don't consider the cold cache, since a warmed cache is what will be used in the vast majority of cases, so we warm up before starting the benchmark.
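The warm-up-then-measure loop described above can be sketched as follows; the warm-up and run counts are assumptions, not values from the spike.

```python
import time

def benchmark(run_query, warmup=5, runs=50):
    """Warm the cache first, then time the query; warm-up runs are discarded."""
    for _ in range(warmup):
        run_query()  # untimed: populates caches so cold-cache noise is excluded
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return durations
```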
The benchmark calculates a couple of performance-related metrics. It does not measure accuracy, recall, etc.:
- Min, max, mean, median duration
- Standard deviation and outliers (±1 stddev from the mean; not shown in the table below)
- Operations per second, computed as 1 / mean duration
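The metrics above can be computed from a list of per-run durations with the standard library; a minimal sketch:

```python
import statistics

def summarize(durations):
    """Compute the spike's metrics from a list of query durations in seconds."""
    mean = statistics.mean(durations)
    stddev = statistics.stdev(durations)
    return {
        "min": min(durations),
        "max": max(durations),
        "mean": mean,
        "median": statistics.median(durations),
        "stddev": stddev,
        # outliers: runs more than 1 stddev away from the mean
        "outliers": [d for d in durations if abs(d - mean) > stddev],
        # operations per second, computed as 1 / mean
        "ops_per_sec": 1 / mean,
    }
```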
Full results are stored in this folder.
Interpretation
For a small GitLab instance with relatively few documents, Elasticsearch is orders of magnitude faster than postgres.
3K Reference Architecture results - 3M documents
5K Reference Architecture - 10M documents
Postgres index concerns
Write throughput
Apart from the big performance difference, we must acknowledge the write-throughput problems that come with a big HNSW index in postgres. Given that postgres could only process a few hundred write ops/s, updates could be backed up severely, especially for frequently updated data like merge requests or code.
Index build time
We could not create an HNSW postgres index for 5M documents on any of GitLab's existing reference architecture machines, even an n2-standard-32 (128GB). Using an n2-standard-64 (256GB) machine, we were able to create the index in less than a day, but we also had to bump maintenance_work_mem to 96GB, which would probably not be recommended for production use cases.
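The index build that eventually succeeded on the n2-standard-64 machine would look roughly like the statements below. This is a sketch: only the 96GB maintenance_work_mem value comes from the spike; the table and column names are hypothetical, and the graph parameters are the shared ones listed in the setup section.

```python
# Hedged sketch of the pgvector HNSW index build. Only the 96GB
# maintenance_work_mem figure is from the spike; documents/embedding
# are hypothetical names.
index_build = [
    "SET maintenance_work_mem = '96GB';",
    "CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops) "
    "WITH (m = 16, ef_construction = 100);",
]
```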
Setup
We took a dataset from Hugging Face with 35 million records pulled from Wikipedia, with text embeddings generated at 768 dimensions: https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings.
We created large VMs with 32 CPU cores and 128GB memory to read this dataset, and assigned permissions to each record to produce a distribution of projects and groups. We then loaded the data into a postgres database and an Elasticsearch index. These were copied to VMs running GitLab reference architectures, and we ran the same benchmarks against both, starting with the 3K reference architecture.
The same resources are allocated and the same data is used for both.
- Elasticsearch: version 8.15.1. Using an HNSW graph with a cosine knn query containing filters for project/group.
- Postgres: version 16, pgvector version 0.7.4. Using an HNSW index with cosine distance.
Both use HNSW graphs to efficiently find similar vectors. We used the same graph parameters for both ES and pg:
- m = 16
- ef_construction = 100
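On the Elasticsearch side, these shared graph parameters can be expressed in the dense_vector field mapping; a sketch, with the field name (`embedding`) being a hypothetical choice:

```python
# Sketch of an Elasticsearch dense_vector mapping carrying the spike's
# shared HNSW parameters. The field name is hypothetical.
es_mapping = {
    "properties": {
        "embedding": {
            "type": "dense_vector",
            "dims": 768,
            "similarity": "cosine",
            "index_options": {
                "type": "hnsw",
                "m": 16,
                "ef_construction": 100,
            },
        }
    }
}
```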
Dataset
We ran into problems getting 35M records into postgres with an HNSW index. Doing writes into a table with an HNSW index is excruciatingly slow: even on a powerful machine (see specs above), throughput drops to about 100 ops/s.
To do updates on the table, we dropped the index, which increased the throughput. We inserted 27M documents and then tried to build the index after the data was ingested; it ran for longer than 4 hours, maxing out all 32 CPUs. We decided to kill it because there was no indication of how long the full index build would take.
Instead, we decided to start at 3M records and see what postgres can handle in a reasonable timeframe.


