Vector store comparison


Purpose

Keep a record of known advantages and disadvantages of evaluated vector stores.

Evaluated vector stores

Store: Elasticsearch and OpenSearch
Project filter kNN duration*: 12.0 ms
Group filter kNN duration*: 14.6 ms
Advantages:
  • Supports filters (important for any data that is not public).
  • Has an existing code path that efficiently indexes and searches vectors at scale in production.
  • Faster query time than pgvector.
  • Supports hybrid search: kNN + keyword search.
Disadvantages:
  • Not all GitLab customers run Elasticsearch: it requires additional cost, maintenance and legal approval.
References:
  • Benchmark

Store: PGVector
Project filter kNN duration*: 517.1 ms
Group filter kNN duration*: 521.3 ms
Advantages:
  • All GitLab customers already run PostgreSQL.
  • Fast and accurate search for small datasets without using an index. The maximum number of documents before query time exceeds several seconds is:
    • 50k documents for the 3K ref architecture
    • 1M documents for the 50K ref architecture
Disadvantages:
  • Does not support filters when using an index.
  • Without an index, on the largest GitLab ref architecture (50K), it takes 4 minutes to search through 20 million records and 6 seconds to search through 5 million records. Smaller ref architectures take significantly longer. (Link)
  • Write throughput is too slow for production use when using an index. (Link)
  • Does not support hybrid search when an HNSW index is present.
References:
  • Benchmark
  • PGVector issues

* Durations are mean query durations measured on 5K ref architecture machines with a dataset of 10 million vectors.
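
To make the "project filter kNN" queries compared above more concrete, here is a minimal sketch of what such a query might look like in each store. It is illustrative only and is not the code used in the benchmark. The field and table names (`embedding`, `project_id`, `content`, `embeddings`) and the helper functions are hypothetical. The Elasticsearch body follows the 8.x top-level `knn` search option (OpenSearch uses a different query shape), and the pgvector statement assumes cosine distance (`<=>`) with no index, so that PostgreSQL performs an exact scan over the filtered rows.

```python
def es_knn_body(query_vector, project_id, k=10, num_candidates=100, keywords=None):
    """Build an Elasticsearch 8.x style search body: kNN restricted by a filter.

    Field names are hypothetical. If `keywords` is given, a keyword `match`
    query is added next to the kNN clause (hybrid search).
    """
    body = {
        "knn": {
            "field": "embedding",                      # hypothetical vector field
            "query_vector": [float(x) for x in query_vector],
            "k": k,
            "num_candidates": num_candidates,
            # The filter limits which documents are eligible as neighbours.
            "filter": {"term": {"project_id": project_id}},
        },
        "size": k,
    }
    if keywords is not None:
        body["query"] = {"match": {"content": keywords}}  # hypothetical text field
    return body


def pgvector_literal(vector):
    """Format a Python sequence as a pgvector text literal, e.g. '[1.0,2.5]'."""
    return "[" + ",".join(str(float(x)) for x in vector) + "]"


def pgvector_exact_sql(table="embeddings"):
    """Return a parameterised exact (index-free) filtered kNN statement.

    Table and column names are hypothetical; `<=>` is pgvector's cosine
    distance operator. Placeholders use the psycopg named-parameter style.
    """
    return (
        f"SELECT id FROM {table} "
        "WHERE project_id = %(project_id)s "
        "ORDER BY embedding <=> %(query_vector)s::vector "
        "LIMIT %(k)s"
    )


if __name__ == "__main__":
    vec = [0.1, 0.2, 0.3]
    print(es_knn_body(vec, project_id=42, k=5))
    print(pgvector_exact_sql())
    print({"project_id": 42, "query_vector": pgvector_literal(vec), "k": 5})
```

In this sketch the filter is part of the kNN clause on the Elasticsearch side, while on the pgvector side it is an ordinary `WHERE` clause that is only exact because no vector index is involved, which matches the trade-off recorded in the comparison.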

Potential vector stores to evaluate
