Draft: POC: Local docs search with bm25

Eduardo Bonet requested to merge custom_models/bm25_docs_search into main

Relates to gitlab-org/gitlab#467484 (closed)

Problem to solve

We will support GitLab Duo for self-managed customers, and that includes Duo answering questions about the documentation. Because these instances are air-gapped, Vertex AI Search is not an option.

This MR is a proof of concept for in-memory indexing of the GitLab documentation using BM25. The index takes around 55 MB.
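
For readers unfamiliar with BM25, the idea can be sketched in a few lines of plain Python. This is an illustration only, not the code in this MR: the class name `BM25Index`, the whitespace tokenization, and the parameter defaults `k1=1.5` and `b=0.75` are all assumptions made for the example.

```python
import math
from collections import Counter


class BM25Index:
    """Tiny in-memory Okapi BM25 index (hypothetical, for illustration)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        # Naive whitespace tokenization; punctuation stays glued to words.
        self.term_freqs = [Counter(doc.lower().split()) for doc in docs]
        self.doc_len = [sum(tf.values()) for tf in self.term_freqs]
        self.avg_len = sum(self.doc_len) / len(docs)
        # Document frequency: in how many documents each term occurs.
        df = Counter()
        for tf in self.term_freqs:
            df.update(tf.keys())
        n = len(docs)
        # IDF variant with +1 inside the log, so weights are never negative.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        """BM25 score of document i for a whitespace-tokenized query."""
        tf = self.term_freqs[i]
        # Length normalization: longer-than-average documents are penalized.
        norm = self.k1 * (1 - self.b + self.b * self.doc_len[i] / self.avg_len)
        return sum(
            self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in query.lower().split()
            if t in tf
        )

    def search(self, query, page_size=4):
        """Return up to page_size documents with a positive score, best first."""
        scored = [(self.score(query, i), i) for i in range(len(self.docs))]
        hits = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
        return [self.docs[i] for _, i in hits[:page_size]]
```

With this naive tokenizer, `search("issue?")` finds nothing even when a document contains the word "issue", because the token `issue?` never appears in the index. That is one plausible cause of the query sensitivity discussed under Downsides.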

Discussion

Upsides:

  • Indexes are generated very quickly, with no dependencies on extra tools
  • Small footprint: no LLM is needed, unlike an embedding-based approach
  • The index files can be pre-generated, then either downloaded or rebuilt when the instance is updated

Downsides:

  • Results seem to be heavily affected by the exact wording of the query, so cleaning the query would be advisable. For example, searching for 'create issue' brings useful responses, but 'create issue?' doesn't
  • Not scalable to customer-generated data
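
A minimal sketch of the query cleaning mentioned above, assuming that lowercasing and keeping only alphanumeric runs is enough. The helper name `clean_query` is hypothetical and does not exist in this MR.

```python
import re


def clean_query(query: str) -> list[str]:
    """Lowercase the query and keep only runs of ASCII letters and digits."""
    # Punctuation such as a trailing '?' is dropped, so it can no longer
    # produce a token that is missing from the index.
    return re.findall(r"[a-z0-9]+", query.lower())
```

Under this assumption, 'create issue' and 'create issue?' produce the same tokens and would therefore rank documents identically. The same normalization would have to be applied to the corpus at indexing time.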

Alternatives:

Improvements:

  • The corpus can be heavily improved: tokenizing words, removing links, removing punctuation, etc.
  • Fetch the corpus from a different place
  • Generate the index at startup, rather than at first request
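
The corpus clean-up could look roughly like the sketch below. It assumes the document text is Markdown and that links use the `[label](url)` form. The function name `clean_doc` is hypothetical, and the real structure of `processed_docs.json` is not considered here.

```python
import re

# Matches [label](url) and ![alt](url); group 1 captures the label or alt text.
MD_LINK = re.compile(r"!?\[([^\]]*)\]\([^)]*\)")
# Matches bare http(s) URLs up to the next whitespace.
BARE_URL = re.compile(r"https?://\S+")


def clean_doc(text: str) -> list[str]:
    """Strip links and punctuation from Markdown text and return its tokens."""
    text = MD_LINK.sub(r"\1", text)  # keep the link label, drop the target
    text = BARE_URL.sub(" ", text)   # drop URLs that are not part of a link
    # Lowercase and keep only alphanumeric runs, which removes punctuation.
    return re.findall(r"[a-z0-9]+", text.lower())
```

Keeping the link label while dropping the target stops path fragments such as `issues.md` from entering the index as terms, while the human-readable text remains searchable.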

Next steps:

  • Encode this in the self-hosted models blueprint
  • Evaluate this solution using CEF
  • Reimplement in production-grade code

Steps to reproduce

Reproducing:

  1. Enable self-hosted models in env.

    AIGW_CUSTOM_MODELS__ENABLED=True
  2. Download processed_docs.json and place it under tmp/processed_docs.json

    mkdir tmp
    curl https://gitlab.com/-/project/39903947/uploads/aaad4a510339d3b8e2962e5d860300c9/processed_docs.json -o tmp/processed_docs.json
  3. Make a query (either on http://localhost:5052/docs#/search/docs_v1_search_gitlab_docs_post or in the terminal)

    curl -X 'POST' \
      'http://localhost:5052/v1/search/gitlab-docs' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "type": "string",
      "metadata": {
        "source": "string",
        "version": "17.0.0-pre"
      },
      "payload": {
        "query": "create an issue",
        "page_size": 4
      }
    }'
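
The same request can be sent from Python using only the standard library. This is a sketch that mirrors the curl command above. The helper names `build_search_request` and `search_docs` are hypothetical, and the response is returned as parsed JSON without assuming anything about its shape.

```python
import json
import urllib.request

SEARCH_URL = "http://localhost:5052/v1/search/gitlab-docs"


def build_search_request(query, page_size=4, version="17.0.0-pre"):
    """Build the POST request with the same body as the curl example."""
    body = {
        "type": "string",
        "metadata": {"source": "string", "version": version},
        "payload": {"query": query, "page_size": page_size},
    }
    return urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def search_docs(query, page_size=4):
    """Send the request to a locally running instance and parse the JSON reply."""
    with urllib.request.urlopen(build_search_request(query, page_size)) as resp:
        return json.load(resp)
```

With the service running locally, `search_docs("create an issue")` would issue the same query as step 3.
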
Edited by Eduardo Bonet