Draft: POC: Local docs search with bm25
Relates to gitlab-org/gitlab#467484 (closed)
Problem to solve
We will support GitLab Duo for self-managed customers, and that includes Duo answering questions about documentation. Because these instances are air-gapped, Vertex AI Search is not an option.
This MR is a proof of concept for an in-memory index of the GitLab documentation using BM25. The index takes around 55 MB.
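For readers unfamiliar with BM25, the following is a minimal sketch of an in-memory Okapi BM25 index in plain Python. It is not the code in this MR: the class name `BM25Index`, the `k1` and `b` defaults, the smoothed IDF variant, and the three toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


class BM25Index:
    """Illustrative in-memory Okapi BM25 index over pre-tokenized documents."""

    def __init__(self, tokenized_docs, k1=1.5, b=0.75):
        # k1 and b are common defaults, not values taken from the MR.
        self.k1 = k1
        self.b = b
        self.doc_lengths = [len(doc) for doc in tokenized_docs]
        self.avg_length = sum(self.doc_lengths) / len(tokenized_docs)
        # Term frequencies per document.
        self.term_freqs = [Counter(doc) for doc in tokenized_docs]
        # Number of documents that contain each term.
        doc_freq = Counter()
        for freqs in self.term_freqs:
            doc_freq.update(freqs.keys())
        n = len(tokenized_docs)
        # Smoothed IDF; the "1 +" inside the log keeps every weight positive.
        self.idf = {
            term: math.log(1 + (n - df + 0.5) / (df + 0.5))
            for term, df in doc_freq.items()
        }

    def score(self, query_tokens, doc_id):
        freqs = self.term_freqs[doc_id]
        # Length normalization: longer-than-average documents are penalized.
        norm = 1 - self.b + self.b * self.doc_lengths[doc_id] / self.avg_length
        total = 0.0
        for term in query_tokens:
            tf = freqs.get(term, 0)
            if tf == 0:
                continue
            total += self.idf[term] * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return total

    def search(self, query_tokens, page_size=4):
        scored = [(self.score(query_tokens, i), i) for i in range(len(self.term_freqs))]
        hits = [(s, i) for s, i in scored if s > 0]
        # Highest score first; ties broken by document id.
        hits.sort(key=lambda pair: (-pair[0], pair[1]))
        return [i for _, i in hits[:page_size]]


# Toy corpus, tokenized by whitespace only.
docs = [
    "create an issue from the issues page".split(),
    "configure a runner for your project".split(),
    "merge requests and code review".split(),
]
index = BM25Index(docs)
top = index.search("create issue".split())
```

Because scoring relies on exact token matches, a token such as `issue?` finds nothing in this sketch unless the query is cleaned first.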
Discussion
Upsides:
- Indexes are generated very quickly, with no dependencies on extra tools
- Small footprint: no LLM is needed, unlike an embedding-based approach
- We can pre-generate the index files; they could either be downloaded or generated when the instance is updated
Downsides:
- Results seem to be heavily affected by the exact query text, so cleaning the query would be advisable. For example, searching for 'create issue' returns useful results, but 'create issue?' doesn't
- Not scalable to customer-generated data
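The query sensitivity described above could be reduced with a small normalization step applied before searching. The sketch below is one possible approach, not something implemented in this MR; the function name `normalize_query` is hypothetical, and it only handles ASCII punctuation.

```python
import string

# Map every ASCII punctuation character to a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize_query(query):
    """Lowercase the query, replace ASCII punctuation with spaces, and split on whitespace."""
    return query.lower().translate(_PUNCT_TABLE).split()


tokens_plain = normalize_query("create issue")
tokens_question = normalize_query("Create issue?")
```

With this step, both spellings of the query produce the same tokens. The same normalization would need to be applied to the corpus at indexing time, and it splits hyphenated terms such as `self-hosted` into two tokens, which may or may not be desirable.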
Alternatives:
- We can ship a prebuilt SQLite database with BM25 and the whole index. That way customers don't need to reindex, we can build one per version, and storage is still optimized. SQLite is pretty powerful for this use case, and we can extend it later, even for vector search. Generating this database is also pretty quick, as shown in https://gitlab.com/gitlab-org/ai-powered/custom-models/pocs/gitlab-docs-indexer-poc/-/blob/main/notebooks/indexer.ipynb
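As a rough illustration of the SQLite alternative, the sketch below uses SQLite's FTS5 extension, which provides a built-in `bm25()` ranking function. It is not taken from the linked notebook: the table name `docs`, its columns, the helper names, and the sample rows are assumptions, and it requires a Python build whose SQLite includes FTS5.

```python
import sqlite3


def build_docs_db(path, docs):
    """Create an FTS5 table and load (title, content) rows into it."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, content)")
    conn.executemany("INSERT INTO docs (title, content) VALUES (?, ?)", docs)
    conn.commit()
    return conn


def search_docs(conn, query, page_size=4):
    """Return titles of matching rows, best match first."""
    # bm25() returns smaller values for better matches, so ascending order
    # puts the most relevant rows first.
    rows = conn.execute(
        "SELECT title FROM docs WHERE docs MATCH ? ORDER BY bm25(docs) LIMIT ?",
        (query, page_size),
    )
    return [title for (title,) in rows]


# ":memory:" keeps the example self-contained; a shipped database would be a file.
conn = build_docs_db(
    ":memory:",
    [
        ("Create an issue", "You can create an issue from the issues page."),
        ("Runners", "Configure a runner for your project."),
    ],
)
titles = search_docs(conn, "create issue")
```

Raw user input would still need cleaning here, because characters such as `?` are not valid in an unquoted FTS5 query term.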
Improvements:
- The corpus can be heavily improved: tokenizing words, removing links, removing punctuation, etc.
- Fetch the corpus from a different place
- Generate the embedding at startup, rather than at first request
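The corpus cleanup mentioned in the first improvement could look roughly like the sketch below. It assumes the documentation is Markdown and that link text is worth keeping while link targets are not; the function name `clean_document` and the regular expressions are illustrative, not part of this MR.

```python
import re

# [text](target) and ![alt](target): keep the text, drop the target.
_MD_LINK = re.compile(r"!?\[([^\]]*)\]\([^)]*\)")
# Bare URLs that are not part of a Markdown link.
_BARE_URL = re.compile(r"https?://\S+")
# Anything that is neither a word character nor whitespace.
_NON_WORD = re.compile(r"[^\w\s]")


def clean_document(markdown_text):
    """Strip links and punctuation from Markdown and return lowercase tokens."""
    text = _MD_LINK.sub(r"\1", markdown_text)
    text = _BARE_URL.sub(" ", text)
    text = _NON_WORD.sub(" ", text.lower())
    return text.split()


tokens = clean_document(
    "To [create an issue](../issues/create.md), see https://docs.gitlab.com/ee/user/."
)
```

Whatever tokenization is chosen for the corpus, queries would need to go through the same steps so that tokens line up at search time.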
Next steps:
- Encode this in the self-hosted models blueprint
- Evaluate this solution using CEF
- Reimplement in production-grade code
Steps to reproduce
Reproducing:
- Enable self-hosted models in env:
AIGW_CUSTOM_MODELS__ENABLED=True
- Download processed_docs.json and place it under tmp/processed_docs.json:
mkdir tmp
curl https://gitlab.com/-/project/39903947/uploads/aaad4a510339d3b8e2962e5d860300c9/processed_docs.json -o tmp/processed_docs.json
- Make a query (either on http://localhost:5052/docs#/search/docs_v1_search_gitlab_docs_post or in the terminal):
curl -X 'POST' \
  'http://localhost:5052/v1/search/gitlab-docs' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "type": "string",
    "metadata": {
      "source": "string",
      "version": "17.0.0-pre"
    },
    "payload": {
      "query": "create an issue",
      "page_size": 4
    }
  }'
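The same request can be built from Python with only the standard library. This is a sketch mirroring the curl command above; the helper name `build_search_request` is hypothetical, and the response format is not assumed here.

```python
import json
import urllib.request


def build_search_request(query, page_size=4, base_url="http://localhost:5052"):
    """Build the POST request for the docs search endpoint without sending it."""
    body = {
        "type": "string",
        "metadata": {"source": "string", "version": "17.0.0-pre"},
        "payload": {"query": query, "page_size": page_size},
    }
    return urllib.request.Request(
        f"{base_url}/v1/search/gitlab-docs",
        data=json.dumps(body).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("create an issue")

# To send it against a locally running instance:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```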
Edited by Eduardo Bonet