Improve sqlite/bm25 search/indexer for GitLab Docs

Overview

When custom models are enabled for the AI Gateway, we use sqlite/bm25 search to retrieve related information from the GitLab documentation. However, the results the search returns are often not the most relevant ones.

Reproduce

  • Generate the docs index: python3 scripts/custom_models/create_index.py -o tmp/docs.db
  • Install the SQLite gem: gem install sqlite3
  • Run the following in irb:
require 'sqlite3'

db = SQLite3::Database.new "tmp/docs.db"
# Full-text search on the processed column; ordering by bm25() puts the best-ranked matches first.
db.execute("SELECT metadata, content FROM doc_index WHERE processed MATCH ? ORDER BY bm25(doc_index) LIMIT ?", ['How do I change user password in GitLab', 20])

We ask for the documentation pages related to the query "How do I change user password in GitLab". The following results are returned:

  • {"Header1": "Installing a GitLab POC on Amazon Web Services (AWS)", "Header2": "Install GitLab and create custom AMI", "Header3": "Sign in for the first time", "filename": "doc/install/aws/index.md"}
  • {"Header1": "Rails console", "Header2": "Query the database using an Active Record model", "Header3": "Modifying Active Record objects", "filename": "doc/administration/operations/rails_console.md"}
  • {"Header1": "Troubleshooting repository mirroring", "Header2": "Deadline Exceeded", "filename": "doc/user/project/repository/mirror/troubleshooting.md"}

The relevant page, https://docs.gitlab.com/ee/user/profile/user_passwords.html#change-your-password, is not returned at all.

The same happens with the query "How do I fork a project".
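
One factor that may be worth checking, assuming doc_index is an SQLite FTS5 table (the bm25() call suggests it is): FTS5 treats whitespace-separated terms in a MATCH expression as an implicit AND, so the raw question only matches rows that contain every word, including "How", "do" and "I". The sketch below is not taken from the AI Gateway code; build_fts5_query and STOP_WORDS are hypothetical names, and the stop-word list is a placeholder. It shows one way to turn a question into an OR query over the meaningful terms:

```ruby
# Hypothetical helper, not part of the AI Gateway code base.
# Placeholder stop-word list; a real one would be longer.
STOP_WORDS = %w[a an the how do does i in on of to for is are can my].freeze

# Turns a natural-language question into an FTS5 query string that
# ORs the remaining terms instead of requiring all of them to match.
def build_fts5_query(question, stop_words: STOP_WORDS)
  terms = question.downcase.scan(/[a-z0-9]+/).uniq - stop_words
  # Double quotes make each term an FTS5 string, so it is never
  # parsed as an operator such as AND, OR, NOT or NEAR.
  terms.map { |term| %("#{term}") }.join(' OR ')
end

query = build_fts5_query('How do I change user password in GitLab')
puts query
# => "change" OR "user" OR "password" OR "gitlab"
```

The resulting string can be bound in place of the raw question in the db.execute call from the Reproduce section. If the processed column stores stemmed or otherwise normalized text, the query terms would presumably need the same normalization before they can match.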

Expected behavior

The search should return the most relevant documentation pages for the query.
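
The expectation above could be turned into a repeatable check while tuning the indexer. The following is a minimal sketch, not an existing test: hit_in_top_k? is a hypothetical helper, the rows are assumed to be [metadata, content] pairs as returned by the query in the Reproduce section, and doc/user/profile/user_passwords.md is an assumed source path for the expected page.

```ruby
require 'json'

# Hypothetical helper: true if any of the first k result rows comes
# from the expected documentation file. Each row is assumed to be
# [metadata_json, content].
def hit_in_top_k?(rows, expected_filename, k: 3)
  rows.first(k).any? do |metadata, _content|
    JSON.parse(metadata)['filename'] == expected_filename
  end
end

# Sample rows shaped like the results reported in this issue (content omitted).
rows = [
  ['{"Header1": "Rails console", "filename": "doc/administration/operations/rails_console.md"}', ''],
  ['{"Header1": "Troubleshooting repository mirroring", "filename": "doc/user/project/repository/mirror/troubleshooting.md"}', '']
]

# Assumed source path for the user_passwords.html page.
puts hit_in_top_k?(rows, 'doc/user/profile/user_passwords.md')
# => false
```

Running such a check over a small set of question and expected-page pairs would give a simple measure of whether a change to the indexer or the query actually improves the results.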