
Add Tanuki Bot indexer

Terri Chu requested to merge tchu-tanuki-bot-indexer into master

What does this MR do and why?

Repopulates the Embedding::TanukiBotMvc table on a schedule, because the table is built from GitLab documentation, which is constantly being updated.

Because every record needs an embedding, which is obtained by making an API call to OpenAI, and there are 2000 records, it is not a good idea to drop and recreate everything in one transaction. Instead, we opt for a spread-out approach using a new version field on the table:

We have one coordinating cron worker which:

  1. Reads every file (currently only the /doc/ directory, but this could be expanded)
  2. Parses the content of each file into metadata and pieces of text within character-length boundaries
  3. For each piece of text, creates a new record with version = current version + 1 and schedules a worker which:
    1. Updates the record's embedding by making a single OpenAI API call
    2. Checks whether all new records have been updated and, if so, updates the current version in Redis

We also schedule a cron worker to run daily which deletes previous records (version < current version).

Therefore we have 3 workers:

  • RecreateRecordsWorker: Cron at 05:00 UTC on weekdays (AMER night time)
  • UpdateWorker: scheduled by RecreateRecordsWorker
  • RemovePreviousRecordsWorker: Cron at 00:00 UTC daily. This only does a cleanup.
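The versioned lifecycle the three workers implement can be sketched in plain Ruby. This is an illustration only: all names are hypothetical, and in-memory stand-ins replace the real Sidekiq, ActiveRecord, and Redis plumbing.

```ruby
# Illustrative simulation of the versioned repopulation flow.
class VersionedIndex
  Record = Struct.new(:version, :text, :embedding)

  attr_reader :records, :current_version

  def initialize
    @records = []
    @current_version = 1 # stored in Redis in the real implementation
  end

  # RecreateRecordsWorker: create version N+1 records with no embeddings yet.
  def recreate(texts)
    next_version = @current_version + 1
    texts.each { |t| @records << Record.new(next_version, t, nil) }
    next_version
  end

  # UpdateWorker: fill in one record's embedding, then promote the version
  # once no record of that version is still missing an embedding.
  def update(record, embedding)
    record.embedding = embedding
    return if @records.any? { |r| r.version == record.version && r.embedding.nil? }

    @current_version = record.version
  end

  # RemovePreviousRecordsWorker: delete records older than the current version.
  def cleanup!
    @records.reject! { |r| r.version < @current_version }
  end
end
```

The key property shown is that the current version only advances after every new record has an embedding, so readers never see a partially populated version.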

Calculations

OpenAI rate limits

  • Paid user limits for ada model: 3500 requests per minute OR 70m tokens per minute
  • Number of documents: ~12000
  • Retry limit for OpenAI client: 3
  • Retry for each job: 1
  • Maximum number of OpenAI calls per job (if each call fails on initial and retry): 6
  • Maximum number of OpenAI calls for all jobs: 6*12000 = 72000
  • Absolute minimum number of minutes = 72000/3500 = 21 minutes
  • Safe number of minutes: 90 minutes
  • Best case: 12000 calls / 90 minutes = 133 calls per minute
  • Worst case: 12000*6 calls / 90 minutes = 800 calls per minute

To ensure the rate limit is respected as the number of files scales, we convert calls per minute into files per minute: 1800 files / 90 minutes = 20 files/minute.
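The arithmetic above can be reproduced directly (a sanity check on the numbers, not production code):

```ruby
# Reproduces the rate-limit calculations from this MR description.
documents        = 12_000
client_retries   = 3        # OpenAI client retry limit
job_retries      = 1        # retry per job
calls_per_job    = client_retries * (job_retries + 1)  # 6 worst-case calls
max_calls        = documents * calls_per_job           # 72_000
rate_limit       = 3_500    # requests/minute (paid ada limit)
min_minutes      = (max_calls.to_f / rate_limit).ceil  # ~21 minutes
safe_minutes     = 90
best_case_rate   = documents / safe_minutes            # 133 calls/minute
worst_case_rate  = max_calls / safe_minutes            # 800 calls/minute
files            = 1_800
files_per_minute = files / safe_minutes                # 20 files/minute
```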

Max and Min character for text splitting

  • Max: 1500
  • Min: 100

The constraint is the token limit of the final prompt: the answers from each initial prompt, plus the question prompt, plus the answer must fit within 4097 tokens.

Content that is too small carries no meaningful context, so we enforce a minimum of 100 characters. The 1500 maximum is based on the MVC work, which balanced keeping context against leaving enough tokens for the final prompt.
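A minimal sketch of character-bounded splitting under those limits. The method name and the paragraph-based strategy are illustrative; the real parser also extracts metadata.

```ruby
MAX_CHARS = 1500
MIN_CHARS = 100

# Split content into chunks of at most MAX_CHARS characters, preferring
# paragraph boundaries, and drop fragments below MIN_CHARS since they
# carry too little context to be useful.
def split_into_chunks(content)
  chunks = []
  buffer = +""

  content.split("\n\n").each do |paragraph|
    # Hard-split any paragraph that alone exceeds the maximum.
    paragraph.scan(/.{1,#{MAX_CHARS}}/m).each do |piece|
      if !buffer.empty? && buffer.length + piece.length + 2 > MAX_CHARS
        chunks << buffer
        buffer = +""
      end
      buffer << "\n\n" unless buffer.empty?
      buffer << piece
    end
  end
  chunks << buffer unless buffer.empty?

  chunks.reject { |c| c.length < MIN_CHARS }
end
```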

These numbers are not definitive and should be iterated on.

Database review

Migrate: up

Query plans (the database matches production):

::Embedding::TanukiBotMvc.current

Link to full query plan

::Embedding::TanukiBotMvc.previous

Link to full query plan

::Embedding::TanukiBotMvc.previous.limit(100).delete_all

Link to full query plan

::Embedding::TanukiBotMvc.nil_embeddings_for_version(version).exists?

Link to full query plan
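The batched cleanup behind `::Embedding::TanukiBotMvc.previous.limit(100).delete_all` can be sketched in plain Ruby, with an in-memory array standing in for the table (the real code runs the ActiveRecord query in a loop; names here are illustrative):

```ruby
BATCH_SIZE = 100

# Delete stale rows in fixed-size batches so that no single statement
# touches thousands of rows at once.
def remove_previous_records(records, current_version)
  loop do
    batch = records.select { |r| r[:version] < current_version }.first(BATCH_SIZE)
    break if batch.empty?

    batch.each { |r| records.delete(r) }
  end
  records
end
```

Batching keeps each delete small and cheap, which matters because this worker runs daily against a table that is fully rewritten on every reindex.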

How to set up and validate locally

You will need an OpenAI API key. NOTE: running over the entire dataset makes ~2000 API calls over roughly 3 hours, at fewer than ~30 calls per minute. The paid user limit for the ada model is 3500 requests per minute.

  1. Create the embedding db:
    gdk config set pgvector.enabled true
    gdk config set gitlab.rails.databases.embedding.enabled true
    gdk reconfigure
  2. Enable the feature flag: Feature.enable(:tanuki_bot_indexing)
  3. Execute the coordinating job in a rails console: Llm::TanukiBot::RecreateRecordsWorker.new.perform
  4. Disable the feature flag: Feature.disable(:tanuki_bot_indexing)

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Madelein van Niekerk
