
Add Tanuki Bot indexer

Terri Chu requested to merge tchu-tanuki-bot-indexer into master

What does this MR do and why?

Repopulates the Embedding::TanukiBotMvc table on a schedule, because the table is built from GitLab documentation, which is constantly being updated.

Because every record needs an embedding, which is obtained by making an API call to OpenAI, and there are 2000 records, it is not a good idea to drop and recreate everything in one transaction. Instead, we opt for a spread-out approach using a new version field on the table:

We have one coordinating cron worker which:

  1. Reads every file (currently only the /doc/ directory, but this could be expanded)
  2. Parses the content of each file into metadata and pieces of text within character-length boundaries
  3. For each piece of text, creates a new record with version = current version + 1 and schedules a worker which:
    1. Updates the record's embedding by making a single OpenAI API call
    2. Checks whether all new records have been updated and, if so, updates the current version in Redis

We also schedule a cron worker to run daily which deletes previous records (version < current version).

Therefore we have 3 workers:

  • RecreateRecordsWorker: Cron at 05:00 UTC on weekdays (AMER night time)
  • UpdateWorker: scheduled by RecreateRecordsWorker
  • RemovePreviousRecordsWorker: Cron at 00:00 UTC daily. This only does a cleanup.
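The versioned lifecycle the three workers implement can be sketched in plain Ruby. This is an illustration only: all names are hypothetical, and in-memory stand-ins replace the real Sidekiq, ActiveRecord, and Redis plumbing.

```ruby
# Illustrative simulation of the versioned repopulation flow.
class VersionedIndex
  Record = Struct.new(:version, :text, :embedding)

  attr_reader :records, :current_version

  def initialize
    @records = []
    @current_version = 1 # stored in Redis in the real implementation
  end

  # RecreateRecordsWorker: create version N+1 records with no embeddings yet.
  def recreate(texts)
    next_version = @current_version + 1
    texts.each { |t| @records << Record.new(next_version, t, nil) }
    next_version
  end

  # UpdateWorker: fill in one record's embedding, then promote the version
  # once no record of that version is still missing an embedding.
  def update(record, embedding)
    record.embedding = embedding
    return if @records.any? { |r| r.version == record.version && r.embedding.nil? }

    @current_version = record.version
  end

  # RemovePreviousRecordsWorker: delete records older than the current version.
  def cleanup!
    @records.reject! { |r| r.version < @current_version }
  end
end
```

The key property shown is that the current version only advances after every new record has an embedding, so readers never see a partially populated version.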

Calculations

OpenAI rate limits

  • Paid user limits for ada model: 3500 requests per minute OR 70m tokens per minute
  • Number of documents: ~12000
  • Retry limit for OpenAI client: 3
  • Retry for each job: 1
  • Maximum number of OpenAI calls per job (if each call fails on initial and retry): 6
  • Maximum number of OpenAI calls for all jobs: 6*12000 = 72000
  • Absolute minimum number of minutes = 72000/3500 = 21 minutes
  • Safe number of minutes: 90 minutes
  • Best case: 12000 calls / 90 minutes = 133 calls per minute
  • Worst case: 12000*6 calls / 90 minutes = 800 calls per minute

To ensure the rate limit is respected as the number of files scales, we convert calls per minute into files per minute: 1800 files / 90 minutes = 20 files/minute.
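The arithmetic above can be reproduced directly (a sanity check on the numbers, not production code):

```ruby
# Reproduces the rate-limit calculations from this MR description.
documents        = 12_000
client_retries   = 3        # OpenAI client retry limit
job_retries      = 1        # retry per job
calls_per_job    = client_retries * (job_retries + 1)  # 6 worst-case calls
max_calls        = documents * calls_per_job           # 72_000
rate_limit       = 3_500    # requests/minute (paid ada limit)
min_minutes      = (max_calls.to_f / rate_limit).ceil  # ~21 minutes
safe_minutes     = 90
best_case_rate   = documents / safe_minutes            # 133 calls/minute
worst_case_rate  = max_calls / safe_minutes            # 800 calls/minute
files            = 1_800
files_per_minute = files / safe_minutes                # 20 files/minute
```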

Max and Min character for text splitting

  • Max: 1500
  • Min: 100

The constraint is the token limit of the final prompt: the answers from each initial prompt, plus the question prompt, plus the answer must fit within 4097 tokens.

Content that is too small carries no meaningful context, so we enforce a minimum of 100 characters. The 1500 maximum is based on the MVC work, which balanced keeping context against leaving enough tokens for the final prompt.
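A minimal sketch of character-bounded splitting under those limits. The method name and the paragraph-based strategy are illustrative; the real parser also extracts metadata.

```ruby
MAX_CHARS = 1500
MIN_CHARS = 100

# Split content into chunks of at most MAX_CHARS characters, preferring
# paragraph boundaries, and drop fragments below MIN_CHARS since they
# carry too little context to be useful.
def split_into_chunks(content)
  chunks = []
  buffer = +""

  content.split("\n\n").each do |paragraph|
    # Hard-split any paragraph that alone exceeds the maximum.
    paragraph.scan(/.{1,#{MAX_CHARS}}/m).each do |piece|
      if !buffer.empty? && buffer.length + piece.length + 2 > MAX_CHARS
        chunks << buffer
        buffer = +""
      end
      buffer << "\n\n" unless buffer.empty?
      buffer << piece
    end
  end
  chunks << buffer unless buffer.empty?

  chunks.reject { |c| c.length < MIN_CHARS }
end
```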

These numbers are not definitive and should be iterated on.

Database review

Migrate: up

Query plans (the database matches production):

::Embedding::TanukiBotMvc.current

Link to full query plan

::Embedding::TanukiBotMvc.previous

Link to full query plan

::Embedding::TanukiBotMvc.previous.limit(100).delete_all

Link to full query plan

::Embedding::TanukiBotMvc.nil_embeddings_for_version(version).exists?

Link to full query plan
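The batched cleanup behind `::Embedding::TanukiBotMvc.previous.limit(100).delete_all` can be sketched in plain Ruby, with an in-memory array standing in for the table (the real code runs the ActiveRecord query in a loop; names here are illustrative):

```ruby
BATCH_SIZE = 100

# Delete stale rows in fixed-size batches so that no single statement
# touches thousands of rows at once.
def remove_previous_records(records, current_version)
  loop do
    batch = records.select { |r| r[:version] < current_version }.first(BATCH_SIZE)
    break if batch.empty?

    batch.each { |r| records.delete(r) }
  end
  records
end
```

Batching keeps each delete small and cheap, which matters because this worker runs daily against a table that is fully rewritten on every reindex.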

How to set up and validate locally

You will need an OpenAI API key. NOTE: running over the entire dataset makes ~2000 API calls over roughly 3 hours, at fewer than ~30 calls per minute. The paid user limit for the ada model is 3500 requests per minute.

  1. Create the embedding db:
    gdk config set pgvector.enabled true
    gdk config set gitlab.rails.databases.embedding.enabled true
    gdk reconfigure
  2. Enable the feature flag: Feature.enable(:tanuki_bot_indexing)
  3. Execute the coordinating job in a rails console: Llm::TanukiBot::RecreateRecordsWorker.new.perform
  4. Disable the feature flag: Feature.disable(:tanuki_bot_indexing)

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Madelein van Niekerk
