Slow down creation of embeddings DB records
What does this MR do and why?
Problem
We are seeing a lot of FailedToObtainLockError
exceptions on production https://log.gprd.gitlab.net/app/r/s/REZJW compared to very sporadic ones in development or staging.
I assume this is because the production sidekiq cluster has a very high concurrency rate, so we consume jobs from the queues at a high rate, which makes tens or hundreds of jobs, each taking well under a second, compete for the same exclusive lease lock at the same time. On staging or in local development, where we have just a couple of sidekiq threads of concurrency, we do not see this side effect as much.
Proposed solution
Slow down creation of embeddings DB records by enqueueing jobs consecutively with a delay between them, so that even with high concurrency the scheduled times are spread out enough that only a few jobs run at the same time.
So this MR implements a slow-down in scheduling the creation of DB records
for embeddings. This helps prevent a spike in FailedToObtainLockError
exceptions on production, where the sidekiq cluster has a high concurrency
rate, which makes the exclusive lease lock reject creation of some
embeddings.
CreateDbEmbeddingsPerDocFileWorker schedules SetEmbeddingsOnTheRecordWorker jobs consecutively, spaced 1/EMBEDDINGS_PER_SECOND seconds apart.
SetEmbeddingsOnTheRecordWorker implements the throttling logic, so that jobs exceeding the rate limit are re-scheduled with a delay.
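The delayed scheduling described above can be sketched roughly as follows. This is a minimal illustration, not the exact implementation from this MR: the `EMBEDDINGS_PER_SECOND = 7` value matches the ~7 jobs/sec rate mentioned below, and `delayed_schedule` is a hypothetical helper showing only the delay arithmetic.

```ruby
# Assumed constant; the MR's actual value lives in the worker class.
EMBEDDINGS_PER_SECOND = 7
DELAY = 1.0 / EMBEDDINGS_PER_SECOND

# Given the ids of the embedding records created for one doc file,
# compute a schedule where job i runs i * DELAY seconds from now,
# so jobs are spread out instead of enqueued all at once.
def delayed_schedule(record_ids)
  record_ids.each_with_index.map do |id, index|
    [index * DELAY, id] # [delay in seconds, job argument]
  end
end

# In the real worker this would translate to something like:
#   SetEmbeddingsOnTheRecordWorker.perform_in(index * DELAY, id)
```

With 100 single-embedding files this spaces the jobs ~0.14s apart, which is where the ~7 req/sec figure in the cases below comes from.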
Data
This is the approximate embeddings distribution per file that we currently have:
DB data distribution for embeddings
gitlabhq_development_embedding=# select cnt, count(*) as snt2 from (select count(*) as cnt from vertex_gitlab_docs group by metadata->>'source') t group by cnt order by count(*) desc;
cnt | snt2
------+------
1 | 473
3 | 225
2 | 219
4 | 160
5 | 148
6 | 112
7 | 99
8 | 77
9 | 70
10 | 58
11 | 41
13 | 37
12 | 34
14 | 21
16 | 17
17 | 13
15 | 13
18 | 12
20 | 11
19 | 10
22 | 8
21 | 7
23 | 6
27 | 6
26 | 6
41 | 5
28 | 5
25 | 5
33 | 4
30 | 4
29 | 4
37 | 3
49 | 2
54 | 2
35 | 2
31 | 2
50 | 2
24 | 2
175 | 1
39 | 1
45 | 1
63 | 1
62 | 1
34 | 1
46 | 1
32 | 1
57 | 1
1238 | 1
85 | 1
70 | 1
43 | 1
116 | 1
40 | 1
121 | 1
44 | 1
59 | 1
36 | 1
60 | 1
52 | 1
(59 rows)
Case 1: From the data above we can see there are many files with under 10 embeddings. This means that with a high sidekiq consumer concurrency, e.g. 100 sidekiq threads, we can end up processing 100 files at once, all competing for the exclusive lease lock. With this change, jobs are enqueued consecutively with a delay between them based on the number of embeddings a given file has. So if, for instance, we have 100 files with 1 embedding each, we will schedule each SetEmbeddingsOnTheRecordWorker ~1/7 seconds apart, which should result in ~7 req/sec to the embeddings API.
Case 2: Another possibility is processing the "big files". E.g. we have one file with ~1250 embeddings and another with ~175 embeddings. Within CreateDbEmbeddingsPerDocFileWorker we also schedule SetEmbeddingsOnTheRecordWorker at a rate of ~7 jobs per second, so that a single file with a lot of embeddings does not flood the API endpoint.
Case 3: Another possibility is that a lot of jobs were scheduled but did not run for whatever reason, or there was a short outage, e.g. sidekiq was down for a few minutes, so a lot of jobs are now scheduled to run in the past. These jobs will all be picked up as soon as a sidekiq client is available to run them, which can also flood the embeddings API. For this case I've added the throttling mechanism !132901 (diffs), which re-schedules jobs exceeding the 450 jobs/minute rate to a later time. I thought of removing the delayed scheduling from CreateDbEmbeddingsPerDocFileWorker and relying on the rate limiter alone. However, I think the delay is good to keep as a back-up, because relying only on the rate limiter adds extra weight and overhead on sidekiq, which has to run and re-schedule a lot of jobs in a short period. If we removed the delay from CreateDbEmbeddingsPerDocFileWorker, we would schedule ~15K SetEmbeddingsOnTheRecordWorker jobs in ~1 minute, so basically all those jobs would run once just to be re-scheduled again.
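The throttling fallback for Case 3 can be sketched as below. This is a hypothetical illustration, not the MR's code: the real worker uses GitLab's application rate limiter, while `MinuteWindowLimiter` and `perform_or_reschedule` here are invented names showing only the run-or-reschedule decision against a 450 jobs/minute budget.

```ruby
RATE_LIMIT_PER_MINUTE = 450 # budget from the MR description

# Toy fixed-window counter: allows up to `limit` calls per minute window.
class MinuteWindowLimiter
  def initialize(limit)
    @limit = limit
    @counts = Hash.new(0) # minute window -> jobs seen in that window
  end

  # Returns true when the job may run in the current minute window.
  def allow?(now = Time.now)
    window = now.to_i / 60
    @counts[window] += 1
    @counts[window] <= @limit
  end
end

# Sketch of the worker's decision: run the job if under the rate limit,
# otherwise re-schedule it for later instead of hitting the API.
def perform_or_reschedule(limiter, record_id, now = Time.now)
  if limiter.allow?(now)
    :performed    # call the embeddings API and set the embedding
  else
    :rescheduled  # e.g. self.class.perform_in(60, record_id)
  end
end
```

This is why the delayed scheduling is still worth keeping: without it, ~15K jobs would each start, hit this branch, and be re-scheduled, costing a sidekiq run per job for no useful work.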
Here is the update distribution per minute:
gitlabhq_development_embedding=# SELECT COUNT(*) cnt, to_timestamp(floor((extract('epoch' from updated_at) / 60 )) * 60) AT TIME ZONE 'UTC' as interval_alias FROM vertex_gitlab_docs where version = 1 and embedding is not null GROUP BY interval_alias order by interval_alias;
cnt | interval_alias
-----+---------------------
93 | 2023-09-28 13:36:00
327 | 2023-09-28 13:37:00
350 | 2023-09-28 13:38:00
350 | 2023-09-28 13:39:00
280 | 2023-09-28 13:40:00
350 | 2023-09-28 13:41:00
350 | 2023-09-28 13:42:00
350 | 2023-09-28 13:43:00
350 | 2023-09-28 13:44:00
350 | 2023-09-28 13:45:00
350 | 2023-09-28 13:46:00
350 | 2023-09-28 13:47:00
349 | 2023-09-28 13:48:00
281 | 2023-09-28 13:49:00
350 | 2023-09-28 13:50:00
325 | 2023-09-28 13:51:00
375 | 2023-09-28 13:52:00
350 | 2023-09-28 13:53:00
350 | 2023-09-28 13:54:00
280 | 2023-09-28 13:55:00
350 | 2023-09-28 13:56:00
328 | 2023-09-28 13:57:00
372 | 2023-09-28 13:58:00
302 | 2023-09-28 13:59:00
329 | 2023-09-28 14:00:00
379 | 2023-09-28 14:01:00
350 | 2023-09-28 14:02:00
292 | 2023-09-28 14:03:00
350 | 2023-09-28 14:04:00
310 | 2023-09-28 14:05:00
320 | 2023-09-28 14:06:00
350 | 2023-09-28 14:07:00
280 | 2023-09-28 14:08:00
350 | 2023-09-28 14:09:00
350 | 2023-09-28 14:10:00
350 | 2023-09-28 14:11:00
348 | 2023-09-28 14:12:00
281 | 2023-09-28 14:13:00
378 | 2023-09-28 14:14:00
285 | 2023-09-28 14:15:00
353 | 2023-09-28 14:16:00
332 | 2023-09-28 14:17:00
349 | 2023-09-28 14:18:00
357 | 2023-09-28 14:19:00
280 | 2023-09-28 14:20:00
348 | 2023-09-28 14:21:00
280 | 2023-09-28 14:22:00
337 | 2023-09-28 14:23:00
(48 rows)
How to set up and validate locally
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.