Slow down creation of embeddings DB records
What does this MR do and why?
Problem
We are seeing a lot of FailedToObtainLockError
exceptions on production https://log.gprd.gitlab.net/app/r/s/REZJW compared to very sporadic ones in development or staging.
I assume this is because the production sidekiq cluster has a very high concurrency rate, so we consume jobs from the queues at a high rate, which makes tens or hundreds of jobs, each taking well under a second, compete for the same exclusive lease lock at the same time. On staging or in local development, where we have just a couple of sidekiq threads of concurrency, we do not see this side effect as much.
Proposed solution
Slow down creation of embeddings DB records by enqueueing jobs consecutively with a delay between them, so that even with high concurrency the scheduled times are spread out enough that only a few jobs run at the same time.
So this MR implements a slow-down in scheduling the creation of DB records
for embeddings. This helps prevent a spike in FailedToObtainLockError
exceptions on production, where the sidekiq cluster has a high concurrency
rate, which makes the exclusive lease lock reject creation of some
embeddings.
CreateDbEmbeddingsPerDocFileWorker schedules SetEmbeddingsOnTheRecordWorker jobs consecutively, spaced 1/EMBEDDINGS_PER_SECOND seconds apart.
SetEmbeddingsOnTheRecordWorker implements the throttling logic, so that jobs exceeding the rate limit are re-scheduled with a delay.
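The delayed scheduling described above can be sketched roughly as follows. This is a minimal illustration, not the exact implementation from this MR: the `EMBEDDINGS_PER_SECOND = 7` value matches the ~7 jobs/sec rate mentioned below, and `delayed_schedule` is a hypothetical helper showing only the delay arithmetic.

```ruby
# Assumed constant; the MR's actual value lives in the worker class.
EMBEDDINGS_PER_SECOND = 7
DELAY = 1.0 / EMBEDDINGS_PER_SECOND

# Given the ids of the embedding records created for one doc file,
# compute a schedule where job i runs i * DELAY seconds from now,
# so jobs are spread out instead of enqueued all at once.
def delayed_schedule(record_ids)
  record_ids.each_with_index.map do |id, index|
    [index * DELAY, id] # [delay in seconds, job argument]
  end
end

# In the real worker this would translate to something like:
#   SetEmbeddingsOnTheRecordWorker.perform_in(index * DELAY, id)
```

With 100 single-embedding files this spaces the jobs ~0.14s apart, which is where the ~7 req/sec figure in the cases below comes from.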
Data
This is the approximate embeddings distribution per file that we currently have:
DB data distribution for embeddings
gitlabhq_development_embedding=# select cnt, count(*) as snt2 from (select count(*) as cnt from vertex_gitlab_docs group by metadata->>'source') t group by cnt order by count(*) desc;
cnt | snt2
------+------
1 | 473
3 | 225
2 | 219
4 | 160
5 | 148
6 | 112
7 | 99
8 | 77
9 | 70
10 | 58
11 | 41
13 | 37
12 | 34
14 | 21
16 | 17
17 | 13
15 | 13
18 | 12
20 | 11
19 | 10
22 | 8
21 | 7
23 | 6
27 | 6
26 | 6
41 | 5
28 | 5
25 | 5
33 | 4
30 | 4
29 | 4
37 | 3
49 | 2
54 | 2
35 | 2
31 | 2
50 | 2
24 | 2
175 | 1
39 | 1
45 | 1
63 | 1
62 | 1
34 | 1
46 | 1
32 | 1
57 | 1
1238 | 1
85 | 1
70 | 1
43 | 1
116 | 1
40 | 1
121 | 1
44 | 1
59 | 1
36 | 1
60 | 1
52 | 1
(59 rows)
Case 1: From the data above we can see there are many files with under 10 embeddings. This means that with a high sidekiq consumer concurrency, e.g. 100 sidekiq threads, we can end up processing 100 files at once, all competing for the exclusive lease lock. With this change, jobs are enqueued consecutively with a delay between them based on the number of embeddings a given file has. So if, for instance, we have 100 files with 1 embedding each, we will schedule each SetEmbeddingsOnTheRecordWorker ~1/7 seconds apart, which should result in ~7 req/sec to the embeddings API.
Case 2: Another possibility is processing the "big files". E.g. we have one file with ~1250 embeddings and another with ~175 embeddings. Within CreateDbEmbeddingsPerDocFileWorker we also schedule SetEmbeddingsOnTheRecordWorker at a rate of ~7 jobs per second, so that a single file with a lot of embeddings does not flood the API endpoint.
Case 3: Another possibility is that a lot of jobs were scheduled but did not run for whatever reason, or there was a short outage, e.g. sidekiq was down for a few minutes, so a lot of jobs are now scheduled to run in the past. These jobs will all be picked up as soon as a sidekiq client is available to run them, which can also flood the embeddings API. For this case I've added the throttling mechanism !132901 (diffs), which re-schedules jobs exceeding the 450 jobs/minute rate to a later time. I thought of removing the delayed scheduling from CreateDbEmbeddingsPerDocFileWorker and relying on the rate limiter alone. However, I think the delay is good to keep as a back-up, because relying only on the rate limiter adds extra weight and overhead on sidekiq, which has to run and re-schedule a lot of jobs in a short period. If we removed the delay from CreateDbEmbeddingsPerDocFileWorker, we would schedule ~15K SetEmbeddingsOnTheRecordWorker jobs in ~1 minute, so basically all those jobs would run once just to be re-scheduled again.
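The throttling fallback for Case 3 can be sketched as below. This is a hypothetical illustration, not the MR's code: the real worker uses GitLab's application rate limiter, while `MinuteWindowLimiter` and `perform_or_reschedule` here are invented names showing only the run-or-reschedule decision against a 450 jobs/minute budget.

```ruby
RATE_LIMIT_PER_MINUTE = 450 # budget from the MR description

# Toy fixed-window counter: allows up to `limit` calls per minute window.
class MinuteWindowLimiter
  def initialize(limit)
    @limit = limit
    @counts = Hash.new(0) # minute window -> jobs seen in that window
  end

  # Returns true when the job may run in the current minute window.
  def allow?(now = Time.now)
    window = now.to_i / 60
    @counts[window] += 1
    @counts[window] <= @limit
  end
end

# Sketch of the worker's decision: run the job if under the rate limit,
# otherwise re-schedule it for later instead of hitting the API.
def perform_or_reschedule(limiter, record_id, now = Time.now)
  if limiter.allow?(now)
    :performed    # call the embeddings API and set the embedding
  else
    :rescheduled  # e.g. self.class.perform_in(60, record_id)
  end
end
```

This is why the delayed scheduling is still worth keeping: without it, ~15K jobs would each start, hit this branch, and be re-scheduled, costing a sidekiq run per job for no useful work.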
Here is the update distribution per minute:
gitlabhq_development_embedding=# SELECT COUNT(*) cnt, to_timestamp(floor((extract('epoch' from updated_at) / 60 )) * 60) AT TIME ZONE 'UTC' as interval_alias FROM vertex_gitlab_docs where version = 1 and embedding is not null GROUP BY interval_alias order by interval_alias;
cnt | interval_alias
-----+---------------------
93 | 2023-09-28 13:36:00
327 | 2023-09-28 13:37:00
350 | 2023-09-28 13:38:00
350 | 2023-09-28 13:39:00
280 | 2023-09-28 13:40:00
350 | 2023-09-28 13:41:00
350 | 2023-09-28 13:42:00
350 | 2023-09-28 13:43:00
350 | 2023-09-28 13:44:00
350 | 2023-09-28 13:45:00
350 | 2023-09-28 13:46:00
350 | 2023-09-28 13:47:00
349 | 2023-09-28 13:48:00
281 | 2023-09-28 13:49:00
350 | 2023-09-28 13:50:00
325 | 2023-09-28 13:51:00
375 | 2023-09-28 13:52:00
350 | 2023-09-28 13:53:00
350 | 2023-09-28 13:54:00
280 | 2023-09-28 13:55:00
350 | 2023-09-28 13:56:00
328 | 2023-09-28 13:57:00
372 | 2023-09-28 13:58:00
302 | 2023-09-28 13:59:00
329 | 2023-09-28 14:00:00
379 | 2023-09-28 14:01:00
350 | 2023-09-28 14:02:00
292 | 2023-09-28 14:03:00
350 | 2023-09-28 14:04:00
310 | 2023-09-28 14:05:00
320 | 2023-09-28 14:06:00
350 | 2023-09-28 14:07:00
280 | 2023-09-28 14:08:00
350 | 2023-09-28 14:09:00
350 | 2023-09-28 14:10:00
350 | 2023-09-28 14:11:00
348 | 2023-09-28 14:12:00
281 | 2023-09-28 14:13:00
378 | 2023-09-28 14:14:00
285 | 2023-09-28 14:15:00
353 | 2023-09-28 14:16:00
332 | 2023-09-28 14:17:00
349 | 2023-09-28 14:18:00
357 | 2023-09-28 14:19:00
280 | 2023-09-28 14:20:00
348 | 2023-09-28 14:21:00
280 | 2023-09-28 14:22:00
337 | 2023-09-28 14:23:00
(48 rows)
How to set up and validate locally
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.