WIP: Improve resillency of queue_background_migration on large tables
What does this MR do?
This tries to solve problem of running BackgroundMigrations of large tables, that has millions of rows, with a lot of data. Today, our background migration code can produce very wide ranges, which makes our queries to timeout. This MR tries to address that problem by introducing a smaller chunks.
The problem
- We have
builds
table that has10.000.000
rows, - We run an expensive query for the batch size of 1000, but since only the data in middle is migrated we hit a problem of very wide range:
1..1.000.000
, but only 1000 rows are matching there, - This is problematic, as we have to traverse each of the
1.000.000
rows to figure out which rows do match, and in case of expensive queries with cross joins this often results in query timeouts, - The query can timeout twice when running db migration code, but also when running background migration code, as BG migration often replicates the same expensive query as db migration code,
- It is almost impossible to optimize such queries with indexes, as they cannot help in complex scenarios with cross joins, like not exists.
Proposed solution
- Use outer join to figure out a small range, the range size (default: 100k) on which we execute an expensive query,
- This means that background migration code will be limited only on a small subset of data, and the wideness between
start id
andend id
will not be larger than the range size, - We need less temporary indexes, and queries are faster to execute,
- The bad side is that BG migration will not have an optimal size of batch size, mostly it will have slightly higher batch size, but not higher than
2xbatch_size-1
, mostly it will be a little overbatch_size
.
Cost
- It adds at least 2 extra cheap queries (that run on the index) for small tables (<=100k),
- It adds at least
2*N*(rows_count/100k)
extra cheap queries (that run on the index), - From the execution of
db/post_migrate
point, it should have a neglectable effect on performance,
Why was this MR needed?
To make queueing to be of predictable behavior for large tables, when we execute expensive queries. One of the cases is in: https://gitlab.com/gitlab-com/infrastructure/issues/4377.
Does this MR meet the acceptance criteria?
-
Changelog entry added, if necessary -
Tests added for this feature/bug - Conform by the code review guidelines
-
Has been reviewed by a Backend maintainer -
Has been reviewed by a Database specialist
-
-
Conform by the merge request performance guides -
Conform by the style guides
What are the relevant issue numbers?
Closes https://gitlab.com/gitlab-com/infrastructure/issues/4377
Edited by Kamil Trzciński