
WIP: Improve resiliency of queue_background_migration on large tables

What does this MR do?

This MR addresses the problem of running background migrations on large tables that have millions of rows and a lot of data. Today, our background migration code can produce very wide ID ranges, which causes our queries to time out. This MR tries to address that by splitting the work into smaller chunks.

The problem

  1. We have a builds table with 10,000,000 rows.
  2. We run an expensive query with a batch size of 1000, but because only the data in the middle of the table needs migrating, we end up with a very wide range: 1..1,000,000, in which only 1000 rows match.
  3. This is problematic because we have to traverse each of the 1,000,000 rows to find the ones that match, and for expensive queries with cross joins this often results in query timeouts.
  4. The query can time out twice: once when running the db migration code, and again when running the background migration code, as the BG migration often replicates the same expensive query as the db migration code.
  5. It is almost impossible to optimize such queries with indexes, as indexes cannot help in complex scenarios with cross joins, such as NOT EXISTS.

Proposed solution

  1. Use an outer join to figure out a small range, bounded by the range size (default: 100k), on which we execute the expensive query.
  2. This means that the background migration code is limited to a small subset of data, and the distance between start id and end id is never larger than the range size.
  3. We need fewer temporary indexes, and queries are faster to execute.
  4. The downside is that BG migration jobs will not get an exact batch size: most batches will be a little over batch_size, but never larger than 2 x batch_size - 1.
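The idea above can be sketched in plain Ruby. This is only an illustration under stated assumptions, not the code in this MR: `plan_ranges` is a hypothetical helper, the sorted `matching_ids` array stands in for the result of the expensive query restricted to one window, and merging a short tail into the previous batch is just one way to arrive at the 2 x batch_size - 1 upper bound mentioned above.

```ruby
# Hypothetical helper (not GitLab's API): plans [start_id, end_id] pairs for
# background migration jobs. `matching_ids` must be sorted in ascending order.
def plan_ranges(min_id, max_id, matching_ids, batch_size: 1_000, range_size: 100_000)
  jobs = []
  window_start = min_id

  while window_start <= max_id
    # Never look at more than range_size ids at once.
    window_end = [window_start + range_size - 1, max_id].min

    # Stand-in for the expensive query, limited to the current window.
    in_window = matching_ids.select { |id| id >= window_start && id <= window_end }

    chunks = in_window.each_slice(batch_size).to_a

    # Assumption: a short tail is merged into the previous chunk, so a batch
    # holds at most 2 * batch_size - 1 rows.
    if chunks.size > 1 && chunks.last.size < batch_size
      tail = chunks.pop
      chunks[-1] = chunks[-1] + tail
    end

    # Each job covers a range that lies fully inside one window.
    chunks.each { |chunk| jobs << [chunk.first, chunk.last] }

    window_start = window_end + 1
  end

  jobs
end

# Rows to migrate sit in the middle of a table with ids 1..1_000_000.
ids = (400_001..401_000).to_a + [950_000]
p plan_ranges(1, 1_000_000, ids)
```

With this input no job spans the full 1..1,000,000 range; each job stays inside a single 100k window, which is the property point 2 relies on.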

Cost

  1. It adds at least 2 extra cheap queries (that run on the index) for small tables (<=100k rows).
  2. It adds at least 2*N*(rows_count/100k) extra cheap queries (that run on the index) for larger tables.
  3. From the point of view of db/post_migrate execution, it should have a negligible effect on performance.
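To get a feel for these numbers, the lower bound can be evaluated with a few lines of Ruby. This is a rough sketch: `extra_queries` is a hypothetical helper, N is not defined in this description so it is kept as a parameter `n`, and rounding rows_count/100k up to whole ranges is an assumption.

```ruby
# Hypothetical helper: lower bound on the extra index-only queries.
# `n` is the N factor from the formula above; its meaning is not defined here.
def extra_queries(rows_count, n: 1, range_size: 100_000)
  # Small tables fit in a single range: at least 2 extra queries.
  return 2 if rows_count <= range_size

  # Assumption: partial ranges count as whole ones (ceiling division).
  ranges = (rows_count + range_size - 1) / range_size
  2 * n * ranges
end

puts extra_queries(50_000)      # small table
puts extra_queries(10_000_000)  # the builds table from the example, with n = 1
```

For the 10,000,000-row builds table and n = 1 this gives 200 extra cheap queries, which is small next to the cost of a single timed-out expensive query.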

Why was this MR needed?

To make queueing behave predictably on large tables when we execute expensive queries. One such case is https://gitlab.com/gitlab-com/infrastructure/issues/4377.

Does this MR meet the acceptance criteria?

What are the relevant issue numbers?

Closes https://gitlab.com/gitlab-com/infrastructure/issues/4377

Edited by Kamil Trzciński | At GitLab Summit
