WIP: Improve resillency of queue_background_migration on large tables (!19744) · Merge requests · GitLab.org / GitLab FOSS

Kamil Trzciński requested to merge improve-queue-background-migration into master Jun 13, 2018

What does this MR do?

This tries to solve problem of running BackgroundMigrations of large tables, that has millions of rows, with a lot of data. Today, our background migration code can produce very wide ranges, which makes our queries to timeout. This MR tries to address that problem by introducing a smaller chunks.

The problem

We have builds table that has 10.000.000 rows,
We run an expensive query for the batch size of 1000, but since only the data in middle is migrated we hit a problem of very wide range: 1..1.000.000, but only 1000 rows are matching there,
This is problematic, as we have to traverse each of the 1.000.000 rows to figure out which rows do match, and in case of expensive queries with cross joins this often results in query timeouts,
The query can timeout twice when running db migration code, but also when running background migration code, as BG migration often replicates the same expensive query as db migration code,
It is almost impossible to optimize such queries with indexes, as they cannot help in complex scenarios with cross joins, like not exists.

Proposed solution

Use outer join to figure out a small range, the range size (default: 100k) on which we execute an expensive query,
This means that background migration code will be limited only on a small subset of data, and the wideness between start id and end id will not be larger than the range size,
We need less temporary indexes, and queries are faster to execute,
The bad side is that BG migration will not have an optimal size of batch size, mostly it will have slightly higher batch size, but not higher than 2xbatch_size-1, mostly it will be a little over batch_size.

Cost

It adds at least 2 extra cheap queries (that run on the index) for small tables (<=100k),
It adds at least 2*N*(rows_count/100k) extra cheap queries (that run on the index),
From the execution of db/post_migrate point, it should have a neglectable effect on performance,

Why was this MR needed?

To make queueing to be of predictable behavior for large tables, when we execute expensive queries. One of the cases is in: https://gitlab.com/gitlab-com/infrastructure/issues/4377.

Does this MR meet the acceptance criteria?

Changelog entry added, if necessary
Tests added for this feature/bug
Conform by the code review guidelines
- Has been reviewed by a Backend maintainer
- Has been reviewed by a Database specialist
Conform by the merge request performance guides
Conform by the style guides

What are the relevant issue numbers?

Closes https://gitlab.com/gitlab-com/infrastructure/issues/4377

Edited Jun 13, 2018 by Kamil Trzciński

WIP: Improve resillency of queue_background_migration on large tables