Background migrations cause statement timeouts if there is a huge gap in a batch
Description
Background migrations is a great tool to migrate data from an old structure to a new structure, and it works asynchronously, so rake db:migrate
is much faster than regular post-migrations.
However, there is a problem with big tables. If there is a huge gap between minimal ID and maximum ID, background migrations will likely fail by statement timeouts.
In fact, recently we had three such cases;
- Ci::Stage migration => 4 million gap in a batch
- Legacy artifacts migration => 2 million gap in a batch
- Legacy job traces migration => We don't have a extact number, but probably 10-20 million gap in a batch
We were dealing with those problems by adding database indexes, and it worked at some points, but selecting from a huge dataset naturally takes time, and thus still there is a chance of timeouts.
Data on gitlab.com is growing rapidly. Probably ci_builds
reach 100 million rows by the end of this year.
We need a long-term solution to let us easily and safely migrate data.
Proposal
There are some proposals from the discussions on slack.
- Splitting by a range => https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/19744)
- Having
background_migrations
table to track the background migration status, and repeatedly migrate data with cron worker - Chaining sidekiq jobs by passing arguments from job to job (This requires Fixing unreliable sidekiq queue)