database migration issues in large instances upgrading to 14.5 or later (merge_request_diff_commits, Command timed out after 3600s)
Summary
A support ticket was raised for an Omnibus GitLab upgrade from 14.1 to 14.7 that failed as follows. GitLab team members can read more in the ticket.
FATAL: Mixlib::ShellOut::CommandTimeout: rails_migration[gitlab-rails]
(gitlab::database_migrations line 51) had an error: Mixlib::ShellOut::CommandTimeout:
bash[migrate gitlab-rails database]
(/opt/gitlab/embedded/cookbooks/cache/cookbooks/gitlab/resources/rails_migration.rb line 16)
had an error: Mixlib::ShellOut::CommandTimeout: Command timed out after 3600s:
See omnibus-gitlab#6677 (closed) for how to work around this in general. Note, however, that if the database migrations exceeded 60 minutes, their full run time may be much longer, for example over 30 hours.
This issue is about the code in a 14.5 migration (20211012134316, class CleanUpMigrateMergeRequestDiffCommitUsers) which executes pending MigrateMergeRequestDiffCommitUsers background migrations.
Related:
CleanUpMigrateMergeRequestDiffCommitUsers issue and MR:
Steps to reproduce
- Upgrade to GitLab 14.1
- Not all MigrateMergeRequestDiffCommitUsers batches complete.
- Upgrade to 14.5 or later.
Instance has to be large enough for the batches to exceed the one hour timeout.
What is the current bug behavior?
The upgrade fails because it is assumed that self-hosted instances should have their migrations finished a long time ago.
However, self-managed instances:
- Can observe the same issues that occurred on GitLab.com.
- Don't all follow zero downtime upgrades.
Instances that skip 14.3, or that do not wait on 14.3 for the background migrations to complete, will not get the benefit of the fix that GitLab.com received in 14.3 (GitLab.com also needed manual work to complete batches using the 14.1 code).
What is the expected correct behavior?
The upgrade completes. However, this issue is about documenting how to diagnose and resolve the problem.
Possible fixes and workarounds.
Summary
More than likely, fix forward will be the required approach.
If it's possible to detect this proactively, and avoid a fix forward, then:
- Establish that an instance is likely to have this issue, either by assessing the size of the database, or by discovering outstanding MigrateMergeRequestDiffCommitUsers batches on a 14.1 or 14.2 instance.
- Upgrade to 14.3.
- Wait for all batches in the revised MigrateMergeRequestDiffCommitUsers 14.3 migration to complete.
- Upgrade to a later release.
The 14.1 / 14.3 migrations have to complete before the 14.5 migration can run. So, once an instance is upgrading to 14.5, the only options are:
- Back out
- Fix forward
Fix forward.
If GitLab is working OK:
- Run:
  sudo gitlab-rake db:migrate
- Wait. This may take a long time. The customer tried this, and cancelled it after 31 hours.
If GitLab is not working correctly:
For example, merge request approval rules and merging may be broken, which has been reported to GitLab support (link for team members).
This is caused by database changes being queued up behind the long-running merge_request_diff_commits changes, and the GitLab code being unable to run correctly with the current state of the database.
- Set aside the problematic migrations:
  sudo gitlab-rake gitlab:db:mark_migration_complete[20211012134316]
  sudo gitlab-rake gitlab:db:mark_migration_complete[20211012143815]
- Run all the rest of the outstanding migrations:
  sudo gitlab-rake db:migrate
- Back out the first step, via the PostgreSQL database console (sudo gitlab-psql):
  DELETE FROM schema_migrations WHERE version IN ('20211012134316', '20211012143815');
- Run these two migrations. This will take a long time.
  sudo gitlab-rake db:migrate
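While working through these steps, it can help to confirm which migrations Rails still considers pending. The following is a minimal sketch and not part of the original workaround. It assumes the standard Rails `db:migrate:status` task is reachable through `gitlab-rake`, and that its output lists one migration per line with a status column (`up` or `down`) followed by the version. The `pending_migrations` function name is hypothetical.

```shell
# Hypothetical helper: reads `db:migrate:status` output on stdin and prints
# the version of every migration whose status column is "down" (not yet run).
pending_migrations() {
  awk '$1 == "down" && $2 ~ /^[0-9]+$/ { print $2 }'
}

# Assumed usage on an Omnibus GitLab node:
#   sudo gitlab-rake db:migrate:status | pending_migrations
```

Once the final `db:migrate` has finished, neither 20211012134316 nor 20211012143815 should appear in the list.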
Please comment on the issue about your success, or otherwise, with this workaround.
Proactive
Database size assessment.
- The relevant table is merge_request_diff_commits.
- Run a database console:
  sudo gitlab-psql
- Query:
  select n.nspname as table_schema, c.relname as table_name, c.reltuples as rows
  from pg_class c join pg_namespace n on n.oid = c.relnamespace
  where c.relkind = 'r' and n.nspname not in ('information_schema','pg_catalog')
  order by c.reltuples desc limit 50;
- On the ticket, the merge_request_diff_commits table was around 9 million rows (9.123456e+06), the sixth largest unique table (ignoring web_hook_logs_archived), and only ten tables had more than 1 million rows.
Check for pending 14.1 batches
- Run in the database console:
  sudo gitlab-psql
  select status, count(*) from background_migration_jobs
  where class_name = 'MigrateMergeRequestDiffCommitUsers' group by status;
- When completed, the batches (eight in total) will move from status 0 to status 1. Here, only 25% are complete.
status | count
--------+-------
0 | 6
1 | 2
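The completion percentage can be worked out directly from this output. As a rough sketch, the following sums the counts and reports the share of batches at status 1. The `percent_complete` function is a hypothetical helper, and passing the query with `-c` assumes that `gitlab-psql` forwards its arguments to `psql`.

```shell
# Hypothetical helper: reads "status | count" rows on stdin (psql's default
# aligned output) and prints the share of batches that have status 1.
percent_complete() {
  awk -F'|' '
    # Only consider rows where both columns are plain integers; this skips
    # the header, the separator line and the "(N rows)" footer.
    $1 ~ /^[ \t]*[0-9]+[ \t]*$/ && $2 ~ /^[ \t]*[0-9]+[ \t]*$/ {
      total += $2
      if ($1 + 0 == 1) finished += $2
    }
    END {
      if (total == 0) { print "no batches found"; exit 1 }
      printf "%d%% complete (%d of %d batches)\n", finished * 100 / total, finished, total
    }'
}

# Assumed usage:
#   sudo gitlab-psql -c "select status, count(*) from background_migration_jobs
#     where class_name = 'MigrateMergeRequestDiffCommitUsers' group by status;" | percent_complete
```

For the output shown above (6 batches at status 0, 2 at status 1) this reports 25% complete.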
GitLab 14.3
!68769 (merged) was introduced to resolve the final 12% of the batches on GitLab.com.
Migration 20210901153324 is added; it reduces the batch size and makes other optimizations.
This can be used on self-managed instances to complete the migrations before upgrading to 14.5 or later.
All "failed" batches from 14.1 will be cancelled, and the work rescheduled by the new migration.
Use the same query to monitor the work:
- Run in the database console:
  sudo gitlab-psql
  select status, count(*) from background_migration_jobs
  where class_name = 'MigrateMergeRequestDiffCommitUsers' group by status;
- When completed, the batches will move from status 0 to status 1.
- Work still to do:
status | count
--------+-------
0 | 16
1 | 8
- All done
status | count
--------+-------
1 | 24
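Because the wait can be long, the check can be left running unattended. The following is a minimal polling sketch, not a tested procedure. It assumes that `gitlab-psql` passes the `-t`, `-A` and `-c` options through to `psql`, and that status 0 is the only "not yet complete" state, as in the output above. `count_pending` and `wait_for_batches` are hypothetical names, and the 300 second default interval is arbitrary.

```shell
# Hypothetical hook: prints the number of batches still at status 0.
# Assumes gitlab-psql forwards -t (tuples only), -A (unaligned) and -c to psql.
count_pending() {
  sudo gitlab-psql -t -A -c "select count(*) from background_migration_jobs
    where class_name = 'MigrateMergeRequestDiffCommitUsers' and status = 0;"
}

# Polls count_pending until it reports 0.
# Returns 1 if the count cannot be read or is not a number.
wait_for_batches() {
  interval=${1:-300}   # seconds between checks (arbitrary default)
  while true; do
    pending=$(count_pending) || return 1
    case $pending in
      ''|*[!0-9]*) echo "unexpected output: $pending" >&2; return 1 ;;
    esac
    if [ "$pending" -eq 0 ]; then
      echo "all batches complete"
      return 0
    fi
    echo "$pending batches still pending"
    sleep "$interval"
  done
}
```

Calling `wait_for_batches 600` would check every ten minutes and return once no batches remain at status 0.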
Monitor the GitLab 14.5 migration
There is a 14.5 migration referred to in the upgrade notes.
Optionally, monitor it with the following query; the batches will be complete when all records have a status of 1.
select status, count(*) from background_migration_jobs
where class_name = 'FixMergeRequestDiffCommitUsers' group by status;