Retry `BackfillIssueSearchData` batched background migration
Production Change
Change Summary
4 jobs in the BackfillIssueSearchData migration failed due to statement timeouts.
Around 800k issue records still need to be updated by the migration.
gitlabhq_dblab=# SELECT COUNT(*), SUM(batch_size), status FROM batched_background_migration_jobs WHERE batched_background_migration_id = 116 GROUP BY status;
 count |   sum    | status
-------+----------+--------
     4 |   799293 |      2
   254 | 58557962 |      3
(2 rows)
These can be retried from the admin panel: https://docs.gitlab.com/ee/update/index.html#check-the-status-of-batched-background-migrations
UPDATE: after a few retries, only ~91k records need to be updated. We want to lower the sub-batch size before we retry again to reduce the chance of timeouts.
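To see why a smaller sub-batch size helps: each sub-batch maps to one UPDATE statement, and each statement must finish within the statement timeout. A self-contained back-of-the-envelope sketch in Ruby (only the ~91k figure comes from this issue; the previous sub-batch size of 1,000 is a hypothetical placeholder, not a value taken from the migration):

```ruby
# Rough arithmetic for sub-batch sizing. Only the remaining-record count
# comes from this issue; the old sub_batch_size of 1_000 is hypothetical.
remaining_records = 91_000
old_sub_batch = 1_000 # hypothetical previous value
new_sub_batch = 200   # value set in the change steps

old_statements = (remaining_records / old_sub_batch.to_f).ceil
new_statements = (remaining_records / new_sub_batch.to_f).ceil

puts "UPDATE statements at sub_batch_size=#{old_sub_batch}: #{old_statements}"
puts "UPDATE statements at sub_batch_size=#{new_sub_batch}: #{new_statements}"
# Each statement touches ~5x fewer rows, so it is much less likely to hit
# the statement timeout, at the cost of more round trips.
```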
Change Details
- Services Impacted - Service::Postgres, Service::GitLab Rails
- Change Technician - @gsgl
- Change Reviewer - @kwanyangu
- Time tracking - 20 minutes
- Downtime Component - N/A
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 1
- Set label change::in-progress on this issue
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 10
- In a Rails console:
Gitlab::Database::BackgroundMigration::BatchedJob.where(batched_background_migration_id: 116).with_status(:failed).count
# confirm that the above returns 2
Gitlab::Database::BackgroundMigration::BatchedJob.where(batched_background_migration_id: 116).with_status(:failed).update_all(sub_batch_size: 200)
The update query this generates is:
UPDATE "batched_background_migration_jobs"
SET "sub_batch_size" = 200
WHERE "batched_background_migration_jobs"."batched_background_migration_id" = 116
AND ("batched_background_migration_jobs"."status" IN (2))
Query plan on dblab:
Update on batched_background_migration_jobs (cost=0.29..3.31 rows=1 width=199) (actual time=15.755..15.756 rows=0 loops=1)
Buffers: shared hit=20 read=18 dirtied=7
I/O Timings: read=15.077
-> Index Scan using index_batched_jobs_on_batched_migration_id_and_status on batched_background_migration_jobs (cost=0.29..3.31 rows=1 width=199) (actual time=0.051..0.058 rows=2 loops=1)
Index Cond: ((batched_background_migration_id = 116) AND (status = 2))
Buffers: shared hit=9
Planning Time: 29.400 ms
Execution Time: 16.002 ms
(8 rows)
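For intuition, the sub-batching can be modeled in plain Ruby: a job's ID range is walked in slices of sub_batch_size, and each slice becomes one UPDATE statement. This is an illustrative stand-in, not GitLab's actual implementation:

```ruby
# Illustrative model of sub-batching: split a job's ID range into
# slices of sub_batch_size; each slice would be one UPDATE statement.
def sub_batches(min_id, max_id, sub_batch_size)
  (min_id..max_id).each_slice(sub_batch_size).map { |ids| (ids.first..ids.last) }
end

slices = sub_batches(1, 1_000, 200)
puts slices.size  # 5
puts slices.first # 1..200
```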
- Go to http://gitlab.com/admin/background_migrations?tab=failed
- Click "Retry" on the "BackfillIssueSearchData: issues" migration
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5
- Verify that the migration is completed in http://gitlab.com/admin/background_migrations?tab=finished
Note: it may take anywhere from a few minutes to a few hours for the jobs to finish
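Completion can also be confirmed from a Rails console instead of the admin UI. A sketch against GitLab's internal BatchedMigration model (the class and the with_status scope appear earlier in this plan; status_name is assumed to be available in the deployed version):

```ruby
# GitLab Rails console sketch (not standalone-runnable): inspect
# migration 116 directly rather than via the admin panel.
migration = Gitlab::Database::BackgroundMigration::BatchedMigration.find(116)
migration.status_name                             # expect :finished
migration.batched_jobs.with_status(:failed).count # expect 0
```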
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 5
- Disable the execute_batched_migrations_on_schedule feature flag
Note: I don't expect this change to affect anything, because 99% of this migration has already run on GitLab.com. But just in case there's any kind of DB degradation, we can disable background migrations.
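The rollback step in console form; Feature.disable is GitLab's standard feature-flag API, and the flag name comes from the rollback step above:

```ruby
# GitLab Rails console sketch (not standalone-runnable): stop the runner
# from scheduling batched background migrations.
Feature.disable(:execute_batched_migrations_on_schedule)
# Re-enable once the situation is resolved:
# Feature.enable(:execute_batched_migrations_on_schedule)
```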
Monitoring
Key metrics to observe
- Metric: General DB metrics
- Location: https://dashboards.gitlab.net/d/000000144/postgresql-overview?orgId=1
- What changes to this metric should prompt a rollback: Any kind of DB degradation
Summary of infrastructure changes
- Does this change introduce new compute instances? No
- Does this change re-size any existing compute instances? No
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No
Change Reviewer checklist
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change. If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
Change Technician checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (if needed! Cases include DB changes) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents.