Retry `BackfillIssueSearchData` batched background migration
Production Change
Change Summary
4 jobs in the BackfillIssueSearchData migration failed due to statement timeouts.
Around 800k issue records still need to be updated by the migration.
gitlabhq_dblab=# SELECT COUNT(*), SUM(batch_size), status FROM batched_background_migration_jobs WHERE batched_background_migration_id = 116 GROUP BY status;
 count |   sum    | status
-------+----------+--------
     4 |   799293 |      2
   254 | 58557962 |      3
(2 rows)
These can be retried from the admin panel: https://docs.gitlab.com/ee/update/index.html#check-the-status-of-batched-background-migrations
UPDATE: after a few retries, only ~91k records need to be updated. We want to lower the sub-batch size before we retry again to reduce the chance of timeouts.
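To see why a smaller sub-batch size helps: each sub-batch maps to one UPDATE statement, and each statement must finish within the statement timeout. A self-contained back-of-the-envelope sketch in Ruby (only the ~91k figure comes from this issue; the previous sub-batch size of 1,000 is a hypothetical placeholder, not a value taken from the migration):

```ruby
# Rough arithmetic for sub-batch sizing. Only the remaining-record count
# comes from this issue; the old sub_batch_size of 1_000 is hypothetical.
remaining_records = 91_000
old_sub_batch = 1_000 # hypothetical previous value
new_sub_batch = 200   # value set in the change steps

old_statements = (remaining_records / old_sub_batch.to_f).ceil
new_statements = (remaining_records / new_sub_batch.to_f).ceil

puts "UPDATE statements at sub_batch_size=#{old_sub_batch}: #{old_statements}"
puts "UPDATE statements at sub_batch_size=#{new_sub_batch}: #{new_statements}"
# Each statement touches ~5x fewer rows, so it is much less likely to hit
# the statement timeout, at the cost of more round trips.
```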
Change Details
- Services Impacted - Service::Postgres, Service::GitLab Rails
- Change Technician - @gsgl
- Change Reviewer - @kwanyangu
- Time tracking - 20 minutes
- Downtime Component - N/A
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 1
- Set label change::in-progress on this issue
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 10
- In a Rails console:
Gitlab::Database::BackgroundMigration::BatchedJob.where(batched_background_migration_id: 116).with_status(:failed).count
# confirm that the above returns 2
Gitlab::Database::BackgroundMigration::BatchedJob.where(batched_background_migration_id: 116).with_status(:failed).update_all(sub_batch_size: 200)
The update query this generates is:
UPDATE "batched_background_migration_jobs"
SET "sub_batch_size" = 200
WHERE "batched_background_migration_jobs"."batched_background_migration_id" = 116
AND ("batched_background_migration_jobs"."status" IN (2))
Query plan on dblab:
Update on batched_background_migration_jobs (cost=0.29..3.31 rows=1 width=199) (actual time=15.755..15.756 rows=0 loops=1)
Buffers: shared hit=20 read=18 dirtied=7
I/O Timings: read=15.077
-> Index Scan using index_batched_jobs_on_batched_migration_id_and_status on batched_background_migration_jobs (cost=0.29..3.31 rows=1 width=199) (actual time=0.051..0.058 rows=2 loops=1)
Index Cond: ((batched_background_migration_id = 116) AND (status = 2))
Buffers: shared hit=9
Planning Time: 29.400 ms
Execution Time: 16.002 ms
(8 rows)
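For intuition, the sub-batching can be modeled in plain Ruby: a job's ID range is walked in slices of sub_batch_size, and each slice becomes one UPDATE statement. This is an illustrative stand-in, not GitLab's actual implementation:

```ruby
# Illustrative model of sub-batching: split a job's ID range into
# slices of sub_batch_size; each slice would be one UPDATE statement.
def sub_batches(min_id, max_id, sub_batch_size)
  (min_id..max_id).each_slice(sub_batch_size).map { |ids| (ids.first..ids.last) }
end

slices = sub_batches(1, 1_000, 200)
puts slices.size  # 5
puts slices.first # 1..200
```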
- Go to http://gitlab.com/admin/background_migrations?tab=failed
- Click "Retry" on the "BackfillIssueSearchData: issues" migration
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5
- Verify that the migration is completed in http://gitlab.com/admin/background_migrations?tab=finished
Note: it may take anywhere from a few minutes to a few hours for the jobs to finish
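Completion can also be confirmed from a Rails console instead of the admin UI. A sketch against GitLab's internal BatchedMigration model (the class and the with_status scope appear earlier in this plan; status_name is assumed to be available in the deployed version):

```ruby
# GitLab Rails console sketch (not standalone-runnable): inspect
# migration 116 directly rather than via the admin panel.
migration = Gitlab::Database::BackgroundMigration::BatchedMigration.find(116)
migration.status_name                             # expect :finished
migration.batched_jobs.with_status(:failed).count # expect 0
```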
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 5
- Disable the execute_batched_migrations_on_schedule feature flag
Note: I don't expect this change to affect anything, because 99% of this migration has already run on GitLab.com. But just in case there's any kind of DB degradation, we can disable background migrations.
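The rollback step in console form; Feature.disable is GitLab's standard feature-flag API, and the flag name comes from the rollback step above:

```ruby
# GitLab Rails console sketch (not standalone-runnable): stop the runner
# from scheduling batched background migrations.
Feature.disable(:execute_batched_migrations_on_schedule)
# Re-enable once the situation is resolved:
# Feature.enable(:execute_batched_migrations_on_schedule)
```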
Monitoring
Key metrics to observe
- Metric: General DB metrics
- Location: https://dashboards.gitlab.net/d/000000144/postgresql-overview?orgId=1
- What changes to this metric should prompt a rollback: Any kind of DB degradation
Summary of infrastructure changes
- Does this change introduce new compute instances? No
- Does this change re-size any existing compute instances? No
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No
Change Reviewer checklist
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change. If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
Change Technician checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (if needed! Cases include DB changes) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents.