Adjust batch_size, pause_ms and sub_batch_size of NullifyOrphanRunnerIdOnCiBuilds migration

Production Change

Change Summary

ref: gitlab-org/gitlab!81410 (comment 874179914)

Similar to #6531 (closed), we need to adjust NullifyOrphanRunnerIdOnCiBuilds batch parameters.

On GitLab.com, migration jobs took more than 2 minutes each with a 100k batch size and a 1k sub-batch size, and the migration also triggered incident #6581 (closed).

We therefore decided to restart with lower values for these parameters:

  • Decrease batch_size to 50_000 for an immediate effect; the framework can then adjust it upward automatically.
  • Increase pause_ms to 200.
  • Decrease sub_batch_size to 500.
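As a sanity check on the new values, here is a small arithmetic sketch (plain Ruby; the helper name is illustrative, not a GitLab API) of how much sleep time pause_ms adds to a single job, assuming the framework pauses between sub-batches:

```ruby
# Rough estimate of the pause overhead a single batched job accumulates,
# assuming the framework sleeps for pause_ms between sub-batches.
def pause_overhead_seconds(batch_size:, sub_batch_size:, pause_ms:)
  sub_batches = (batch_size.to_f / sub_batch_size).ceil
  sub_batches * pause_ms / 1000.0
end

# New parameters: 100 sub-batches of 500 rows, 200 ms pause each.
pause_overhead_seconds(batch_size: 50_000, sub_batch_size: 500, pause_ms: 200)
# => 20.0 (seconds of pausing per job)
```

Note that the old 100k/1k split also produced 100 sub-batches per job, so the larger pause mainly spreads the load over time; it is the smaller sub-batches that shorten each individual transaction.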

Change Details

  1. Services Impacted - GitLab Rails app
  2. Change Technician - @ahegyi
  3. Change Reviewer - @ahegyi, @pbair
  4. Time tracking - Time, in minutes, needed to execute all change steps, including rollback
  5. Downtime Component - None

Detailed steps for the change

Change Steps - steps to take to execute the change

Run this command in the Rails console:

 Gitlab::Database::BackgroundMigration::BatchedMigration.find(118).update!(batch_size: 50_000, pause_ms: 200, sub_batch_size: 500)
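To confirm the update took effect, the record can be read back in the same console and compared against the intended values. verify_params below is a hypothetical helper for illustration, not part of GitLab:

```ruby
# Intended settings from the change step above.
EXPECTED = { batch_size: 50_000, pause_ms: 200, sub_batch_size: 500 }.freeze

# Returns an empty hash when everything matches, otherwise a map of
# mismatched keys to their expected/actual values.
def verify_params(actual)
  EXPECTED.each_with_object({}) do |(key, want), mismatches|
    got = actual[key]
    mismatches[key] = { expected: want, actual: got } unless got == want
  end
end

# In the Rails console (environment-bound, shown here as a sketch):
#   m = Gitlab::Database::BackgroundMigration::BatchedMigration.find(118)
#   verify_params(batch_size: m.batch_size, pause_ms: m.pause_ms,
#                 sub_batch_size: m.sub_batch_size)
#   # => {} when the update applied cleanly
```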

Post-Change Steps - steps to take to verify the change

Do we need to do anything with the failed jobs? The query below lists the jobs that have not yet succeeded:

SELECT bm.max_value, bmj.max_value, bmj.batch_size, bmj.status
FROM batched_background_migration_jobs bmj
INNER JOIN batched_background_migrations bm ON bm.id = bmj.batched_background_migration_id
WHERE batched_background_migration_id = 118 AND bmj.status <> 3
ORDER BY bmj.id DESC;

 max_value  | max_value | batch_size | status
------------+-----------+------------+--------
 2161763599 |  11213813 |     128657 |      1
 2161763599 |   7120052 |     142500 |      2
 2161763599 |   6315007 |     150000 |      2
 2161763599 |   3571406 |     150000 |      2
 2161763599 |   3400075 |     150000 |      2
 2161763599 |   3229841 |     150000 |      2
 2161763599 |   3063179 |     150000 |      2
 2161763599 |   2894639 |     150000 |      2
 2161763599 |   1413464 |     100000 |      2
(9 rows)
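For reference, the numeric status column above can be decoded assuming GitLab's batched job enum at the time (pending: 0, running: 1, failed: 2, succeeded: 3); the mapping below is an illustrative sketch, not GitLab code:

```ruby
# Assumed enum for batched_background_migration_jobs.status.
JOB_STATUSES = { 0 => :pending, 1 => :running, 2 => :failed, 3 => :succeeded }.freeze

def decode_status(code)
  JOB_STATUSES.fetch(code, :unknown)
end
```

Under that assumption, the query (which filters out status 3, succeeded) shows one running job and eight failed jobs.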

Monitoring

Key metrics to observe

These changes will not affect the system immediately. The workspaces team and @ahegyi will monitor the execution of the background migration.

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc?

Summary of the above

Change Reviewer checklist

C4 C3 C2 C1:

  • The scheduled day and time of execution of the change is appropriate.
  • The change plan is technically accurate.
  • The change plan includes estimated timing values based on previous testing.
  • The change plan includes a viable rollback plan.
  • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
  • The change plan includes success measures for all steps/milestones during the execution.
  • The change adequately minimizes risk within the environment/service.
  • The performance implications of executing the change are well-understood and documented.
  • The specified metrics/monitoring dashboards provide sufficient visibility for the change. - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  • The change has a primary and secondary SRE with knowledge of the details available during the change window.

Change Technician checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • This Change Issue is linked to the appropriate Issue and/or Epic.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
  • There are currently no active incidents.
Edited by Adam Hegyi