Add nullify job for orphan runner_id columns of ci_builds
What does this MR do and why?
When a runner gets deleted, jobs that were run with it still have runner_id
. This MR adds a background migration job to nullify old orphan runner_id
columns of ci_builds
.
There will be 3 steps (https://docs.gitlab.com/ee/development/database/add_foreign_key_to_existing_column.html);
- Add a NOT VALID foreign key constraint to the column to ensure GitLab doesn't create inconsistent records. <-- !80203 (merged)
- Add a data migration, to fix or clean up existing records. <-- You're here
- Validate the whole table by making the foreign key VALID. <-- will be in 14.10
This is the 2nd step.
Related comment: #30159 (comment 836146520)
Database
Up
== 20220223112304 ScheduleNullifyOrphanRunnerIdOnCiBuilds: migrating ==========
== 20220223112304 ScheduleNullifyOrphanRunnerIdOnCiBuilds: migrated (0.0513s) =
Down
== 20220223112304 ScheduleNullifyOrphanRunnerIdOnCiBuilds: reverting ==========
== 20220223112304 ScheduleNullifyOrphanRunnerIdOnCiBuilds: reverted (0.0000s) =
Update Query
UPDATE "ci_builds"
SET "runner_id" = NULL, "lock_version" = COALESCE("lock_version", 0) + 1
WHERE "ci_builds"."id" IN (
SELECT "ci_builds"."id"
FROM "ci_builds"
LEFT OUTER JOIN ci_runners ON ci_runners.id = ci_builds.runner_id
WHERE (ci_builds.runner_id IS NOT NULL AND ci_runners.id IS NULL)
AND "ci_builds"."id" BETWEEN 1 AND 1000
)
Total Time
- We have ~2B
ci_builds
rows. - With a 100k batch size, we'll have 20k batches.
- (20k * 2 min interval) / 60 min = ~666 hours.
- 666 hours / 24 = ~27 days.
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.
Related to #30159 (closed)
Merge request reports
Activity
changed milestone to %14.10
assigned to @furkanayhan
added database label
added backend label
marked the checklist item I have evaluated the MR acceptance checklist for this MR. as completed
changed milestone to %14.9
- A deleted user
added databasereview pending label
1 Warning New migrations added but db/structure.sql wasn't updated Usually, when adding new migrations, db/structure.sql should be
updated too (unless the migration isn't changing the DB schema
and isn't the most recent one).Reviewer roulette
Changes that require review have been detected!
Please refer to the table below for assigning reviewers and maintainers suggested by Danger in the specified category:
Category Reviewer Maintainer backend Jason Goodman ( @jagood
) (UTC-5, 8 hours behind@furkanayhan
)Ethan Urie ( @eurie
) (UTC-5, 8 hours behind@furkanayhan
)database Vitali Tatarintev ( @ck3g
) (UTC+1, 2 hours behind@furkanayhan
)Steve Abrams ( @sabrams
) (UTC-7, 10 hours behind@furkanayhan
)~migration No reviewer available No maintainer available To spread load more evenly across eligible reviewers, Danger has picked a candidate for each review slot, based on their timezone. Feel free to override these selections if you think someone else would be better-suited or use the GitLab Review Workload Dashboard to find other available reviewers.
To read more on how to use the reviewer roulette, please take a look at the Engineering workflow and code review guidelines. Please consider assigning a reviewer or maintainer who is a domain expert in the area of the merge request.
Once you've decided who will review this merge request, assign them as a reviewer! Danger does not automatically notify them for you.
Generated by
Danger@morefice Can you please do the first review? (
@ahegyi
for the maintainer review )requested review from @morefice
- Resolved by Furkan Ayhan
- Resolved by Furkan Ayhan
- Resolved by Max Orefice
- Resolved by Furkan Ayhan
Thanks, great work with this new migration
Left some questions, let me know what you think, back to you
removed review request for @morefice
requested review from @morefice
Database migrations
Migrations included in this change have been executed on gitlab.com data for testing purposes. For details, please see the migration testing pipeline (limited access).
Migration Type Total runtime Result DB size change 20220223112304 - ScheduleNullifyOrphanRunnerIdOnCiBuilds Post deploy 1.5 s +0.00 B Runtime Histogram for all migrations
Query Runtime Count 0 seconds - 0.01 seconds 0 0.01 seconds - 0.1 seconds 3 0.1 seconds - 1 second 1 1 second - 5 minutes 0 5 minutes + 0 Migration: 20220223112304 - ScheduleNullifyOrphanRunnerIdOnCiBuilds
- Type: Post deploy
- Duration: 1.5 s
- Database size change: +0.00 B
Query Calls Total Time Max Time Mean Time Rows INSERT INTO "batched_background_migrations" ("created_at", "updated_at", "max_value", "batch_size", "sub_batch_size", "interval", "status", "job_class_name", "table_name", "column_name", "job_arguments", "total_tuple_count", "max_batch_size") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13) RETURNING "id" /*application:test,db_config_name:main*/
1 4.7 ms 4.7 ms 4.7 ms 1 SELECT MAX("id")
FROM "ci_builds"
/*application:test,db_config_name:main*/1 2.4 ms 2.4 ms 2.4 ms 1 SELECT $1 AS one
FROM "batched_background_migrations"
WHERE "batched_background_migrations"."job_class_name" = $2 AND "batched_background_migrations"."table_name" = $3 AND "batched_background_migrations"."column_name" = $4 AND (job_arguments = $5)
LIMIT $6 /*application:test,db_config_name:main*/1 0.7 ms 0.7 ms 0.7 ms 0 SELECT $1 AS one
FROM "batched_background_migrations"
WHERE "batched_background_migrations"."job_arguments" = $2 AND "batched_background_migrations"."job_class_name" = $3 AND "batched_background_migrations"."table_name" = $4 AND "batched_background_migrations"."column_name" = $5
LIMIT $6 /*application:test,db_config_name:main*/1 0.0 ms 0.0 ms 0.0 ms 0 Histogram for ScheduleNullifyOrphanRunnerIdOnCiBuilds
Query Runtime Count 0 seconds - 0.01 seconds 0 0.01 seconds - 0.1 seconds 3 0.1 seconds - 1 second 1 1 second - 5 minutes 0 5 minutes + 0
Other migrations pending on GitLab.com
Migration Type Total runtime Result DB size change Clone Details
Clone ID Clone Created At Clone Data Timestamp Expected Removal Time database-testing-1070689
2022-03-03 06:44:46 UTC 2022-03-03 04:10:00 UTC 2022-03-03 18:45:42 +0000 Artifacts
Brought to you by gitlab-org/database-team/gitlab-com-database-testing. Epic
- A deleted user
added database-testing-automation label
requested review from @ahegyi
added databasereviewed label and removed database-testing-automation databasereview pending labels
removed review request for @morefice
@morefice
, thanks for approving this merge request.This is the first time the merge request is approved. To ensure full test coverage, a new pipeline has been started.
For more info, please refer to the following links:
- Resolved by Adam Hegyi
- Resolved by Adam Hegyi
added 442 commits
-
7c3e55a8...a88a243e - 440 commits from branch
master
- 044ec352 - Add nullify job for orphan runner_id columns of ci_builds
- 087608da - Remove condition from the batch
-
7c3e55a8...a88a243e - 440 commits from branch
mentioned in issue #30159 (closed)
added databaseapproved label and removed databasereviewed label
- Resolved by Furkan Ayhan
@grzesiek Can you please review this MR? Do you think ~27 days is an acceptable time for background migrations?
requested review from @grzesiek
- Resolved by Adam Hegyi
@furkanayhan the pipeline is failing here, but the failure seems to be unrelated. Can we double check that? Is it otherwise ready for merge?
removed review request for @grzesiek
- A deleted user
added database-testing-automation label
added 232 commits
-
e0a455e5...5996141e - 228 commits from branch
master
- fcac5e3e - Add nullify job for orphan runner_id columns of ci_builds
- 508e14ba - Remove condition from the batch
- 12fdb0fa - Increase the batch size
- 2cab634b - Move to queue_batched_background_migration
Toggle commit list-
e0a455e5...5996141e - 228 commits from branch
added 406 commits
-
2cab634b...f2dfcdc6 - 402 commits from branch
master
- 120c2968 - Add nullify job for orphan runner_id columns of ci_builds
- 6c250d18 - Remove condition from the batch
- 21f9f4f6 - Increase the batch size
- b3d5a9c8 - Move to queue_batched_background_migration
Toggle commit list-
2cab634b...f2dfcdc6 - 402 commits from branch
enabled an automatic merge when the pipeline for d50b9d31 succeeds
mentioned in commit 178b2725
added workflowstaging-canary label
added workflowstaging label and removed workflowstaging-canary label
added workflowcanary label and removed workflowstaging label
added workflowproduction label and removed workflowcanary label
mentioned in issue gitlab-com/gl-infra/production#6581 (closed)
From @acroitor's comment;
SELECT bm.max_value, bmj.max_value, bmj.batch_size FROM batched_background_migration_jobs bmj INNER JOIN batched_background_migrations bm ON bm.id = bmj.batched_background_migration_id WHERE batched_background_migration_id = 118 ORDER BY bmj.id DESC LIMIT 10; max_value | max_value | batch_size ------------+-----------+------------ 2161763599 | 11213813 | 128657 2161763599 | 11073331 | 132720 2161763599 | 10928943 | 142500 2161763599 | 10775360 | 150000 2161763599 | 10611543 | 150000 2161763599 | 10448444 | 150000 2161763599 | 10275265 | 150000 2161763599 | 10109116 | 150000 2161763599 | 9944673 | 150000 2161763599 | 9778060 | 150000 (10 rows)
SELECT bm.max_value, bmj.max_value, bmj.batch_size, bmj.status FROM batched_background_migration_jobs bmj INNER JOIN batched_background_migrations bm ON bm.id = bmj.batched_background_migration_id WHERE batched_background_migration_id = 118 AND bmj.status <> 3 ORDER BY bmj.id DESC; max_value | max_value | batch_size | status ------------+-----------+------------+-------- 2161763599 | 11213813 | 128657 | 1 2161763599 | 7120052 | 142500 | 2 2161763599 | 6315007 | 150000 | 2 2161763599 | 3571406 | 150000 | 2 2161763599 | 3400075 | 150000 | 2 2161763599 | 3229841 | 150000 | 2 2161763599 | 3063179 | 150000 | 2 2161763599 | 2894639 | 150000 | 2 2161763599 | 1413464 | 100000 | 2 (9 rows)
@furkanayhan I think we should also consider increasing the wait in between batches, since we want to limit the overall pressure on the WAL archiver.
We have this current settings;
SELECT * FROM batched_background_migrations WHERE id = 118; -[ RECORD 1 ]-----+-------------------------------- id | 118 created_at | 2022-03-04 00:34:36.992258+00 updated_at | 2022-03-14 17:26:53.844891+00 min_value | 1 max_value | 2161763599 batch_size | 117002 sub_batch_size | 1000 interval | 120 status | 0 job_class_name | NullifyOrphanRunnerIdOnCiBuilds batch_class_name | PrimaryKeyBatchingStrategy table_name | ci_builds column_name | id job_arguments | [] total_tuple_count | 2038662900 pause_ms | 100 max_batch_size | 150000
I had a chat with @igorwwwwwwwwwwwwwwwwwwww today and we think that it'd be sufficient to increase
pause_ms
from100
to500
and leave the batch size at it is.I'll create a change request for this.
@furkanayhan Looking at the duration of the jobs in the database, many of them were already close to or over the interval of 120 seconds even with the previous
pause_ms
of100
.If we increase the
pause_ms
, the optimizer will eventually scale back the batch size to fit better in the interval, which is the result we want, but it will take some time for that to happen. If the goal is to cap thebatch_size
at a safer value, we might just want to set it that way so it has an immediate effect.Also, looking at metrics for individual queries, many of the updates are taking in excess of 1 second, with some taking 2-3s+ per
sub_batch
. Some of that might be due to degraded query times due to the incident, but this was happening in earlier batches as well, so we might want to also consider decreasing thesub_batch
.@pbair Thank you for sharing these insights!
From all of these, WDYT of this summary;
- Decrease
batch_size
to 50_000 to get an immediate effect so that it adjusts to going higher automatically. - Increase
pause_ms
to 200. - Decrease
sub_batch_size
to 500.
Edited by Furkan Ayhan- Decrease
@furkanayhan Sounds good to me.
Created an MR to adjust those values for self-hosted ones: !83094 (merged)
Created a production change issue: gitlab-com/gl-infra/production#6626 (closed)
@furkanayhan did we change the migration running on production? I'm curious of new estimates on how long this can potentially run to get an idea of our upcoming migration timeframe. Thanks.
@acroitor We changed it via gitlab-com/gl-infra/production#6531 (closed) before it started running.
We estimated that it'd run ~28 days. I guess this shouldn't change since it'll adjust the batch size while running, right?
AFAIK, we didn't change
batch_size
, we only changedmax_batch_size
to allowbatch_size
increasing up to that point if possible. Right?
mentioned in merge request !83094 (merged)
mentioned in issue gitlab-com/gl-infra/production#6626 (closed)
added releasedcandidate label
added releasedpublished label and removed releasedcandidate label
mentioned in merge request kubitus-project/kubitus-installer!797 (merged)
mentioned in commit vishwin/freebsd-ports@c617954a
mentioned in issue #415724 (closed)