Skip to content
Snippets Groups Projects

Add nullify job for orphan runner_id columns of ci_builds

Merged Furkan Ayhan requested to merge 30159-fix-data-on-ci-builds-runner-id into master
2 unresolved threads

What does this MR do and why?

When a runner gets deleted, jobs that were run with it still have runner_id. This MR adds a background migration job to nullify old orphan runner_id columns of ci_builds.

There will be 3 steps (https://docs.gitlab.com/ee/development/database/add_foreign_key_to_existing_column.html);

  1. Add a NOT VALID foreign key constraint to the column to ensure GitLab doesn't create inconsistent records. <-- !80203 (merged)
  2. Add a data migration, to fix or clean up existing records. <-- You're here
  3. Validate the whole table by making the foreign key VALID. <-- will be in 14.10

This is the 2nd step.

Related comment: #30159 (comment 836146520)

Database

Up

== 20220223112304 ScheduleNullifyOrphanRunnerIdOnCiBuilds: migrating ==========
== 20220223112304 ScheduleNullifyOrphanRunnerIdOnCiBuilds: migrated (0.0513s) =

Down

== 20220223112304 ScheduleNullifyOrphanRunnerIdOnCiBuilds: reverting ==========
== 20220223112304 ScheduleNullifyOrphanRunnerIdOnCiBuilds: reverted (0.0000s) =

Update Query

UPDATE "ci_builds"
SET "runner_id" = NULL, "lock_version" = COALESCE("lock_version", 0) + 1
WHERE "ci_builds"."id" IN (
  SELECT "ci_builds"."id"
  FROM "ci_builds"
  LEFT OUTER JOIN ci_runners ON ci_runners.id = ci_builds.runner_id
  WHERE (ci_builds.runner_id IS NOT NULL AND ci_runners.id IS NULL)
    AND "ci_builds"."id" BETWEEN 1 AND 1000
)

Total Time

  • We have ~2B ci_builds rows.
  • With a 100k batch size, we'll have 20k batches.
  • (20k * 2 min interval) / 60 min = ~666 hours.
  • 666 hours / 24 = ~27 days.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #30159 (closed)

Edited by Furkan Ayhan

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Max Orefice
  • Max Orefice
  • Max Orefice removed review request for @morefice

    removed review request for @morefice

  • Furkan Ayhan changed the description

    changed the description

  • Furkan Ayhan requested review from @morefice

    requested review from @morefice

  • Database migrations

    Migrations included in this change have been executed on gitlab.com data for testing purposes. For details, please see the migration testing pipeline (limited access).

    Migration Type Total runtime Result DB size change
    20220223112304 - ScheduleNullifyOrphanRunnerIdOnCiBuilds Post deploy 1.5 s :white_check_mark: +0.00 B
    Runtime Histogram for all migrations
    Query Runtime Count
    0 seconds - 0.01 seconds 0
    0.01 seconds - 0.1 seconds 3
    0.1 seconds - 1 second 1
    1 second - 5 minutes 0
    5 minutes + 0

    Migration: 20220223112304 - ScheduleNullifyOrphanRunnerIdOnCiBuilds

    • Type: Post deploy
    • Duration: 1.5 s
    • Database size change: +0.00 B
    Query Calls Total Time Max Time Mean Time Rows
    INSERT INTO "batched_background_migrations" ("created_at", "updated_at", "max_value", "batch_size", "sub_batch_size", "interval", "status", "job_class_name", "table_name", "column_name", "job_arguments", "total_tuple_count", "max_batch_size") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13) RETURNING "id" /*application:test,db_config_name:main*/
    1 4.7 ms 4.7 ms 4.7 ms 1
    SELECT MAX("id")
    FROM "ci_builds"
    /*application:test,db_config_name:main*/
    1 2.4 ms 2.4 ms 2.4 ms 1
    SELECT $1 AS one
    FROM "batched_background_migrations"
    WHERE "batched_background_migrations"."job_class_name" = $2 AND "batched_background_migrations"."table_name" = $3 AND "batched_background_migrations"."column_name" = $4 AND (job_arguments = $5)
    LIMIT $6 /*application:test,db_config_name:main*/
    1 0.7 ms 0.7 ms 0.7 ms 0
    SELECT $1 AS one
    FROM "batched_background_migrations"
    WHERE "batched_background_migrations"."job_arguments" = $2 AND "batched_background_migrations"."job_class_name" = $3 AND "batched_background_migrations"."table_name" = $4 AND "batched_background_migrations"."column_name" = $5
    LIMIT $6 /*application:test,db_config_name:main*/
    1 0.0 ms 0.0 ms 0.0 ms 0
    Histogram for ScheduleNullifyOrphanRunnerIdOnCiBuilds
    Query Runtime Count
    0 seconds - 0.01 seconds 0
    0.01 seconds - 0.1 seconds 3
    0.1 seconds - 1 second 1
    1 second - 5 minutes 0
    5 minutes + 0

    Other migrations pending on GitLab.com
    Migration Type Total runtime Result DB size change

    Clone Details

    Clone ID Clone Created At Clone Data Timestamp Expected Removal Time
    database-testing-1070689 2022-03-03 06:44:46 UTC 2022-03-03 04:10:00 UTC 2022-03-03 18:45:42 +0000

    Artifacts


    Brought to you by gitlab-org/database-team/gitlab-com-database-testing. Epic

  • Max Orefice requested review from @ahegyi

    requested review from @ahegyi

  • Max Orefice approved this merge request

    approved this merge request

  • Max Orefice removed review request for @morefice

    removed review request for @morefice

  • :wave: @morefice, thanks for approving this merge request.

    This is the first time the merge request is approved. To ensure full test coverage, a new pipeline has been started.

    For more info, please refer to the following links:

  • Adam Hegyi
  • Adam Hegyi
  • Furkan Ayhan added 442 commits

    added 442 commits

    Compare with previous version

  • mentioned in issue #30159 (closed)

  • Adam Hegyi approved this merge request

    approved this merge request

  • added databaseapproved label and removed databasereviewed label

  • Furkan Ayhan added 1 commit

    added 1 commit

    Compare with previous version

  • Furkan Ayhan changed the description

    changed the description

  • Furkan Ayhan requested review from @grzesiek

    requested review from @grzesiek

  • I wonder if it would be better to wait with changes like this until we have partitioning, but this might not be soon enough :thinking:

  • Grzegorz Bizon approved this merge request

    approved this merge request

  • Grzegorz Bizon removed review request for @grzesiek

    removed review request for @grzesiek

  • Furkan Ayhan changed the description

    changed the description

  • Furkan Ayhan added 232 commits

    added 232 commits

    • e0a455e5...5996141e - 228 commits from branch master
    • fcac5e3e - Add nullify job for orphan runner_id columns of ci_builds
    • 508e14ba - Remove condition from the batch
    • 12fdb0fa - Increase the batch size
    • 2cab634b - Move to queue_batched_background_migration

    Compare with previous version

  • Adam Hegyi resolved all threads

    resolved all threads

  • Adam Hegyi added 406 commits

    added 406 commits

    • 2cab634b...f2dfcdc6 - 402 commits from branch master
    • 120c2968 - Add nullify job for orphan runner_id columns of ci_builds
    • 6c250d18 - Remove condition from the batch
    • 21f9f4f6 - Increase the batch size
    • b3d5a9c8 - Move to queue_batched_background_migration

    Compare with previous version

  • Adam Hegyi enabled an automatic merge when the pipeline for d50b9d31 succeeds

    enabled an automatic merge when the pipeline for d50b9d31 succeeds

  • merged

  • Adam Hegyi mentioned in commit 178b2725

    mentioned in commit 178b2725

  • added workflowcanary label and removed workflowstaging label

  • added workflowproduction label and removed workflowcanary label

    • Author Maintainer

      From @acroitor's comment;

      SELECT bm.max_value, bmj.max_value, bmj.batch_size
      FROM batched_background_migration_jobs bmj
      INNER JOIN batched_background_migrations bm ON bm.id = bmj.batched_background_migration_id
      WHERE batched_background_migration_id = 118
      ORDER BY bmj.id DESC
      LIMIT 10;
      
       max_value  | max_value | batch_size
      ------------+-----------+------------
       2161763599 |  11213813 |     128657
       2161763599 |  11073331 |     132720
       2161763599 |  10928943 |     142500
       2161763599 |  10775360 |     150000
       2161763599 |  10611543 |     150000
       2161763599 |  10448444 |     150000
       2161763599 |  10275265 |     150000
       2161763599 |  10109116 |     150000
       2161763599 |   9944673 |     150000
       2161763599 |   9778060 |     150000
      (10 rows)
      SELECT bm.max_value, bmj.max_value, bmj.batch_size, bmj.status
      FROM batched_background_migration_jobs bmj
      INNER JOIN batched_background_migrations bm ON bm.id = bmj.batched_background_migration_id
      WHERE batched_background_migration_id = 118 AND bmj.status <> 3
      ORDER BY bmj.id DESC;
      
       max_value  | max_value | batch_size | status
      ------------+-----------+------------+--------
       2161763599 |  11213813 |     128657 |      1
       2161763599 |   7120052 |     142500 |      2
       2161763599 |   6315007 |     150000 |      2
       2161763599 |   3571406 |     150000 |      2
       2161763599 |   3400075 |     150000 |      2
       2161763599 |   3229841 |     150000 |      2
       2161763599 |   3063179 |     150000 |      2
       2161763599 |   2894639 |     150000 |      2
       2161763599 |   1413464 |     100000 |      2
      (9 rows)
    • Author Maintainer

      What should we do?

      1. Lower the batch size?
      2. Retry the failed jobs?
      3. If we lower the batch size before retrying the failed jobs, can it be a problem?

      cc @ahegyi @pbair

    • @furkanayhan I think we should also consider increasing the wait in between batches, since we want to limit the overall pressure on the WAL archiver.

    • Author Maintainer

      We have this current settings;

      SELECT *
      FROM batched_background_migrations
      WHERE id = 118;
      
      -[ RECORD 1 ]-----+--------------------------------
      id                | 118
      created_at        | 2022-03-04 00:34:36.992258+00
      updated_at        | 2022-03-14 17:26:53.844891+00
      min_value         | 1
      max_value         | 2161763599
      batch_size        | 117002
      sub_batch_size    | 1000
      interval          | 120
      status            | 0
      job_class_name    | NullifyOrphanRunnerIdOnCiBuilds
      batch_class_name  | PrimaryKeyBatchingStrategy
      table_name        | ci_builds
      column_name       | id
      job_arguments     | []
      total_tuple_count | 2038662900
      pause_ms          | 100
      max_batch_size    | 150000

      I had a chat with @igorwwwwwwwwwwwwwwwwwwww today and we think that it'd be sufficient to increase pause_ms from 100 to 500 and leave the batch size at it is.

      I'll create a change request for this.

      WDYT? @ahegyi @pbair

    • @furkanayhan Looking at the duration of the jobs in the database, many of them were already close to or over the interval of 120 seconds even with the previous pause_ms of 100.

      If we increase the pause_ms, the optimizer will eventually scale back the batch size to fit better in the interval, which is the result we want, but it will take some time for that to happen. If the goal is to cap the batch_size at a safer value, we might just want to set it that way so it has an immediate effect.

      Also, looking at metrics for individual queries, many of the updates are taking in excess of 1 second, with some taking 2-3s+ per sub_batch. Some of that might be due to degraded query times due to the incident, but this was happening in earlier batches as well, so we might want to also consider decreasing the sub_batch.

    • Author Maintainer

      @pbair Thank you for sharing these insights!

      From all of these, WDYT of this summary;

      • Decrease batch_size to 50_000 to get an immediate effect so that it adjusts to going higher automatically.
      • Increase pause_ms to 200.
      • Decrease sub_batch_size to 500.

      cc @ahegyi @igorwwwwwwwwwwwwwwwwwwww

      Edited by Furkan Ayhan
    • @furkanayhan Sounds good to me.

    • Author Maintainer

      Created an MR to adjust those values for self-hosted ones: !83094 (merged)

    • Author Maintainer

      Created a production change issue: gitlab-com/gl-infra/production#6626 (closed)

    • Please register or sign in to reply
  • Furkan Ayhan mentioned in merge request !83094 (merged)

    mentioned in merge request !83094 (merged)

  • mentioned in issue #415724 (closed)

  • Please register or sign in to reply
    Loading