
Update sharding_key_id for project runners when original project is deleted

Context

Project runners have long been capable of being associated with multiple projects, without a clear idea of which project owns the runner. In GraphQL API: Extend `runnerUpdate` mutation to ... (#359958 - closed), we added the concept of a project runner owner, and converged on the simple idea of defaulting the owner to the first project that was associated with the runner (which can easily be determined by finding the ci_runner_projects join record with the lowest id value). If that project gets deleted, the next-oldest associated project automatically becomes the runner owner. This is not an ideal model, but it is what we've had for a long time, and something we're not ready to change at the moment. (We're discussing deleting a project's runners when the project is deleted.)

With the Cells 1.0 work, we were required to add a sharding_key_id value to each runner, which in the case of a project runner is the ID of the project that was first associated with the runner. This is very similar to the logic described above, and we took the opportunity to simplify the database queries that calculate a project owner at the same time. The problem arises when the project referred to by sharding_key_id is deleted: the value is currently not updated, and therefore Ci::Runner#owner returns nil, which doesn't match the previous behavior. This MR fixes that change in behavior, and also improves on it by deleting any runners that become orphaned for lack of other associated projects that could adopt them.
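
In SQL terms, the owner lookup boils down to picking the oldest join record. This is the same correlated subquery used by the update statements in the query plans below (the runner ID is a placeholder):

SELECT project_id
FROM ci_runner_projects
WHERE runner_id = <runner ID>
ORDER BY id ASC
LIMIT 1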

What does this MR do?

NOTE: This MR is probably best reviewed commit-by-commit

NOTE: The ci_runner_taggings table is updated in a follow-up MR

This MR adds logic to ensure that ci_runners.sharding_key_id is kept valid as much as possible when the associated project is deleted. Currently, the sharding_key_id value is left pointing to the deleted project, meaning that it can't be used to determine the owning organization, and Ci::Runner#owner returns nil. The prevailing logic so far has been to assume the owner is established by the associated ci_runner_projects record with the lowest ID, so this MR re-establishes that logic by subscribing to the ProjectDeletedEvent and calling a service that recalculates the new sharding_key_id for each affected runner and its associated ci_runner_machines records. If there is no other associated project to fall back to, we delete the runner to avoid leaving orphaned runners behind.
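
As a rough sketch of that wiring (the worker and service names here are illustrative and may not match the diff exactly):

# Illustrative sketch only; the actual class/service names in the MR may differ.
module Ci
  module Runners
    class ProcessProjectDeletedEventWorker
      include Gitlab::EventStore::Subscriber

      def handle_event(event)
        # Rollout is gated behind the feature flag described below.
        return unless Feature.enabled?(:update_project_runners_owner)

        # Recomputes sharding_key_id from the oldest remaining ci_runner_projects
        # row for each affected runner, and deletes runners left with none.
        UpdateProjectRunnersOwnerService.new(event.data[:project_id]).execute
      end
    end
  end
end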

This logic is behind a feature flag and will only be enabled once !176865 (merged), which adds the required indexes, has been deployed to production on GitLab.com, as suggested in https://docs.gitlab.com/ee/development/database/adding_database_indexes.html#add-an-index-to-support-new-or-updated-queries.

References

Please include cross links to any resources that are relevant to this MR. This will give reviewers and future readers helpful context to give an efficient review of the changes introduced.

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

How to set up and validate locally

Numbered steps to set up and validate the change are strongly suggested.

  1. In a Rails console, fully enable the feature flag:

    Feature.enable(:update_project_runners_owner)
  2. Go to http://gdk.test:3000/gitlab-org and create a new to-be-deleted project.

  3. Go to http://gdk.test:3000/gitlab-org/to-be-deleted/-/runners/new to create a new project runner (remember to check Run untagged jobs).

  4. Repeat step 3 to create a second runner, taking note of the runner IDs from the address bar.

  5. Go to a different project, e.g. http://gdk.test:3000/gitlab-org/gitlab-test/-/settings/ci_cd?expand_runners=true#js-runners-settings

  6. Locate the last runner you created.

    • If the runner is not listed under the Other available runners section, the background job that syncs new projects to the CI database may be stuck in the GDK.
    • In that case, run gdk restart rails-background-jobs and give it a minute to run the worker job.

  7. After locating the runner, click Enable for this project.

  8. At this point we have one runner associated only with the to-be-deleted project, and a second runner associated with 2 projects. We can verify this in the SQL console:

    gitlabhq_development_ci> SELECT id, sharding_key_id FROM ci_runners WHERE runner_type = 3 ORDER BY id DESC LIMIT 2
    +------+-----------------+
    | id   | sharding_key_id |
    |------+-----------------|
    | 2197 | 96              |
    | 2196 | 96              |
    +------+-----------------+
    SELECT 2
    
    gitlabhq_development_ci> SELECT * FROM ci_runner_projects ORDER BY id DESC LIMIT 3
    +------+-----------+----------------------------+----------------------------+------------+
    | id   | runner_id | created_at                 | updated_at                 | project_id |
    |------+-----------+----------------------------+----------------------------+------------|
    | 1802 | 2197      | 2025-01-03 17:58:20.381519 | 2025-01-03 17:58:20.381519 | 2          |
    | 1801 | 2197      | 2025-01-03 17:44:09.877662 | 2025-01-03 17:44:09.877662 | 96         |
    | 1800 | 2196      | 2025-01-03 17:42:56.487749 | 2025-01-03 17:42:56.487749 | 96         |
    +------+-----------+----------------------------+----------------------------+------------+
    SELECT 3
    Time: 0.003s
  9. Now let's delete the project we just created, and verify that the runner's sharding_key_id is updated to point to the gitlab-org/gitlab-test project. Call the following code from the Rails console:

    ::Projects::DestroyService.new(Project.last, User.first, {}).execute

If we run the SQL queries again, we can see that the runner that was associated with 2 projects was re-associated to the other project, while the first runner (which had only 1 associated project, the to-be-deleted one) kept the same sharding_key_id, since there is no better ID to assign it. One option here would be to implicitly delete any runners that are no longer associated with any projects.

gitlabhq_development_ci> SELECT id, sharding_key_id FROM ci_runners WHERE runner_type = 3 ORDER BY id DESC LIMIT 2
+------+-----------------+
| id   | sharding_key_id |
|------+-----------------|
| 2197 | 2               |
| 2196 | 96              |
+------+-----------------+
SELECT 2
Time: 0.006s
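
We can also check the outcome from the Rails console, since Ci::Runner#owner resolves through sharding_key_id (runner IDs taken from this example, output abridged):

Ci::Runner.find(2197).owner # => the gitlab-org/gitlab-test project
Ci::Runner.find(2196).owner # => nil, since its sharding_key_id still points at the deleted project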

Database query plans

I've selected the project with the most associated runners on GitLab.com:

gitlabhq_dblab> SELECT sharding_key_id, COUNT(*) AS count FROM ci_runners WHERE runner_type = 3 AND sharding_key_id IS NOT NULL GROUP BY sharding_key_id ORDER BY count DESC LIMIT 1
+-----------------+-------+
| sharding_key_id | count |
|-----------------+-------|
| 24519238        | 16615 |
+-----------------+-------+
SELECT 1
Time: 79.921s (1 minute 19 seconds), executed in: 79.917s (1 minute 19 seconds)

Let's change the sharding_key_id to another value to simulate a deleted project; the SQL commands below will then update the sharding_key_id back to the original value:

exec UPDATE ci_runners SET sharding_key_id = 100000000 WHERE runner_type = 3 AND sharding_key_id = 24519238;
exec UPDATE ci_runner_machines SET sharding_key_id = 100000000 WHERE runner_type = 3 AND sharding_key_id = 24519238;
Ci::RunnerProject.belonging_to_project(@project_id).each_batch { |batch| batch.delete_all }

https://console.postgres.ai/gitlab/gitlab-production-ci/sessions/35098/commands/108828

DELETE FROM ci_runner_projects
WHERE project_id = 24519238
  AND id >= 5911679
  AND id < 14340370
Time: 3.361 s  
  - planning: 0.563 ms  
  - execution: 3.360 s  
    - I/O read: 3.313 s  
    - I/O write: 0.000 ms  
  
Shared buffers:  
  - hits: 1657 (~12.90 MiB) from the buffer pool  
  - reads: 7497 (~58.60 MiB) from the OS file cache, including disk I/O  
  - dirtied: 1065 (~8.30 MiB)  
  - writes: 0  
each_batch(of: 1000)

Finding lower ID of window: https://postgres.ai/console/gitlab/gitlab-production-ci/sessions/35098/commands/108829

SELECT "ci_runners".id
FROM "ci_runners"
WHERE "ci_runners"."runner_type" = 3
  AND "ci_runners"."sharding_key_id" = 100000000
ORDER BY id
LIMIT 1
Time: 34.412 ms  
  - planning: 1.332 ms  
  - execution: 33.080 ms  
    - I/O read: 0.000 ms  
    - I/O write: 0.000 ms  
  
Shared buffers:  
  - hits: 6228 (~48.70 MiB) from the buffer pool  
  - reads: 0 from the OS file cache, including disk I/O  
  - dirtied: 0  
  - writes: 0  

Finding highest ID in window: https://postgres.ai/console/gitlab/gitlab-production-ci/sessions/35098/commands/108830

SELECT "ci_runners".id
FROM "ci_runners"
WHERE "ci_runners"."runner_type" = 3
  AND "ci_runners"."sharding_key_id" = 100000000
  AND id >= 14453683
ORDER BY id OFFSET 1000
LIMIT 1
Time: 23.701 ms  
  - planning: 1.419 ms  
  - execution: 22.282 ms  
    - I/O read: 0.000 ms  
    - I/O write: 0.000 ms  
  
Shared buffers:  
  - hits: 6228 (~48.70 MiB) from the buffer pool  
  - reads: 0 from the OS file cache, including disk I/O  
  - dirtied: 0  
  - writes: 0  
Ci::Runner.id_in(runner_ids).update_all(update_query)

https://postgres.ai/console/gitlab/gitlab-production-ci/sessions/35098/commands/108831

UPDATE
  "ci_runners"
SET sharding_key_id = (
    SELECT "ci_runner_projects"."project_id"
    FROM "ci_runner_projects"
    WHERE "ci_runner_projects"."runner_id" = "ci_runners"."id"
    ORDER BY "ci_runner_projects"."id" ASC
    LIMIT 1)
WHERE "ci_runners"."id" IN (<1000 runner IDs...>)
  AND (EXISTS (
      SELECT 1
      FROM "ci_runner_projects"
      WHERE "ci_runner_projects"."runner_id" = "ci_runners"."id"
      LIMIT 1))
Time: 236.345 ms  
  - planning: 2.331 ms  
  - execution: 234.014 ms  
    - I/O read: 8.041 ms  
    - I/O write: 0.130 ms  
  
Shared buffers:  
  - hits: 68640 (~536.30 MiB) from the buffer pool  
  - reads: 174 (~1.40 MiB) from the OS file cache, including disk I/O  
  - dirtied: 112 (~896.00 KiB)  
  - writes: 2 (~16.00 KiB)  
runners_with_fallback_owner.limit(BATCH_SIZE).pluck_primary_key

https://postgres.ai/console/gitlab/gitlab-production-ci/sessions/35098/commands/108832

SELECT "ci_runners"."id"
FROM "ci_runners"
WHERE "ci_runners"."id" IN (<1000 runner IDs...>)
  AND (EXISTS (SELECT 1 FROM "ci_runner_projects" WHERE "ci_runner_projects"."runner_id" = "ci_runners"."id" LIMIT 1))
Time: 9.817 ms  
  - planning: 2.428 ms  
  - execution: 7.389 ms  
    - I/O read: 0.000 ms  
    - I/O write: 0.000 ms  
  
Shared buffers:  
  - hits: 8579 (~67.00 MiB) from the buffer pool  
  - reads: 0 from the OS file cache, including disk I/O  
  - dirtied: 0  
  - writes: 0  
Ci::RunnerManager.project_type.for_runner(runner_ids).update_all(update_query)

https://postgres.ai/console/gitlab/gitlab-production-ci/sessions/35098/commands/108833

UPDATE
  "ci_runner_machines"
SET sharding_key_id = (
    SELECT "ci_runner_projects"."project_id"
    FROM "ci_runner_projects"
    WHERE "ci_runner_projects"."runner_id" = "ci_runner_machines"."runner_id"
    ORDER BY "ci_runner_projects"."id" ASC
    LIMIT 1)
WHERE "ci_runner_machines"."runner_type" = 3
  AND "ci_runner_machines"."runner_id" IN (
    SELECT "ci_runners"."id"
    FROM "ci_runners"
    WHERE "ci_runners"."id" (<1000 runner IDs...>))
Time: 70.655 ms  
  - planning: 3.150 ms  
  - execution: 67.505 ms  
    - I/O read: 62.445 ms  
    - I/O write: 0.000 ms  
  
Shared buffers:  
  - hits: 7655 (~59.80 MiB) from the buffer pool  
  - reads: 92 (~736.00 KiB) from the OS file cache, including disk I/O  
  - dirtied: 0  
  - writes: 0  
runners_missing_owner_project.project_type.with_sharding_key(project_id).delete_all

https://postgres.ai/console/gitlab/gitlab-production-ci/sessions/35098/commands/108834

DELETE FROM "ci_runners"
WHERE "ci_runners"."id" IN (<1000 runner IDs...>)
  AND (NOT EXISTS (
      SELECT 1
      FROM "ci_runner_projects"
      WHERE "ci_runner_projects"."runner_id" = "ci_runners"."id"))
  AND "ci_runners"."runner_type" = 3
  AND "ci_runners"."sharding_key_id" = 100000000
Time: 369.528 ms  
  - planning: 2.474 ms  
  - execution: 367.054 ms  
    - I/O read: 0.000 ms  
    - I/O write: 0.000 ms  
  
Shared buffers:  
  - hits: 12241 (~95.60 MiB) from the buffer pool  
  - reads: 0 from the OS file cache, including disk I/O  
  - dirtied: 0  
  - writes: 0  
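
Putting the profiled statements together, the service's batch loop looks roughly as follows (a sketch stitched from the fragments above; scope definitions are simplified and may not match the diff exactly):

# Illustrative outline pieced together from the statements profiled above;
# in the real code each scope is tied to the current batch.
def execute
  # Remove the join records that pointed at the deleted project.
  Ci::RunnerProject.belonging_to_project(project_id).each_batch { |batch| batch.delete_all }

  # Walk the affected runners (project runners whose sharding_key_id is the
  # deleted project's ID) in batches of 1000; each_batch issues the
  # window-boundary queries shown above.
  affected_runners.each_batch(of: 1000) do |batch|
    # Runners that still have at least one other associated project...
    runner_ids = runners_with_fallback_owner.limit(BATCH_SIZE).pluck_primary_key

    # ...are repointed to their oldest remaining ci_runner_projects row,
    # both on ci_runners and on their ci_runner_machines records.
    Ci::Runner.id_in(runner_ids).update_all(update_query)
    Ci::RunnerManager.project_type.for_runner(runner_ids).update_all(update_query)

    # Runners with no project left to adopt them are deleted.
    runners_missing_owner_project.project_type.with_sharding_key(project_id).delete_all
  end
end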