
Allow pruning of stale group runners

What does this MR do and why?


This MR is a follow-up to Add namespace_ci_cd_settings table (!86473 - merged) and implements a background cron worker that enables deleting stale group runners (that is, CI runners that haven't communicated with the GitLab instance in the last 3 months). The idea is for a follow-up MR to implement a GraphQL mutation that sets the flag to opt into this behavior (namespace_ci_cd_settings.allow_stale_runner_pruning).

NOTES

  1. The commits in this MR are individually reviewable.
  2. I don't have much experience developing Sidekiq jobs, so I'd appreciate extra attention to anything I may have missed there.
  3. The ci_cd_settings association will not exist in most cases, since this is a new table. I don't have much experience with this scenario in Rails, so I'm looking forward to suggestions on how best to approach it.
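
For context, the staleness rule above (no contact with the instance in the last 3 months) boils down to a simple predicate on `created_at`/`contacted_at`, matching the conditions in the DELETE query plan below. A plain-Ruby sketch (the `stale_runner?` helper and the 90-day constant are illustrative, not the actual model scope):

```ruby
require 'time'

STALE_AFTER = 90 * 24 * 60 * 60 # roughly 3 months, in seconds

# A runner is stale when it was created before the cutoff AND has either
# never contacted the instance or last contacted it before the cutoff.
def stale_runner?(created_at:, contacted_at:, now: Time.now)
  cutoff = now - STALE_AFTER
  created_at < cutoff && (contacted_at.nil? || contacted_at < cutoff)
end

now = Time.now
stale_runner?(created_at: now - STALE_AFTER * 2, contacted_at: nil, now: now)      # old, never contacted
stale_runner?(created_at: now - STALE_AFTER * 2, contacted_at: now - 60, now: now) # old, but recently contacted
```

Note that a recently created runner is never stale even if it has no `contacted_at`, which is why the group Runners page in the validation steps shows never-contacted runners alongside stale ones.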

Screenshots or screen recordings


1. Start with some stale runners
2. Enqueue the worker in http://gdk.test:3000/admin/background_jobs
3. Check that the stale runners are no longer there
4. Logs

How to set up and validate locally


These are manual steps (not using the Sidekiq dashboard):

  1. Ensure you have gitlab-runner installed on your machine.

  2. Register 200 runners against a group (e.g. gitlab-org; get the registration token from http://gdk.test:3000/groups/gitlab-org/-/runners). In this example we use hyperfine to repeat the command:

    $ brew install hyperfine
    $ hyperfine --min-runs 200 'gitlab-runner register -config /tmp/config.gdk.toml \
                    --executor "shell" \
                    --url "http://gdk.test:3000/" \
                    --description "Group test runner" \
                    --tag-list "shell,mac,gdk,test" \
                    --run-untagged="false" \
                    --locked="false" \
                    --access-level="not_protected" --non-interactive \
                    --registration-token="${GROUP_REGISTRATION_TOKEN}"'
  3. Change the created_at field for the last 100 runners in the GDK console, so that they are considered stale:

    > group = ::Group.find(21)
    > group.runners.limit(100).update_all(created_at: 4.months.ago)
    > group.runners.stale.count
    => 100
  4. The group Runners page should now list half never-contacted runners and half stale runners.

  5. Start the worker from the GDK console:

    > Ci::Runners::StaleGroupRunnersPruneCronWorker.new.perform
    => {:total_pruned=>100, :status=>:success}

    As expected, total_pruned is 100, matching the number of stale runners.
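
The result hash above can be reproduced with a toy version of the prune loop. This is a plain-Ruby sketch over in-memory hashes (`prune_stale` and the batch size are made up for illustration); the real worker deletes through SQL in limited batches, as the query plans below show:

```ruby
# In-memory stand-in for the prune loop: delete stale entries in fixed-size
# batches and accumulate the number of rows removed, mirroring the
# {:total_pruned=>..., :status=>:success} result returned by the worker.
def prune_stale(runners, batch_size: 100)
  total = 0
  loop do
    batch_ids = runners.select { |r| r[:stale] }.first(batch_size).map { |r| r[:id] }
    break if batch_ids.empty?

    runners.reject! { |r| batch_ids.include?(r[:id]) }
    total += batch_ids.size
  end
  { total_pruned: total, status: :success }
end

runners = (1..200).map { |id| { id: id, stale: id <= 100 } }
prune_stale(runners) # => { total_pruned: 100, status: :success }
```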

Database query plans

The findings in Draft: Test deleting stale CI runners (!74503 - closed) and https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5910 are relevant here, as the service logic to purge stale runners is very similar.

Check if any opted-in groups exist
SELECT 1 AS one
FROM "namespace_ci_cd_settings"
WHERE "namespace_ci_cd_settings"."allow_stale_runner_pruning" = TRUE
LIMIT 1
 Limit  (cost=0.00..0.06 rows=1 width=4) (actual time=0.014..0.014 rows=0 loops=1)
   I/O Timings: read=0.000 write=0.000
   ->  Seq Scan on public.namespace_ci_cd_settings  (cost=0.00..62.00 rows=1100 width=4) (actual time=0.005..0.006 rows=0 loops=1)
         Filter: namespace_ci_cd_settings.allow_stale_runner_pruning
         Rows Removed by Filter: 0
         I/O Timings: read=0.000 write=0.000

https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/10000/commands/35427

each_batch window start
SELECT "namespace_ci_cd_settings"."namespace_id"
FROM "namespace_ci_cd_settings"
WHERE "namespace_ci_cd_settings"."allow_stale_runner_pruning" = TRUE
ORDER BY "namespace_ci_cd_settings"."namespace_id" ASC
LIMIT 1
 Limit  (cost=0.12..0.15 rows=1 width=8) (actual time=0.004..0.004 rows=0 loops=1)
   Buffers: shared hit=1
   I/O Timings: read=0.000 write=0.000
   ->  Index Only Scan using index_cicd_settings_on_namespace_id_where_stale_pruning_enabled on public.namespace_ci_cd_settings  (cost=0.12..27.63 rows=1100 width=8) (actual time=0.002..0.003 rows=0 loops=1)
         Heap Fetches: 0
         Buffers: shared hit=1
         I/O Timings: read=0.000 write=0.000

https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/10000/commands/35428

each_batch window end
SELECT "namespace_ci_cd_settings"."namespace_id"
FROM "namespace_ci_cd_settings"
WHERE "namespace_ci_cd_settings"."allow_stale_runner_pruning" = TRUE
  AND "namespace_ci_cd_settings"."namespace_id" >= 1
ORDER BY "namespace_ci_cd_settings"."namespace_id" ASC
LIMIT 1 OFFSET 1000
 Limit  (cost=23.51..23.57 rows=1 width=8) (actual time=0.036..0.036 rows=0 loops=1)
   Buffers: shared hit=4
   I/O Timings: read=0.000 write=0.000
   ->  Index Only Scan using index_cicd_settings_on_namespace_id_where_stale_pruning_enabled on public.namespace_ci_cd_settings  (cost=0.14..23.51 rows=367 width=8) (actual time=0.034..0.034 rows=0 loops=1)
         Index Cond: (namespace_ci_cd_settings.namespace_id >= 21)
         Heap Fetches: 0
         Buffers: shared hit=4
         I/O Timings: read=0.000 write=0.000

https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/10000/commands/35431
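
The two window queries above implement `each_batch`-style windowing: take the smallest qualifying `namespace_id` as the window start, then look batch-size rows ahead (the `LIMIT 1 OFFSET 1000` query) to find the next window's start. A plain-Ruby sketch of that windowing over a sorted id list (`each_batch_windows` is an illustrative name, not the actual helper):

```ruby
# Compute each_batch-style windows over a list of ids: each window is
# [start, next_start), where next_start is found by looking batch_size
# rows ahead of the current start. The final window has a nil upper
# bound, i.e. it is unbounded, possibly partial.
def each_batch_windows(ids, batch_size:)
  sorted = ids.sort
  windows = []
  start = sorted.first
  until start.nil?
    rest = sorted.select { |id| id >= start }
    upper = rest[batch_size] # nil once fewer than batch_size rows remain
    windows << [start, upper]
    start = upper
  end
  windows
end

each_batch_windows([1, 2, 3, 5, 8, 13], batch_size: 4)
# => [[1, 8], [8, nil]]
```

This keyset approach only ever touches the partial index on `namespace_id`, which is why both plans show a cheap index-only scan regardless of table size.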

Delete runners from the window's groups
DELETE FROM "ci_runners"
WHERE "ci_runners"."id" IN (
    SELECT "ci_runners"."id"
    FROM "ci_runners"
      INNER JOIN "ci_runner_namespaces" ON "ci_runner_namespaces"."runner_id" = "ci_runners"."id"
    WHERE "ci_runner_namespaces"."namespace_id" IN (<1000 ids>)
      AND (ci_runners.created_at < '2022-02-09 16:16:31.457512'
        AND (ci_runners.contacted_at IS NULL
          OR ci_runners.contacted_at < '2022-02-09 16:16:31.457512'))
    LIMIT 5000)
 ModifyTable on public.ci_runners  (cost=30531.48..47241.86 rows=5000 width=34) (actual time=8.374..8.377 rows=0 loops=1)
   Buffers: shared hit=3005 read=6 dirtied=4
   I/O Timings: read=7.003 write=0.000
   ->  Nested Loop  (cost=30531.48..47241.86 rows=5000 width=34) (actual time=8.361..8.363 rows=0 loops=1)
         Buffers: shared hit=3005 read=6 dirtied=4
         I/O Timings: read=7.003 write=0.000
         ->  HashAggregate  (cost=30531.05..30581.05 rows=5000 width=32) (actual time=8.361..8.362 rows=0 loops=1)
               Group Key: "ANY_subquery".id
               Buffers: shared hit=3005 read=6 dirtied=4
               I/O Timings: read=7.003 write=0.000
               ->  Subquery Scan on ANY_subquery  (cost=0.86..30518.55 rows=5000 width=32) (actual time=8.331..8.332 rows=0 loops=1)
                     Buffers: shared hit=3005 read=6 dirtied=4
                     I/O Timings: read=7.003 write=0.000
                     ->  Limit  (cost=0.86..30468.55 rows=5000 width=4) (actual time=8.330..8.331 rows=0 loops=1)
                           Buffers: shared hit=3005 read=6 dirtied=4
                           I/O Timings: read=7.003 write=0.000
                           ->  Nested Loop  (cost=0.86..39237.15 rows=6439 width=4) (actual time=8.328..8.329 rows=0 loops=1)
                                 Buffers: shared hit=3005 read=6 dirtied=4
                                 I/O Timings: read=7.003 write=0.000
                                 ->  Index Scan using index_ci_runner_namespaces_on_namespace_id on public.ci_runner_namespaces  (cost=0.43..9472.16 rows=9137 width=4) (actual time=8.327..8.327 rows=0 loops=1)
                                       Index Cond: (ci_runner_namespaces.namespace_id = ANY ('{<1000 ids>}'::integer[]))
                                       Buffers: shared hit=3005 read=6 dirtied=4
                                       I/O Timings: read=7.003 write=0.000
                                 ->  Index Scan using ci_runners_pkey on public.ci_runners ci_runners_1  (cost=0.43..3.26 rows=1 width=4) (actual time=0.000..0.000 rows=0 loops=0)
                                       Index Cond: (ci_runners_1.id = ci_runner_namespaces.runner_id)
                                       Filter: ((ci_runners_1.created_at < '2022-02-09 16:16:31.457512'::timestamp without time zone) AND ((ci_runners_1.contacted_at IS NULL) OR (ci_runners_1.contacted_at < '2022-02-09 16:16:31.457512'::timestamp without time zone)))
                                       Rows Removed by Filter: 0
                                       I/O Timings: read=0.000 write=0.000
         ->  Index Scan using ci_runners_pkey on public.ci_runners  (cost=0.43..3.34 rows=1 width=10) (actual time=0.000..0.000 rows=0 loops=0)
               Index Cond: (ci_runners.id = "ANY_subquery".id)
                I/O Timings: read=0.000 write=0.000

https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/10000/commands/35435

## MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

* [x] I have evaluated the [MR acceptance checklist](https://docs.gitlab.com/ee/development/code_review.html#acceptance-checklist) for this MR.

## Links

- https://gitlab.com/gitlab-org/omnibus-gitlab/-/merge_requests/6094+
- https://gitlab.com/gitlab-org/charts/gitlab/-/merge_requests/2565+

Part of #361112
Edited by Miguel Rincon
