Implement worker to prune stale group runners (!84960)

What does this MR do and why?

This MR implements a background service that enables deleting stale group runners (that is, CI runners that haven't communicated with the GitLab instance in the last 3 months). The idea is for a follow-up MR to implement a GraphQL mutation that calls this.

NOTE 1: This MR was modeled around the existing WebHooks::DestroyService service.

NOTE 2: I don't have much experience developing Sidekiq jobs, so I'd appreciate additional attention to aspects there that I may have missed.

#361112 (closed)

Screenshots or screen recordings

How to set up and validate locally

Register 200 runners against a group (e.g. gitlab-org; the registration token is available at http://gdk.test:3000/groups/gitlab-org/-/runners):

```
hyperfine --min-runs 200 'gitlab-runner register --config /tmp/config.gdk.toml \
                --executor "shell" \
                --url "http://gdk.test:3000/" \
                --description "Group test runner" \
                --tag-list "shell,mac,gdk,test" \
                --run-untagged="false" \
                --locked="false" \
                --access-level="not_protected" --non-interactive \
                --registration-token="${GROUP_REGISTRATION_TOKEN}"'
```

Change the created_at field for 100 of the runners in the GDK console, so that they are considered stale:

```
> group = ::Group.find(21)
> group.runners.limit(100).update_all(created_at: 4.months.ago)
> group.runners.stale.count
=> 100
```
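
To illustrate why backdating created_at is enough to make never-contacted runners stale, here is a minimal plain-Ruby sketch. It assumes a runner counts as stale when it was created more than about three months ago and has either never contacted the instance or last did so more than about three months ago. The `Runner` struct, `STALE_TIMEOUT` value, and `stale?` helper are hypothetical stand-ins for illustration, not the actual model code, whose scope may differ.

```ruby
# Hypothetical stand-in for a runner record; not the actual Ci::Runner model.
Runner = Struct.new(:created_at, :contacted_at)

DAY = 24 * 60 * 60
STALE_TIMEOUT = 90 * DAY # assumption: roughly three months, in seconds

# Assumed staleness rule: old enough, and no contact within the timeout.
def stale?(runner, now: Time.now)
  cutoff = now - STALE_TIMEOUT
  return false unless runner.created_at <= cutoff

  runner.contacted_at.nil? || runner.contacted_at <= cutoff
end

now = Time.now
fresh_never_contacted  = Runner.new(now - DAY, nil)
old_never_contacted    = Runner.new(now - 120 * DAY, nil)
old_recently_contacted = Runner.new(now - 120 * DAY, now - DAY)

puts stale?(fresh_never_contacted, now: now)   # false
puts stale?(old_never_contacted, now: now)     # true
puts stale?(old_recently_contacted, now: now)  # false
```

Under this assumption, a freshly registered runner that never picked up a job only becomes stale once its creation date falls outside the timeout, which is what the update_all call above simulates.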

The group Runners page should now list half never-contacted runners and half stale runners.

Start the worker from the GDK console:
```
> Ci::Runners::StaleGroupRunnersPruneWorker.new.perform(User.first, group)
=> {:async=>false, :total_pruned=>100, :status=>:success}
```
As expected, total_pruned is 100, matching the count of stale runners. Because 100 is smaller than BATCH_SIZE, the work was done synchronously without going through Sidekiq. If we make another 50 runners stale and artificially change Ci::Runners::StaleGroupRunnersPruneService::BATCH_SIZE to 10, we should see 5 batches being executed in Sidekiq:
```
> group.runners.limit(50).update_all(created_at: 4.months.ago)
> group.runners.stale.count
=> 50
> Ci::Runners::StaleGroupRunnersPruneWorker.new.perform(User.first, group)
=> {:async=>false, :total_pruned=>50, :status=>:success}
```
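
The batch count expected in the second scenario can be checked with plain Ruby. This is only a sketch of the arithmetic: the integers stand in for stale runner IDs, and `each_slice` is used here for illustration rather than being the batching mechanism the service itself uses.

```ruby
# Illustrative only: plain integers stand in for stale runner IDs.
batch_size = 10
stale_runner_ids = (1..50).to_a

# Split the IDs into consecutive groups of at most batch_size elements.
batches = stale_runner_ids.each_slice(batch_size).to_a

puts batches.size        # 5
puts batches.first.size  # 10
```

With 50 stale runners and a batch size of 10, the split yields the 5 batches mentioned above.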

Database queries

The purging job in this MR closely follows the script created for https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5910, and tested in !74503 (closed). I'm happy to add more details or clarify things if needed.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Part of #19865 #342605 (closed) Closes Implement worker to remove stale runners from G... (#361112 - closed)

Edited May 03, 2022 by Pedro Pombeiro
