Implement worker to prune stale group runners

Pedro Pombeiro requested to merge pedropombeiro/339525/1-add-worker into master

What does this MR do and why?

This MR implements a background service that deletes stale group runners (that is, CI runners that haven't communicated with the GitLab instance in the last 3 months). A follow-up MR is expected to implement a GraphQL mutation that calls this service.
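To make the staleness rule concrete, here is a minimal plain-Ruby sketch. It is not the GitLab scope itself: the `Runner` struct, the `stale?` helper and the `STALE_MONTHS` constant are hypothetical names for illustration. It also assumes, as the validation steps in this description suggest, that a never-contacted runner is judged by its `created_at` date.

```ruby
require "date"

# Hypothetical stand-in for a CI runner record; not GitLab's model.
Runner = Struct.new(:created_at, :contacted_at, keyword_init: true)

# The 3-month window comes from the MR description; the constant name is assumed.
STALE_MONTHS = 3

# Assumed rule: a runner is stale when it was created before the deadline
# and has either never contacted the instance or last did so before the deadline.
def stale?(runner, today: Date.today)
  deadline = today << STALE_MONTHS # Date#<< shifts the date back by N months
  return false if runner.created_at > deadline

  runner.contacted_at.nil? || runner.contacted_at <= deadline
end

today = Date.new(2022, 6, 1)
old_never_contacted = Runner.new(created_at: Date.new(2022, 2, 1), contacted_at: nil)
fresh_never_contacted = Runner.new(created_at: today, contacted_at: nil)

puts stale?(old_never_contacted, today: today)   # true
puts stale?(fresh_never_contacted, today: today) # false
```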

NOTE 1: This MR was modeled around the existing WebHooks::DestroyService service.

NOTE 2: I don't have much experience developing Sidekiq jobs, so I'd appreciate additional attention to aspects there that I may have missed.

#361112 (closed)

Screenshots or screen recordings

How to set up and validate locally

  1. Register 200 runners against a group (e.g. gitlab-org, get registration token from http://gdk.test:3000/groups/gitlab-org/-/runners):

    hyperfine --min-runs 200 'gitlab-runner register --config /tmp/config.gdk.toml \
                    --executor "shell" \
                    --url "http://gdk.test:3000/" \
                    --description "Group test runner" \
                    --tag-list "shell,mac,gdk,test" \
                    --run-untagged="false" \
                    --locked="false" \
                    --access-level="not_protected" --non-interactive \
                    --registration-token="${GROUP_REGISTRATION_TOKEN}"'
  2. In the GDK console, change the created_at field of 100 of the runners so that they are considered stale:

    > group = ::Group.find(21)
    > group.runners.limit(100).update_all(created_at: 4.months.ago)
    > group.runners.stale.count
    => 100
  3. The group Runners page should now list half of the runners as never contacted and the other half as stale.

  4. Start the worker from the GDK console:

    > Ci::Runners::StaleGroupRunnersPruneWorker.new.perform(User.first, group)
    => {:async=>false, :total_pruned=>100, :status=>:success}

    As expected, total_pruned is 100, which matches the count of stale runners. Because 100 is smaller than BATCH_SIZE, the work was done synchronously without going through Sidekiq. If we mark another 50 runners as stale and temporarily lower Ci::Runners::StaleGroupRunnersPruneService::BATCH_SIZE to 10, we should see 5 batches being executed in Sidekiq.

    > group.runners.limit(50).update_all(created_at: 4.months.ago)
    > group.runners.stale.count
    => 50
    > Ci::Runners::StaleGroupRunnersPruneWorker.new.perform(User.first, group)
    => {:async=>false, :total_pruned=>50, :status=>:success}
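The batching arithmetic described in step 4 can be sketched in plain Ruby. This is only an illustration under stated assumptions, not the service's code: `plan_prune` is a hypothetical helper, the `BATCH_SIZE` of 10 is the artificially lowered value from the step above, and the rule "anything that fits in a single batch runs synchronously" is an assumption about the boundary case.

```ruby
# Artificially lowered value used in the walkthrough; the real constant is not stated here.
BATCH_SIZE = 10

# Hypothetical helper: splits the stale runner IDs into batches and reports
# whether the work would need more than one batch (treated here as "async").
def plan_prune(stale_ids, batch_size: BATCH_SIZE)
  batches = stale_ids.each_slice(batch_size).to_a
  { async: batches.size > 1, batches: batches }
end

plan = plan_prune((1..50).to_a)
puts plan[:async]        # true
puts plan[:batches].size # 5
```

With 100 stale IDs and a batch size above 100, the same helper yields a single batch and `async: false`, which corresponds to the synchronous case in the first run.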

Database queries

The purging job in this MR closely follows the script created for https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5910, and tested in !74503 (closed). I'm happy to add more details or clarify things if needed.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Part of #19865 and #342605 (closed). Closes Implement worker to remove stale runners from G... (#361112 - closed)

Edited by Pedro Pombeiro