Clean up of stale runners at the group level
Release notes
One of the features at the core of GitLab CI's flexibility is that a user can easily create a project's CI build environment (Runner). So if users need custom packages in their build environment or need to experiment with more complex CI build workflows, they can do so on Runners that they install and register to a project. However, this flexibility also means that in larger organizations, users may not remove inactive runners from their projects. As a result, large numbers of stale runners accumulate in the database, i.e., runners that are no longer active but whose records persist.
In this release, the first iteration of the stale runner cleanup feature is available for GitLab Ultimate plans on GitLab SaaS and GitLab Self-Managed. Group maintainers can set the enable stale runners cleanup toggle switch to active. Once activated, a background cron worker will delete runners that have not contacted GitLab in the last three months.
The worker will delete only stale runners associated directly with the group. Runners associated with descendant groups or projects will not be deleted.
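The selection rule described above can be sketched as follows. This is a minimal illustration, not GitLab's implementation: the `Runner` record, its field names, and the `stale_group_runners` helper are hypothetical, "three months" is approximated as 90 days, and treating a never-contacted runner by its creation time is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

# Assumption: "three months" is approximated as 90 days.
STALE_AFTER = timedelta(days=90)


@dataclass
class Runner:
    """Hypothetical runner record; field names are illustrative only."""
    id: int
    owner_type: str                   # "group", "project", or "instance"
    owner_id: int
    contacted_at: Optional[datetime]  # None if the runner never contacted GitLab
    created_at: datetime


def stale_group_runners(runners: List[Runner], group_id: int, now: datetime) -> List[Runner]:
    """Return runners attached directly to `group_id` that count as stale."""
    cutoff = now - STALE_AFTER
    stale = []
    for runner in runners:
        # Skip runners owned by projects, the instance, or other groups
        # (descendant groups have their own ids, so they are skipped too).
        if runner.owner_type != "group" or runner.owner_id != group_id:
            continue
        # Assumption: a runner that never made contact is judged by its creation time.
        last_seen = runner.contacted_at or runner.created_at
        if last_seen < cutoff:
            stale.append(runner)
    return stale
```

For example, with `now` set to 1 June 2022, a group runner last seen in January would be selected, while a project runner with the same last-contact time would be left alone.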
Group view: enable stale runner clean-up toggle switch
Problem(s)
GitLab SaaS
As part of https://gitlab.com/gitlab-org/gitlab/-/issues/321368#note_689886009 I've identified that in most namespaces containing more than 1,000 runners, fewer than 1% of the runners have contacted GitLab.com recently. For example, there is a project with 250K+ runners and a namespace with 350K+ runners. This causes unnecessary load on the database and makes it harder to estimate the performance of a given query.
We've recently enabled the ci_runner_limits feature flag, which aims to keep a ceiling of 1000 runners per namespace/project. Still, a user can have 1000 runners registered but only be using 10 of them, so we should have a way of identifying this situation and automatically pruning the unused runners after a certain time.
Self Managed Runners
If the Runner host is deleted before the Runner is removed from the GitLab instance, the result is an orphaned Runner. An orphaned Runner is still listed in the database, attached to the instance, group, or project in GitLab, but is no longer available to process CI jobs.
Scenario 1: In one user's example, the use of AWS's Spot fleet to host runners resulted in several orphaned runners during periods where Amazon interrupted the instances due to price fluctuations.
Scenario 2: A bad Helm install of Runner resulted in many non-existent Runners registered and visible in the UI.
Proposal
MVC
- Implement a worker on GitLab SaaS to auto-remove stale runners attached to a namespace.
- Create background job - once invoked it will query the database to determine if there are runners that can be deleted.
- Create cron job.
- Create a log of the start and end times. A stretch goal is to add the list of runners included in the deletion job to the audit log.
- Add feature flag to roll out the MVC.
- The MVC solution will only work at the individual namespace level.
- Check GitLab Ultimate license in cron job.
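The MVC steps above can be sketched as a single job function. This is a rough illustration under stated assumptions, not the actual worker: `run_cleanup`, its parameters, the dictionary keys, and the logger name are all hypothetical, the feature flag and license checks are passed in as plain booleans, the database is stood in for by an in-memory list, and the 90-day threshold is an assumed approximation of the stale period.

```python
import logging
from datetime import datetime, timedelta, timezone

logger = logging.getLogger("stale_runner_cleanup")  # hypothetical logger name

# Assumption: runners are stale after 90 days without contact.
STALE_AFTER = timedelta(days=90)


def run_cleanup(namespace_id, runners, feature_flag_enabled, ultimate_licensed, now=None):
    """Delete stale runners of one namespace and return the deleted runner ids.

    `runners` is a list of dicts with keys "id", "namespace_id", and
    "contacted_at"; it stands in for the database and is modified in place.
    """
    now = now or datetime.now(timezone.utc)

    # Gate the job on the feature flag and the Ultimate license check.
    if not (feature_flag_enabled and ultimate_licensed):
        return []

    logger.info("cleanup started namespace=%s at=%s", namespace_id, now.isoformat())

    # Query step: find runners of this namespace only that are past the cutoff.
    cutoff = now - STALE_AFTER
    stale_ids = [
        r["id"]
        for r in runners
        if r["namespace_id"] == namespace_id and r["contacted_at"] < cutoff
    ]

    # Deletion step: drop the stale records from the in-memory "table".
    stale_set = set(stale_ids)
    runners[:] = [r for r in runners if r["id"] not in stale_set]

    finished = datetime.now(timezone.utc)
    logger.info(
        "cleanup finished namespace=%s at=%s deleted=%d",
        namespace_id, finished.isoformat(), len(stale_ids),
    )
    return stale_ids
```

A cron scheduler would call a function like this periodically per opted-in namespace. Returning the deleted ids leaves room for the stretch goal of recording them in the audit log.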
Note - The target milestone for shipping the MVC (behind a feature flag) is 15.0. Iteration 2 builds on the MVC. The iteration 2 tasks are required to deliver a fully functional feature to customers.
Iteration 2
- Add audit logging.
- Add an option to Admin Area > Runners to enable a user to configure and activate the stale_group_runners_prune_service
Note - For iteration 2 we need to consider runners of sub-groups as candidates for deletion.
Disclaimer
This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.