Delete project and group runners belonging to no projects/groups
In #493256 (closed), we found out that more than half of project runners are not associated with any project, and 4% of group runners are not associated with any group:
We somehow need to remove these runners from the table, since to satisfy the check constraint, every row that is not an instance runner must have a `sharding_key_id`.
We can also see that we have very recent runners which are no longer attached to any owner, so it is not only a question of legacy runners:
One approach could be to try and hide these runners for all practical purposes from users. The job service would not consider these runners, the GraphQL queries would not return these runners, etc. Basically, always append `AND (runner_type = 1 OR sharding_key_id IS NOT NULL)` to runner queries. This would be controlled by temporary feature flags. We'd make the support team aware of this, and if no one complains for a milestone, then we run a batch migration to get rid of these runners so that we have a table that has sharding keys for all applicable runners.
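A minimal sketch of that filter as a scope on `Ci::Runner` (the scope name and feature-flag wiring are hypothetical; the real change would live in the relevant finders/services):

```ruby
# Hypothetical scope; assumes the sharding_key_id column from #493256 and the
# existing runner_type enum (1 = instance_type, 2 = group_type, 3 = project_type).
module Ci
  class Runner < Ci::ApplicationRecord
    scope :visible_for_sharding, -> do
      where(runner_type: :instance_type).or(where.not(sharding_key_id: nil))
    end
  end
end

# Gated at the call sites by the temporary feature flags, e.g.:
#   runners = runners.visible_for_sharding if Feature.enabled?(:hide_project_runners_without_sharding_key)
```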
However, I'm not convinced of the value of hiding the runners if we can't return HTTP 403 from the endpoints without triggering rate limiting. Maybe a better approach would be plan B or plan C below.
Context: !166916 (comment 2144491189)
Investigation
TL;DR: After cross-querying the Postgres and ClickHouse databases, the runners missing a project/group association (runners with no `sharding_key_id`) that have contacted the instance in the last month (2024-09-08 to 2024-10-08) have only executed jobs for 2 projects in a single top-level namespace, belonging to a Premium account which completed its trial period in September 2022.
Only ~4K of the 1M orphaned runners have been in contact with GitLab.com over the last 7 days (presumably because they were created in the last week and then the project/group got deleted, not necessarily because they are still active):
We can see that there are only 3 orphaned runners created in the last week (this will stop being a problem once #493256 (closed) is merged, since `sharding_key_id` will start being populated at record creation):
Also, there are no jobs from the last month that were picked up by an orphaned runner:
So while these runners are still actively requesting CI jobs, it might be acceptable to consider them orphaned and just delete them (we might need to include logic to return a success code from the jobs request endpoint if we can't get the customer to stop/replace those orphaned runners).
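For reference, the Postgres half of that investigation boils down to queries along these lines (a Rails console sketch; the ClickHouse job cross-check is not shown, and the counts will obviously drift over time):

```ruby
# Orphaned = non-instance runner (group/project type) with no sharding key.
orphaned = Ci::Runner.where(sharding_key_id: nil).where.not(runner_type: :instance_type)

orphaned.count                                        # ~1M orphaned runners in total
orphaned.where('contacted_at > ?', 7.days.ago).count  # only ~4K contacted us in the last 7 days
orphaned.where('created_at > ?', 1.week.ago).count    # 3 created in the last week
```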
Implementation plan
Plan A
- Optional: If we use feature flags:
  - Create 2 feature flags: `hide_group_runners_without_sharding_key` and `hide_project_runners_without_sharding_key`
  - Check for the feature flags in the following situations:
    - `Ci::Runner.find_by_token`
    - `Ci::RegisterJobService` would need special treatment to avoid triggering rate limiting for customers that have lots of orphaned runners. We could make it return `204 No jobs` if `sharding_key_id IS NULL` (see the sketch right after this list).
    - Other situations are probably not worth gating, such as:
      - GraphQL queries/mutations (e.g. `Resolvers::Ci::RunnerResolver`, `Ci::RunnersFinder`, etc.)
      - REST API modifications
- Wait for one or two weeks to see if anything out of the ordinary happens
- Create a batched background migration to delete all runners with `sharding_key_id IS NULL` and `runner_type IN (2, 3)` (see the migration sketch below)
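As a rough illustration of that special treatment, the guard on the job request path could look something like this (the helper and its placement are hypothetical; the actual early return would live wherever `Ci::RegisterJobService` / `POST /api/v4/jobs/request` resolves the runner):

```ruby
# Sketch only: short-circuit job requests from orphaned runners with a
# "no jobs" response instead of a 403, so the runner backs off normally
# and we don't trip rate limiting.
def orphaned_runner?(runner)
  return false if runner.instance_type?
  return false unless Feature.enabled?(:hide_group_runners_without_sharding_key) ||
    Feature.enabled?(:hide_project_runners_without_sharding_key)

  runner.sharding_key_id.nil?
end

# In the request handling:
#   no_content! if orphaned_runner?(current_runner) # respond with HTTP 204
```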
Draft MR: Draft: Hide orphaned runners (!168841 - closed)
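The final cleanup step could then be a batched background migration along these lines (class name hypothetical, following the usual `BatchedMigrationJob` pattern):

```ruby
# frozen_string_literal: true

# Hypothetical job: deletes group/project runners (runner_type 2/3) that still
# have no sharding_key_id once the hiding period has passed without complaints.
module Gitlab
  module BackgroundMigration
    class DeleteOrphanedCiRunners < BatchedMigrationJob
      operation_name :delete_orphaned_ci_runners
      scope_to ->(relation) { relation.where(sharding_key_id: nil, runner_type: [2, 3]) }
      feature_category :fleet_visibility # illustrative

      def perform
        each_sub_batch(&:delete_all)
      end
    end
  end
end
```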
Plan B
- Create a temporary `ci_orphaned_runners` table to which we'd move the project/group runners with no `sharding_key_id`.
- In endpoints such as `POST /api/v4/jobs/request` and `POST /api/v4/runners/verify`, we'd return a success status so as to not trigger rate limiting. The endpoints that remove runners would instead remove the respective entry from `ci_orphaned_runners`. Other endpoints/UI would not see the runners, as they've been removed (see the sketch after this list).
- Continue with the sharding work in `ci_runners`, which is now unblocked.
- In one or two milestones, drop the `ci_orphaned_runners` table (after having worked with customers to remove the orphaned runners).
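A sketch of the endpoint-side check for this plan, assuming a hypothetical `Ci::OrphanedRunner` model on top of the temporary table (keyed here by the runner token purely for illustration; the actual lookup would depend on which columns we copy over):

```ruby
# Hypothetical model over the temporary table.
module Ci
  class OrphanedRunner < Ci::ApplicationRecord
    self.table_name = 'ci_orphaned_runners'
  end
end

# In POST /api/v4/jobs/request and POST /api/v4/runners/verify, when the token
# no longer matches a row in ci_runners, consult the temporary table and answer
# with a success status instead of a 403 so the runner doesn't trip rate limits:
#
#   no_content! if Ci::OrphanedRunner.exists?(token: params[:token]) # HTTP 204
```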
Plan C
This is the simplest plan, but it would require improvements to the `Gitlab::BackgroundMigration::BackfillPartitionedTable` class. It would see us only syncing records with valid `sharding_key_id` values to the `ci_runners_e59bb2812d` partitioned table. The next phase would be to eventually replace the `ci_runners` table: as usual, we'd rename `ci_runners` to `ci_runners_archived` and start serving data from the new partitioned table. Any remaining orphaned runners would start seeing `403 Forbidden` errors, which means we'd need to work with the affected customers over the next 1-2 months to get those runners replaced. We could possibly build some observability that allows us to tell whether there are new top-level namespaces with orphaned runners. I believe that the list of orphaned runners in contact with the instance will keep getting shorter by itself, since runners created after Oct 1st will contain a `sharding_key_id` value and no longer count as orphaned.
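The filter this plan needs during the backfill is essentially the same predicate as in plan A; assuming `BackfillPartitionedTable` (or the migration enqueueing it) gained a way to scope the copied rows, it would look roughly like:

```ruby
# Sketch: rows eligible to be copied into the partitioned ci_runners_e59bb2812d
# table, i.e. rows that will satisfy the sharding-key check constraint.
# Wiring this into BackfillPartitionedTable is exactly the improvement this
# plan would require.
eligible_for_copy = ->(relation) do
  relation.where(runner_type: 1).or(relation.where.not(sharding_key_id: nil))
end
```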
If we wanted to avoid triggering rate limiting on the job request endpoint, we could consult the `ci_runners_archived` table and return `204 No jobs` in case `sharding_key_id` is NULL there. But this would just be kicking the can down the road. At some point, we need to pull the plug on these runners, as we don't have a reliable way of determining who is running them.