Delete project and group runners belonging to no projects/groups
In #493256 (closed), we found out that more than half of project runners are not associated with any project, and 4% of group runners are not associated with any group:
We somehow need to remove these runners from the table, since to satisfy the check constraint, every row that is not an instance runner must have a `sharding_key_id`.
We can also see that we have very recent runners which are no longer attached to any owner, so it is not only a question of legacy runners:
One approach could be to try and hide these runners for all practical purposes from users. The job service would not consider these runners, the GraphQL queries would not return these runners, etc. Basically, always append `AND (runner_type = 1 OR sharding_key_id IS NOT NULL)` to runner queries. This would be controlled by temporary feature flags. We'd make the support team aware of this, and if no one complains for a milestone, then we run a batch migration to get rid of these runners so that we have a table that has sharding keys for all applicable runners.
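A minimal sketch of that filter as a scope on `Ci::Runner` (the scope name and feature-flag wiring are hypothetical; the real change would live in the relevant finders/services):

```ruby
# Hypothetical scope; assumes the sharding_key_id column from #493256 and the
# existing runner_type enum (1 = instance_type, 2 = group_type, 3 = project_type).
module Ci
  class Runner < Ci::ApplicationRecord
    scope :visible_for_sharding, -> do
      where(runner_type: :instance_type).or(where.not(sharding_key_id: nil))
    end
  end
end

# Gated at the call sites by the temporary feature flags, e.g.:
#   runners = runners.visible_for_sharding if Feature.enabled?(:hide_project_runners_without_sharding_key)
```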
However, I'm not convinced of the value of hiding the runners if we can't return HTTP 403 from the endpoints without triggering rate limiting. Maybe a better approach would be plan B or plan C below.
Context: !166916 (comment 2144491189)
Investigation
TL;DR: After cross-querying the Postgres and ClickHouse databases, the runners missing a project/group association (runners with no `sharding_key_id`) that have contacted the instance in the last month (2024-09-08 to 2024-10-08) have only executed jobs for 2 projects in a single top-level namespace, belonging to a Premium account which completed its trial period in September 2022.
Only ~4K of the 1M orphaned runners have been in contact with GitLab.com over the last 7 days (presumably because they were created in the last week and then the project/group got deleted, not necessarily because they are still active):
We can see that there are only 3 orphaned runners created in the last week (this will stop being a problem once #493256 (closed) is merged, since `sharding_key_id` will start being populated at record creation):
Also, there are no jobs from the last month that were picked up by an orphaned runner:
So while these runners are still actively requesting CI jobs, it might be acceptable to consider them orphaned and just delete them (we might need to include logic to return a success code from the jobs request endpoint if we can't get the customer to stop/replace those orphaned runners).
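For reference, the Postgres half of that investigation boils down to queries along these lines (a Rails console sketch; the ClickHouse job cross-check is not shown, and the counts will obviously drift over time):

```ruby
# Orphaned = non-instance runner (group/project type) with no sharding key.
orphaned = Ci::Runner.where(sharding_key_id: nil).where.not(runner_type: :instance_type)

orphaned.count                                        # ~1M orphaned runners in total
orphaned.where('contacted_at > ?', 7.days.ago).count  # only ~4K contacted us in the last 7 days
orphaned.where('created_at > ?', 1.week.ago).count    # 3 created in the last week
```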
Implementation plan
Plan A
- Optional: If we use feature flags:
  - Create 2 feature flags: `hide_group_runners_without_sharding_key` and `hide_project_runners_without_sharding_key`
  - Check for the feature flags in the following situations:
    - `Ci::Runner.find_by_token`
    - `Ci::RegisterJobService` would need special treatment to avoid triggering rate limiting for customers that have lots of orphaned runners. We could make it return `204 No jobs` if `sharding_key_id IS NULL` (see the sketch right after this list).
    - Other situations are probably not worth gating, such as:
      - GraphQL queries/mutations (e.g. `Resolvers::Ci::RunnerResolver`, `Ci::RunnersFinder`, etc.)
      - REST API modifications
- Wait for one or two weeks to see if anything out of the ordinary happens
- Create a batched background migration to delete all runners with `sharding_key_id IS NULL` and `runner_type IN (2, 3)` (see the migration sketch below)
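As a rough illustration of that special treatment, the guard on the job request path could look something like this (the helper and its placement are hypothetical; the actual early return would live wherever `Ci::RegisterJobService` / `POST /api/v4/jobs/request` resolves the runner):

```ruby
# Sketch only: short-circuit job requests from orphaned runners with a
# "no jobs" response instead of a 403, so the runner backs off normally
# and we don't trip rate limiting.
def orphaned_runner?(runner)
  return false if runner.instance_type?
  return false unless Feature.enabled?(:hide_group_runners_without_sharding_key) ||
    Feature.enabled?(:hide_project_runners_without_sharding_key)

  runner.sharding_key_id.nil?
end

# In the request handling:
#   no_content! if orphaned_runner?(current_runner) # respond with HTTP 204
```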
Draft MR: Draft: Hide orphaned runners (!168841 - closed)
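The final cleanup step could then be a batched background migration along these lines (class name hypothetical, following the usual `BatchedMigrationJob` pattern):

```ruby
# frozen_string_literal: true

# Hypothetical job: deletes group/project runners (runner_type 2/3) that still
# have no sharding_key_id once the hiding period has passed without complaints.
module Gitlab
  module BackgroundMigration
    class DeleteOrphanedCiRunners < BatchedMigrationJob
      operation_name :delete_orphaned_ci_runners
      scope_to ->(relation) { relation.where(sharding_key_id: nil, runner_type: [2, 3]) }
      feature_category :fleet_visibility # illustrative

      def perform
        each_sub_batch(&:delete_all)
      end
    end
  end
end
```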
Plan B
- Create a temporary `ci_orphaned_runners` table to which we'd move the project/group runners with no `sharding_key_id`.
- In endpoints such as `POST /api/v4/jobs/request` and `POST /api/v4/runners/verify`, we'd return a success status so as to not trigger rate limiting. The endpoints that remove runners would instead remove the respective entry from `ci_orphaned_runners`. Other endpoints/UI would not see the runners, as they've been removed (see the sketch after this list).
- Continue with the sharding work in `ci_runners`, which is now unblocked.
- In one or two milestones, drop the `ci_orphaned_runners` table (after having worked with customers to remove the orphaned runners).
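A sketch of the endpoint-side check for this plan, assuming a hypothetical `Ci::OrphanedRunner` model on top of the temporary table (keyed here by the runner token purely for illustration; the actual lookup would depend on which columns we copy over):

```ruby
# Hypothetical model over the temporary table.
module Ci
  class OrphanedRunner < Ci::ApplicationRecord
    self.table_name = 'ci_orphaned_runners'
  end
end

# In POST /api/v4/jobs/request and POST /api/v4/runners/verify, when the token
# no longer matches a row in ci_runners, consult the temporary table and answer
# with a success status instead of a 403 so the runner doesn't trip rate limits:
#
#   no_content! if Ci::OrphanedRunner.exists?(token: params[:token]) # HTTP 204
```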
Plan C
This is the simplest plan, but it would require improvements to the `Gitlab::BackgroundMigration::BackfillPartitionedTable` class. It would see us only syncing records with valid `sharding_key_id` values to the `ci_runners_e59bb2812d` partitioned table. The next phase would be to eventually replace the `ci_runners` table: as usual, we'd rename `ci_runners` to `ci_runners_archived` and start serving data from the new partitioned table. Any remaining orphaned runners would start seeing `403 Forbidden` errors, which means we'd need to work with the affected customers over the next 1-2 months to get those runners replaced. We could possibly build some observability that allows us to tell whether there are new top-level namespaces with orphaned runners. I believe that the list of orphaned runners in contact with the instance will keep getting shorter by itself, since runners created after Oct 1st will contain a `sharding_key_id` value and no longer count as orphaned.
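The filter this plan needs during the backfill is essentially the same predicate as in plan A; assuming `BackfillPartitionedTable` (or the migration enqueueing it) gained a way to scope the copied rows, it would look roughly like:

```ruby
# Sketch: rows eligible to be copied into the partitioned ci_runners_e59bb2812d
# table, i.e. rows that will satisfy the sharding-key check constraint.
# Wiring this into BackfillPartitionedTable is exactly the improvement this
# plan would require.
eligible_for_copy = ->(relation) do
  relation.where(runner_type: 1).or(relation.where.not(sharding_key_id: nil))
end
```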
If we wanted to avoid triggering rate limiting on the job request endpoint, we could consult the `ci_runners_archived` table and return `204 No jobs` in case `sharding_key_id` is NULL there. But this would just be kicking the can down the road. At some point, we need to pull the plug on these runners, as we don't have a reliable way of determining who is running them.