Remove code that ensures that a runner exists in partitioned table

!179634 (merged) initially removed some code that ensures that a runner exists in the partitioned runners table before trying to create an attached runner manager. This caused an incident in production, as seen in Sentry. The change was reverted in !179990 (merged). This issue is about understanding the failures and retrying the removal of the code. We now know that the affected runners are orphaned and haven't been assigned jobs since at least October 2024.

So in this issue, we should:

  • Put the removal of the code behind a feature flag (FF) for a quicker rollback.
  • Add temporary code to explicitly deny job/verification requests from runners missing a required sharding_key_id. This mimics the behavior we'll see when we swap the ci_runners table, and avoids triggering an incident due to HTTP 500 errors.
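
As a rough illustration of how the two items above could fit together, here is a minimal, framework-free Ruby sketch. Every name in it (Runner, FeatureFlags, handle_runner_request, the flag name) is hypothetical and not taken from the GitLab codebase, the partitioned table is modeled as a plain Set of IDs, and the 403 status is an assumption, since this issue does not name the response code to use.

```ruby
require 'set'

# Hypothetical, framework-free sketch. None of these names are taken from
# the GitLab codebase; they only illustrate the two items listed above.
Runner = Struct.new(:id, :runner_type, :sharding_key_id, keyword_init: true)

# The queries in this issue exclude runner_type 1 when looking for
# runners that should have a sharding key.
RUNNER_TYPE_WITHOUT_SHARDING_KEY = 1

# Assumed flag name, for illustration only.
REMOVAL_FLAG = :remove_ensure_runner_in_partitioned_table

# Minimal stand-in for a feature flag lookup.
class FeatureFlags
  def initialize(*enabled)
    @enabled = enabled
  end

  def enabled?(name)
    @enabled.include?(name)
  end
end

# True for group/project runners that have no sharding key.
def missing_sharding_key?(runner)
  runner.runner_type != RUNNER_TYPE_WITHOUT_SHARDING_KEY &&
    runner.sharding_key_id.nil?
end

# Returns an HTTP-like status code for a job/verification request.
# partitioned_ids is a Set standing in for the partitioned runners table.
def handle_runner_request(runner, flags:, partitioned_ids:)
  # Temporary guard: reject explicitly instead of failing later with a 500.
  # 403 is an assumption; the issue does not name a status code.
  return 403 if missing_sharding_key?(runner)

  # The legacy "ensure the runner exists" step stays reachable while the
  # flag is off, so disabling the flag acts as the rollback.
  partitioned_ids.add(runner.id) unless flags.enabled?(REMOVAL_FLAG)

  200
end
```

In this sketch, a project runner with a NULL sharding_key_id is rejected whether or not the flag is enabled, while a runner with a sharding key only goes through the legacy "ensure" step when the flag is disabled. Whether the deny guard itself should also sit behind the flag is left open here.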

Observations

There are group/project runners in ci_runners that are missing a sharding_key_id and therefore weren't included in the backfill migration for ci_runners_e59bb2812d.

 info    Data state at: 2025-02-03 08:45:36 UTC

--- Query some known runners affected in incident on ci_runners
gitlabhq_dblab> SELECT id, runner_type, sharding_key_id, contacted_at FROM ci_runners WHERE id IN (156866, 157982, 36547295, 33304630, 14391666, 1926255, 16227710)
+----------+-------------+-----------------+----------------------------+
| id       | runner_type | sharding_key_id | contacted_at               |
|----------+-------------+-----------------+----------------------------|
| 156866   | 3           | <null>          | 2025-02-03 08:31:21.552325 |
| 157982   | 3           | <null>          | 2025-02-03 08:39:02.824217 |
| 1926255  | 3           | <null>          | 2025-02-03 08:39:54.608618 |
| 14391666 | 2           | <null>          | 2025-02-03 08:05:29.741258 |
| 16227710 | 2           | <null>          | 2025-02-03 08:34:12.158027 |
| 33304630 | 3           | <null>          | 2024-10-18 06:20:04.575372 |
| 36547295 | 2           | <null>          | 2024-12-18 16:26:03.853026 |
+----------+-------------+-----------------+----------------------------+
SELECT 7
Time: 0.225s

--- Query the same runners on ci_runners_e59bb2812d (missing because they don't have a sharding_key_id)
gitlabhq_dblab> SELECT id, runner_type, sharding_key_id, contacted_at FROM ci_runners_e59bb2812d WHERE id IN (156866, 157982, 36547295, 33304630, 14391666, 1926255, 16227710)
+----+-------------+-----------------+--------------+
| id | runner_type | sharding_key_id | contacted_at |
|----+-------------+-----------------+--------------|
+----+-------------+-----------------+--------------+

--- Looking at e.g. the project runners, it is clear they are orphaned (no ci_runner_projects record ties them to any project):
gitlabhq_dblab> SELECT * FROM ci_runner_projects WHERE runner_id IN (156866, 157982, 36547295, 1926255)
+----+-----------+------------+------------+------------+
| id | runner_id | created_at | updated_at | project_id |
|----+-----------+------------+------------+------------|
+----+-----------+------------+------------+------------+

gitlabhq_dblab> SELECT MAX(updated_at) AS last_updated_at, COUNT(*) FROM ci_runners WHERE ci_runners.runner_type <> 1 AND ci_runners.sharding_key_id IS NULL AND ci_runners.id NOT IN (SELECT id FROM ci_runners_e59bb2812d) AND contacted_at >= '2025-01-23'
+----------------------------+-------+
| last_updated_at            | count |
|----------------------------+-------|
| 2024-09-30 13:54:30.002781 | 3712  |
+----------------------------+-------+

On December 17, 2024, we added code that ensures the runner is present in the partitioned table any time the associated runner manager contacts GitLab.com (the same situation in which the incident was observed). It's unclear why we'd still see errors in production unless new runners are still being created with a NULL sharding_key_id, which doesn't seem to be the case, given that the last update to a runner in this situation happened on 2024-09-30 13:54:30 UTC.

Edited by Pedro Pombeiro