Improve QA tests to more thoroughly cover runner manager activity

In the recent gitlab-com/gl-infra/production#18792 (closed) S2 incident, customers were unable to register/update certain runner managers due to a prematurely added FK constraint added to the new ci_runner_machines_687967fa8a table. Even though this table is not yet being used, record changes are synced from ci_runner_machines to it. So when a record was updated and the matching runner didn't exist in ci_runners_e59bb2812d, customers saw an HTTP 500 error:

SQL error in Kibana logs

PG::ForeignKeyViolation: ERROR:  insert or update on table "group_type_ci_runner_machines_687967fa8a" violates foreign key constraint "fk_rails_3f92913d27"
DETAIL:  Key (runner_id, runner_type)=(41684082, 2) is not present in table "group_type_ci_runners_e59bb2812d".
CONTEXT:  SQL statement "INSERT INTO ci_runner_machines_687967fa8a ("id",
    "runner_id",
    "sharding_key_id",
    "created_at",
    "updated_at",
    "contacted_at",
    "creation_state",
    "executor_type",
    "runner_type",
    "config",
    "system_xid",
    "platform",
    "architecture",
    "revision",
    "ip_address",
    "version")
  VALUES (NEW."id",
    NEW."runner_id",
    NEW."sharding_key_id",
    NEW."created_at",
    NEW."updated_at",
    NEW."contacted_at",
    NEW."creation_state",
    NEW."executor_type",
    NEW."runner_type",
    NEW."config",
    NEW."system_xid",
    NEW."platform",
    NEW."architecture",
    NEW."revision",
    NEW."ip_address",
    NEW."version")"
PL/pgSQL function table_sync_function_e438f29263() line 24 at SQL statement

The fact that not a single error was logged in the staging environment in the run-up to the production deployment leads me to think that we have a test under-coverage of this scenario:

Having these tests would likely have helped uncover the scenario in gitlab-com/gl-infra/production#18679 (closed) before it hit production too.

Proposal

Try to emulate this scenario in staging QA, where we hit certain REST endpoints (for example POST /runners, POST /runners/verify, POST /jobs/request, and possibly DELETE /runners) for many different runners (instance, group, and project types), both for recent runners and for stale ones. Ideally, we would also have an alerting system in place for errors resulting from the REST calls.

Test 1

To set up for this scenario reproducing what happened in gitlab-com/gl-infra/production#18679 (closed), apply the following patch:

diff --git a/app/models/ci/runner.rb b/app/models/ci/runner.rb
index 1b066ace9994..24b0e126c8c5 100644
--- a/app/models/ci/runner.rb
+++ b/app/models/ci/runner.rb
@@ -523,10 +523,12 @@ def compute_token_expiration
 
     def ensure_manager(system_xid)
       # rubocop: disable Performance/ActiveRecordSubtransactionMethods -- This is used only in API endpoints outside of transactions
-      RunnerManager.safe_find_or_create_by!(runner_id: id, system_xid: system_xid.to_s) do |m|
-        m.runner_type = runner_type
-        m.sharding_key_id = sharding_key_id
-      end
+      RunnerManager.safe_find_or_create_by!(
+        runner_id: id,
+        runner_type: runner_type,
+        sharding_key_id: sharding_key_id,
+        system_xid: system_xid.to_s,
+        &blk)
       # rubocop: enable Performance/ActiveRecordSubtransactionMethods
     end

This scenario seems to be covered already by the qa/specs/features/browser_ui/4_verify/runner/register_runner_spec.rb test, although this test didn't fail on the original MR (it seems the tests weren't selected to run for some reason, I've left feedback in gitlab-org/quality/quality-engineering&47 (comment 2243025179)).

We should also check that created group and instance runners and their runner managers are online, not just project ones.

Tasks

Test registering a runner with a deprecated runner registration token (POST /runners)
Test deleting a runner with its authentication token (DELETE /runners)
Test verifying a runner with its authentication token (POST /runners/verify)
Test having a registered runner successfully request jobs (POST /jobs/request)
Test registering an instance runner with an authentication token (POST /users/runners)
Add validation to visit runner page and confirm that the runner managers are online to register_group_runner_spec.rb

Test 2

NOTE: I don't think we'll be able to cover this very specific scenario, since it depends on having a particular state in the database:

To set up for this scenario reproducing what happened in gitlab-com/gl-infra/production#18792 (closed), run the following against the gitlabhq_development_ci database locally:

An existing runner which is present in the ci_runners table, but not on the new ci_runners_e59bb2812d one.
When a runner manager is registered against this existing runner, the app will try to create a runner manager in the ci_runner_machines table, triggering a sync function to create an identical record in the new ci_runner_machines_687967fa8a. The problem is that the constraint above will fail due to the missing ci_runners_e59bb2812d record.

ALTER TABLE ONLY group_type_ci_runner_machines_687967fa8a
  ADD CONSTRAINT fk_rails_3f92913d27 FOREIGN KEY (runner_id, runner_type) REFERENCES group_type_ci_runners_e59bb2812d
    (id, runner_type) ON UPDATE CASCADE ON DELETE CASCADE;

We'd also need to add a similar spec for instance runners, which don't seem to be covered at the moment.

Edited Jan 15, 2025 by Pedro Pombeiro