Runner Controller Test Plan
Overview
This issue defines a test plan for the Runner Controller project, specifically covering the rollout targeted for the FY26Q1 release. The plan addresses load testing, staged rollout strategy, validation, and ongoing test ownership.
The Runner Controller project is part of a broader evolution from runner HTTP polling to intelligent, KAS-mediated routing. This test plan focuses on Phase 1: admission control (&19660).
KAS is a new component being introduced and is designed to be horizontally scalable, so we don't expect any surprises. Even so, we need to formalize our test and rollout plan for this feature.
We need to ensure the new architecture performs at an acceptable level compared with the existing one.
Integration Testing
We will implement an integration test as part of the existing scaffolding in the kas repo. The scaffolding builds an image with kas (and agentk) and runs it in Docker (using test containers). We can then add a test that generates load from one or more runners. Rails and Gitaly are mock endpoints, and a real Redis container runs alongside them. This isolates kas from everything else (apart from Redis) and gives us a picture of how it handles the load.
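As a rough illustration of the load-generation part only, here is a minimal Go sketch that uses just the standard library. It does not use the kas scaffolding, Docker, or Redis: a plain `httptest` server stands in for the system under test, and the simulated runners send bare HTTP POST requests. The names `newMockServer` and `runLoad`, the 204 response, and the request shape are all assumptions made for illustration; the real test would target kas through whatever protocol the runner actually uses.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"sync/atomic"
)

// newMockServer returns a stand-in endpoint that answers every request
// with 204 No Content. Hypothetical: it only mimics "something that
// accepts runner requests", not kas or Rails.
func newMockServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNoContent)
	}))
}

// runLoad starts `runners` goroutines that each send `requestsPerRunner`
// POST requests to url. It returns the number of 2xx responses.
func runLoad(url string, runners, requestsPerRunner int) int64 {
	var ok atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < runners; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < requestsPerRunner; j++ {
				resp, err := http.Post(url, "application/json", nil)
				if err != nil {
					continue // count only successful round trips
				}
				resp.Body.Close()
				if resp.StatusCode >= 200 && resp.StatusCode < 300 {
					ok.Add(1)
				}
			}
		}()
	}
	wg.Wait()
	return ok.Load()
}

func main() {
	srv := newMockServer()
	defer srv.Close()

	const runners, perRunner = 8, 25 // arbitrary illustrative values
	ok := runLoad(srv.URL, runners, perRunner)
	fmt.Printf("successful requests: %d of %d\n", ok, runners*perRunner)
}
```

Raising the runner count step by step while watching the success count and response times is one way to approximate where a single instance starts to degrade.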
Load & Performance Testing
- We need to load test the Runner->KAS connectivity to find out how much load a single KAS instance can handle.
- What's the current job scheduling latency? How can we comparatively measure latency in the new feature?
Rollout Strategy
- We've already implemented the FF_USE_JOB_ROUTER feature flag. What will this be scoped to?
- Do we have a pre-prod environment to take this for a spin on first?
- Can we roll this out internally at decent scale?
- What metrics trigger a rollback?
Failure Mode
- What happens if KAS is unavailable?
Acceptance Criteria
- What's the go/no-go checklist before final rollout?
References
- See the existing isolated integration test: https://gitlab.com/gitlab-org/cluster-integration/gitlab-agent/-/tree/master/internal/it