Adaptive request concurrency scales up under 429/5xx back-pressure instead of down
## Summary

When `FF_USE_ADAPTIVE_REQUEST_CONCURRENCY` is on (the default), the controller treats requests that succeeded only after the network retry layer retried past a 429 or 5xx as evidence of server capacity, and scales concurrency up. Under sustained back-pressure with jobs available, the adaptive limit climbs to the `request_concurrency` cap and stays there.

## Steps to reproduce

1. `FF_USE_ADAPTIVE_REQUEST_CONCURRENCY = 1` (default).
2. `request_concurrency > 1`.
3. The GitLab API returns retriable responses (429 with `Retry-After`, or a 5xx) that the retry layer absorbs and eventually succeeds with a 201 carrying a job. This is the shape of a runner hitting a rate-limited `/api/v4/jobs/request` while jobs are available, or hitting a GitLab instance with transient 5xxs.

## Actual behavior

On a 429 (or retriable 5xx) followed by a successful retry returning a 201 with a job:

- `network/retry_requester.go` honours `Retry-After` / `RateLimit-ResetTime` within that request's retry sequence.
- The final 201 reaches `commands/multi.go`, which calls `buildsHelper.releaseRequest(runner, jobData != nil)` with `hasJob=true`.
- `releaseRequest` can't tell a clean first-try 201 from a retry-recovered 201, and treats both as a successful pickup, scaling the adaptive limit up by 10%.

Over sustained pressure the limit hits `request_concurrency` and stays there, so the runner keeps issuing the maximum number of concurrent requests even though the server is signalling slow-down on every one. This is not specific to 429: the same path applies to retriable 5xxs.

## Expected behavior

A request that needed retries is not evidence of capacity; the server declined the first attempt. The controller should:

1. Not scale up on retried requests.
2. Actively shrink concurrency on an explicit slow-down signal (AIMD multiplicative decrease), so a runner sitting at the cap during a healthy window collapses back toward the floor once pressure begins, rather than taking many small steps.

## Possible fixes

Two code paths plus a signal between them:

1. `network/retry_requester.go`: expose whether the retry layer had to retry during a request. A context-scoped tracker (`network.WithRetryTracker(ctx) (ctx, *atomic.Bool)`) mirrors `net/http/httptrace.WithClientTrace` and avoids rippling a new return value through `client.do`, `doJSON`, `doMeasuredJSON`, `RequestJob`, and `common.Network`. A sketch of what such a tracker could look like follows this list.
2. `commands/builds_helper.go`: `releaseRequest` currently uses `hasJob` only. Extend it to also read the retry signal (see the second sketch below):
   - `hasJob && !retried`: ×1.1 (unchanged, clean first-try success).
   - `retried`: ×0.5 (AIMD multiplicative decrease on explicit slow-down).
   - Otherwise: ×0.95 (unchanged, empty or failed clean response).
3. `commands/multi.go`: install the tracker around the `RequestJob` call and read it when calling `releaseRequest`.
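To make fix 1 concrete, here is a minimal sketch of what a context-scoped tracker could look like. `WithRetryTracker`, `MarkRetried`, and `retryTrackerKey` are hypothetical names, not existing runner APIs; the point is only that the retry layer can flip a flag visible to the caller without changing any function signatures along the way.

```go
// Hypothetical sketch of a context-scoped retry tracker for the network
// package. None of these identifiers exist in the runner today.
package network

import (
	"context"
	"sync/atomic"
)

type retryTrackerKey struct{}

// WithRetryTracker returns a derived context carrying an atomic flag, plus
// the flag itself so the caller can read it once the request completes.
func WithRetryTracker(ctx context.Context) (context.Context, *atomic.Bool) {
	retried := &atomic.Bool{}
	return context.WithValue(ctx, retryTrackerKey{}, retried), retried
}

// MarkRetried is what the retry layer would call just before retrying a
// request after a 429/5xx. It is a no-op when no tracker was installed.
func MarkRetried(ctx context.Context) {
	if retried, ok := ctx.Value(retryTrackerKey{}).(*atomic.Bool); ok {
		retried.Store(true)
	}
}
```

The caller in `commands/multi.go` would then install the tracker on the request context before `RequestJob` and pass `retried.Load()` alongside `hasJob` into `releaseRequest`, which keeps the signal out of every intermediate return value.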
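And a minimal, self-contained sketch of the factor table fix 2 describes. `nextAdaptiveLimit` and its parameters are hypothetical stand-ins for whatever state `releaseRequest` actually keeps; only the (hasJob, retried) → factor mapping comes from the proposal above.

```go
// Illustrative AIMD adjustment: pick a multiplier from the (hasJob, retried)
// pair and clamp the result between the floor and the request_concurrency cap.
package main

import "fmt"

func nextAdaptiveLimit(current, floor, capLimit float64, hasJob, retried bool) float64 {
	var factor float64
	switch {
	case retried:
		factor = 0.5 // explicit slow-down signal: multiplicative decrease
	case hasJob:
		factor = 1.1 // clean first-try success carrying a job: scale up
	default:
		factor = 0.95 // clean but empty/failed response: gentle decay
	}
	next := current * factor
	if next < floor {
		next = floor
	}
	if next > capLimit {
		next = capLimit
	}
	return next
}

func main() {
	// A runner sitting at the cap that sees three retried requests in a row
	// collapses back toward the floor instead of stepping down slowly.
	limit := 10.0
	for i := 1; i <= 3; i++ {
		limit = nextAdaptiveLimit(limit, 1, 10, true, true)
		fmt.Printf("after retried request %d: %.2f\n", i, limit)
	}
}
```

The ×0.5 step is the usual AIMD trade-off: drop quickly while the server is asking for slow-down, then let the unchanged ×1.1 path re-probe capacity once clean first-try responses return.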