Fix runner job count partition fanout

Why

runnerJobCount is a top latency contributor to the Pipeline Execution error budget (graphql_query SLI 0.955 vs SLO 0.9995; ~12% of calls > 5s). EXPLAIN confirmed the cause is partition fan-out — the per-runner LIMIT 1001 LATERAL has no partition_id filter, so Postgres probes the (runner_id, id) index on every p_ci_builds partition. Partitions are monthly, so fan-out grows ~12/year — a monotonic regression. Not a missing index. Solves part 1 of the 3 issues solved here

Change

Behind runner_job_count_recent_partitions (default off), scope the LATERAL to:

partition_id IN (Ci::Partition.recent_ids)

(current + 2 recent, integer literals → plan-time pruning). Probed partitions capped at 3, constant. No index change, no migration.

Behavior change ⚠️ (please review)

With the flag on, jobCount reflects ~the last 3 months (capped 1,000+), not all-time. The field drives the "Jobs" column in the runner list/detail UI — an "is this runner active?" signal (UI shows 0 / N / 1,000+), not a historical total — so recent-activity scoping is better aligned with intent, and a runner idle > 3 months correctly reads low.

Evidence

Rollout

Enable runner_job_count_recent_partitions gradually; watch graphql_query apdex + runnerJobCount p95. FF issue

Edited by Shabini Rajadas

Merge request reports

Loading