Fix runner job count partition fanout (!241016) · Merge requests · GitLab.org / GitLab

Why

runnerJobCount is a top latency contributor to the Pipeline Execution error budget (graphql_query SLI 0.955 vs SLO 0.9995; ~12% of calls > 5s). EXPLAIN confirmed the cause is partition fan-out: the per-runner LIMIT 1001 LATERAL has no partition_id filter, so Postgres probes the (runner_id, id) index on every p_ci_builds partition. Partitions are monthly, so fan-out grows by ~12 partitions per year, a regression that worsens monotonically over time. This is not a missing index. This MR solves part 1 of the 3 issues solved here.
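The growth described above can be put into numbers with a toy model. This is only a sketch: `partitions_probed` is a hypothetical helper, not code from this MR, and it assumes one index probe per partition per runner, which is what the EXPLAIN output suggests.

```ruby
# Toy model (hypothetical helper, not part of the MR): how many partitions
# Postgres has to probe per runner, with and without a cap on recent partitions.
def partitions_probed(total_partitions, recent_limit: nil)
  # Without a partition_id filter every partition is probed.
  return total_partitions if recent_limit.nil?

  # With a filter, at most `recent_limit` partitions are probed.
  [total_partitions, recent_limit].min
end

# Assumed starting point: ~13 partitions today, 12 more after a year.
today = 13
next_year = today + 12

puts partitions_probed(today)                      # unfiltered, today
puts partitions_probed(next_year)                  # unfiltered, one year later
puts partitions_probed(next_year, recent_limit: 3) # filtered, stays constant
```

Under these assumptions the unfiltered probe count nearly doubles within a year, while a filtered query stays flat regardless of how many partitions exist.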

Change

Behind runner_job_count_recent_partitions (default off), scope the LATERAL to:

partition_id IN (Ci::Partition.recent_ids)

(the current partition plus the 2 most recent, passed as integer literals so Postgres can prune partitions at plan time). Probed partitions are capped at a constant 3. No index change, no migration.
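A minimal sketch of how such a filter could be rendered with integer literals follows. Both helpers are hypothetical: `recent_partition_ids` is a plain-Ruby stand-in for `Ci::Partition.recent_ids` (whose real implementation is not shown here), and it assumes that the newest partitions have the highest ids.

```ruby
# Hypothetical stand-in for Ci::Partition.recent_ids: pick the newest
# `limit` partition ids, assuming newer partitions have higher ids.
def recent_partition_ids(all_ids, limit: 3)
  all_ids.map { |id| Integer(id) }.sort.last(limit)
end

# Render the filter with integer literals (no bind parameters), so the
# planner can see the values and prune partitions at plan time.
def partition_filter_sql(ids)
  raise ArgumentError, 'at least one partition id is required' if ids.empty?

  # Integer() rejects anything that is not a whole number, which keeps
  # the interpolated literals safe.
  literals = ids.map { |id| Integer(id) }
  "partition_id IN (#{literals.join(', ')})"
end

known_ids = (100..112).to_a # assumed example ids, 13 partitions
puts partition_filter_sql(recent_partition_ids(known_ids))
# prints: partition_id IN (110, 111, 112)
```

Interpolating validated integers instead of binding parameters is the design choice that matters here: values known at plan time allow pruning before execution rather than during it.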

Behavior change ⚠️ (please review)

With the flag on, jobCount reflects ~the last 3 months (capped 1,000+), not all-time. The field drives the "Jobs" column in the runner list/detail UI — an "is this runner active?" signal (UI shows 0 / N / 1,000+), not a historical total — so recent-activity scoping is better aligned with intent, and a runner idle > 3 months correctly reads low.
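The 0 / N / 1,000+ display can be sketched as below. This is an illustration under assumptions, not the actual UI code: `job_count_label` is a hypothetical helper, and it presumes the query fetches at most 1,001 rows so that "exactly 1,000" can be told apart from "more than 1,000".

```ruby
JOB_COUNT_CAP = 1_000 # assumed display cap, matching the "1,000+" label

# Hypothetical helper: turn the number of rows returned by a LIMIT 1001
# query into the label shown in the "Jobs" column.
def job_count_label(rows_fetched)
  raise ArgumentError, 'row count cannot be negative' if rows_fetched.negative?

  # 1,001 rows means there are more than 1,000 jobs; the exact total is unknown.
  return '1,000+' if rows_fetched > JOB_COUNT_CAP

  # Otherwise show the exact count with a thousands separator.
  rows_fetched.to_s.reverse.scan(/\d{1,3}/).join(',').reverse
end

puts job_count_label(0)     # prints: 0
puts job_count_label(42)    # prints: 42
puts job_count_label(1_000) # prints: 1,000
puts job_count_label(1_001) # prints: 1,000+
```

Because the label saturates at 1,000+, scoping the count to recent partitions only changes what is displayed for runners with fewer than 1,001 jobs in the scoped window.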

Evidence

Latency (Kibana, p95 of json.duration_s, 7d): Kibana link — weekday p95 ~8s (max ~12s), median ~0.24s; ~70k calls > 5s/weekday
EXPLAIN before (fan-out, ~13 partitions): https://console.postgres.ai/gitlab/gitlab-production-ci/sessions/52988/commands/154665
EXPLAIN after (pruned to 3 partitions): https://console.postgres.ai/gitlab/gitlab-production-ci/sessions/52988/commands/154666

Rollout

Enable runner_job_count_recent_partitions gradually; watch graphql_query apdex + runnerJobCount p95. FF issue

Edited Jun 17, 2026 by Shabini Rajadas
