Pipeline Execution error budget: graphql_query latency fix
## Summary
The [Pipeline Execution stage group](https://dashboards.gitlab.net/d/stage-groups-detail-pipeline_execution/67fd354?orgId=1&from=now-28d%2Fm&to=now%2Fm&timezone=utc&var-PROMETHEUS_DS=mimir-gitlab-gprd&var-environment=gprd&var-stage=main) is over its 28-day error budget
(availability **0.99894** vs SLO **0.9995**). The burn is almost
entirely one component — `graphql_query` (0.9552) — and it's
**latency, not errors** (component apdex ≈ 0.912, error ratio ≈
0.00007).
Three GraphQL operations drive the spend; this issue tracks fixing them.
---
## Evidence
**Grafana / Mimir** (28d, `stage_group="pipeline_execution"`, gprd):
| Component | Availability |
|---|---|
| `graphql_query` | 0.9552 |
| `pipeline_creation_sidekiq` | 0.9988 |
| `rails_request` | 0.9996 |
| (all others) | ≥ 0.9998 |
Dashboard: https://dashboards.gitlab.net/d/stage-groups-detail-pipeline_execution/ (28d)
**Kibana** (`pubsub-rails-inf-gprd-*`, GraphQL ops in PE feature
categories, ranked by total time):
| Operation | calls/6h | avg | p95 | max | >5s/6h |
|---|---|---|---|---|---|
| `runnerJobCount` | 100k | 1.46s | 7.7s | 18.8s | 11,366 |
| `getPipelines` | 555k | 0.98s | 1.82s | 30.4s | 264 (+ execution timeouts) |
| `getPipelineDetails` | 2.15M | 0.33s | 0.89s | 26.8s | 1,397 |
`runnerJobCount` latency distribution (7d, n ≈ 3.04M): ~11.6% of calls
> 5s (~70k slow calls per weekday), p95 ~7.7s, tail > 20s.
---
## Root cause per operation
### 1. `runnerJobCount` — partition fan-out (confirmed via Database Lab EXPLAIN)
Resolver `Resolvers::Ci::RunnerJobCountResolver` runs, per runner
(`BatchLoader` → `id IN (...)`), a:
```sql
LATERAL SELECT 1 FROM p_ci_builds WHERE runner_id = ci_runners.id LIMIT 1001
```
`EXPLAIN (ANALYZE)` confirms the `(runner_id, id)` index exists and is
used (Index Only Scan on every partition) — this is **not** a
missing-index problem. The cost is **partition fan-out**: no
`partition_id` predicate → it probes the index on every `p_ci_builds`
partition; most return 0 rows for a given runner (~110ms wasted per
probe), matches concentrate in the hot recent partition (heap-fetch
penalty). Amplified by the `BatchLoader` (N runners × M partitions per
request).
Clone: ~380ms for one runner; prod p95 ~7.7s from batching × fan-out ×
cold cache.
- EXPLAIN: https://console.postgres.ai/gitlab/gitlab-production-ci/sessions/52718
- Files: `app/graphql/resolvers/ci/runner_job_count_resolver.rb`,
`app/graphql/types/ci/runner_type.rb` (`JOB_COUNT_LIMIT = 1000`)
> **Note:** the field is documented as "limited to 1000, plus one to
> indicate more exist" — an exact all-time count is not required;
> approximate/capped/cached is acceptable.
### 2. `getPipelines` — over-fetching hitting execution timeouts
Complexity ~161, depth 7, ~95 fields, `first: 15`, resolving per-pipeline:
`stages → detailedStatus`, `commit` (Gitaly), `mergeRequest`,
`failedJobsCount`, `downstream`/`upstream`. Slowest calls (20–30s) hit
the GraphQL execution timeout (`graphql_errors` full of `Timeout` on
`DetailedStatus.id` / `Pipeline.mergeRequest` / `Pipeline.commit` /
`BaseConnection.pageInfo`). Highest-volume, broad apdex burn.
### 3. `getPipelineDetails` — volume + known DB regression
~2.15M calls/6h; modest per-call latency but dominates total time.
Relates to a previously-identified `getPipelineDetails` DB regression.
---
## Proposed next steps
- [ ] **`runnerJobCount`** (highest-confidence fix): eliminate the
cross-partition fan-out. Evaluate, in order:
- (a) bound the scan to active/recent `partition_id`(s)
- (b) cached/denormalized capped counter (counter table / Redis /
column) read by the resolver
- (c) statistics-based approximate count above the cap
Not an index change. Get DB-reviewer sign-off before any migration
(hot CI path).
- [ ] **`getPipelines`**: profile and reduce per-pipeline field cost
(async/bounded resolution of `failedJobsCount`, `detailedStatus`,
`commit`); confirm via EXPLAIN/field timings which fields blow the
execution timeout.
- [ ] **`getPipelineDetails`**: confirm current state of the DB
regression and whether volume warrants caching/field trimming.
- [ ] Add/confirm SLI alerting so this regresses loudly next time
rather than only showing on the 28d budget.
---
## Acceptance criteria
- [ ] `runnerJobCount` p95 < ~1s and >5s calls reduced by ≥ 90%
(Kibana: `json.meta.caller_id: "graphql:runnerJobCount"`,
`json.duration_s`)
- [ ] `EXPLAIN` of the new `runnerJobCount` query on a Database Lab
clone shows **no `Append` scaling with partition count**, for a single
runner and a batched `id IN (...)`
- [ ] `pipeline_execution` `graphql_query` component apdex recovers and
28d availability returns above **0.9995**
---
## How to verify
1. Re-run `EXPLAIN` via Database Lab (`ci` / `gitlab-production-ci`)
for the new queries.
2. Watch p95 / >5s counts in Kibana and the stage-group dashboard apdex.
---
## References
- Stage-group dashboard (28d): https://dashboards.gitlab.net/d/stage-groups-detail-pipeline_execution/
- Database Lab EXPLAIN (`runnerJobCount`): https://console.postgres.ai/gitlab/gitlab-production-ci/sessions/52718
- Google doc detailed explanation: https://docs.google.com/document/d/1IePuyjUwib-vWhAb1xcv7zcVab5_SnrwUIB4EKcX73E/edit?tab=t.0
/cc @pipeline-execution
issue