Pipeline Execution error budget: graphql_query latency fix
## Summary The [Pipeline Execution stage group](https://dashboards.gitlab.net/d/stage-groups-detail-pipeline_execution/67fd354?orgId=1&from=now-28d%2Fm&to=now%2Fm&timezone=utc&var-PROMETHEUS_DS=mimir-gitlab-gprd&var-environment=gprd&var-stage=main) is over its 28-day error budget (availability **0.99894** vs SLO **0.9995**). The burn is almost entirely one component — `graphql_query` (0.9552) — and it's **latency, not errors** (component apdex ≈ 0.912, error ratio ≈ 0.00007). Three GraphQL operations drive the spend; this issue tracks fixing them. --- ## Evidence **Grafana / Mimir** (28d, `stage_group="pipeline_execution"`, gprd): | Component | Availability | |---|---| | `graphql_query` | 0.9552 | | `pipeline_creation_sidekiq` | 0.9988 | | `rails_request` | 0.9996 | | (all others) | ≥ 0.9998 | Dashboard: https://dashboards.gitlab.net/d/stage-groups-detail-pipeline_execution/ (28d) **Kibana** (`pubsub-rails-inf-gprd-*`, GraphQL ops in PE feature categories, ranked by total time): | Operation | calls/6h | avg | p95 | max | >5s/6h | |---|---|---|---|---|---| | `runnerJobCount` | 100k | 1.46s | 7.7s | 18.8s | 11,366 | | `getPipelines` | 555k | 0.98s | 1.82s | 30.4s | 264 (+ execution timeouts) | | `getPipelineDetails` | 2.15M | 0.33s | 0.89s | 26.8s | 1,397 | `runnerJobCount` latency distribution (7d, n ≈ 3.04M): ~11.6% of calls > 5s (~70k slow calls per weekday), p95 ~7.7s, tail > 20s. --- ## Root cause per operation ### 1. `runnerJobCount` — partition fan-out (confirmed via Database Lab EXPLAIN) Resolver `Resolvers::Ci::RunnerJobCountResolver` runs, per runner (`BatchLoader` → `id IN (...)`), a: ```sql LATERAL SELECT 1 FROM p_ci_builds WHERE runner_id = ci_runners.id LIMIT 1001 ``` `EXPLAIN (ANALYZE)` confirms the `(runner_id, id)` index exists and is used (Index Only Scan on every partition) — this is **not** a missing-index problem. The cost is **partition fan-out**: no `partition_id` predicate → it probes the index on every `p_ci_builds` partition; most return 0 rows for a given runner (~110ms wasted per probe), matches concentrate in the hot recent partition (heap-fetch penalty). Amplified by the `BatchLoader` (N runners × M partitions per request). Clone: ~380ms for one runner; prod p95 ~7.7s from batching × fan-out × cold cache. - EXPLAIN: https://console.postgres.ai/gitlab/gitlab-production-ci/sessions/52718 - Files: `app/graphql/resolvers/ci/runner_job_count_resolver.rb`, `app/graphql/types/ci/runner_type.rb` (`JOB_COUNT_LIMIT = 1000`) > **Note:** the field is documented as "limited to 1000, plus one to > indicate more exist" — an exact all-time count is not required; > approximate/capped/cached is acceptable. ### 2. `getPipelines` — over-fetching hitting execution timeouts Complexity ~161, depth 7, ~95 fields, `first: 15`, resolving per-pipeline: `stages → detailedStatus`, `commit` (Gitaly), `mergeRequest`, `failedJobsCount`, `downstream`/`upstream`. Slowest calls (20–30s) hit the GraphQL execution timeout (`graphql_errors` full of `Timeout` on `DetailedStatus.id` / `Pipeline.mergeRequest` / `Pipeline.commit` / `BaseConnection.pageInfo`). Highest-volume, broad apdex burn. ### 3. `getPipelineDetails` — volume + known DB regression ~2.15M calls/6h; modest per-call latency but dominates total time. Relates to a previously-identified `getPipelineDetails` DB regression. --- ## Proposed next steps - [ ] **`runnerJobCount`** (highest-confidence fix): eliminate the cross-partition fan-out. Evaluate, in order: - (a) bound the scan to active/recent `partition_id`(s) - (b) cached/denormalized capped counter (counter table / Redis / column) read by the resolver - (c) statistics-based approximate count above the cap Not an index change. Get DB-reviewer sign-off before any migration (hot CI path). - [ ] **`getPipelines`**: profile and reduce per-pipeline field cost (async/bounded resolution of `failedJobsCount`, `detailedStatus`, `commit`); confirm via EXPLAIN/field timings which fields blow the execution timeout. - [ ] **`getPipelineDetails`**: confirm current state of the DB regression and whether volume warrants caching/field trimming. - [ ] Add/confirm SLI alerting so this regresses loudly next time rather than only showing on the 28d budget. --- ## Acceptance criteria - [ ] `runnerJobCount` p95 < ~1s and >5s calls reduced by ≥ 90% (Kibana: `json.meta.caller_id: "graphql:runnerJobCount"`, `json.duration_s`) - [ ] `EXPLAIN` of the new `runnerJobCount` query on a Database Lab clone shows **no `Append` scaling with partition count**, for a single runner and a batched `id IN (...)` - [ ] `pipeline_execution` `graphql_query` component apdex recovers and 28d availability returns above **0.9995** --- ## How to verify 1. Re-run `EXPLAIN` via Database Lab (`ci` / `gitlab-production-ci`) for the new queries. 2. Watch p95 / >5s counts in Kibana and the stage-group dashboard apdex. --- ## References - Stage-group dashboard (28d): https://dashboards.gitlab.net/d/stage-groups-detail-pipeline_execution/ - Database Lab EXPLAIN (`runnerJobCount`): https://console.postgres.ai/gitlab/gitlab-production-ci/sessions/52718 - Google doc detailed explanation: https://docs.google.com/document/d/1IePuyjUwib-vWhAb1xcv7zcVab5_SnrwUIB4EKcX73E/edit?tab=t.0 /cc @pipeline-execution
issue