CI partition pruning: cache record_id → partition_id in Redis to fix unpruned Sidekiq queries
## Problem

Several high-frequency Sidekiq workers query CI partitioned tables (`p_ci_pipelines`, `p_ci_builds`) without a `partition_id`, causing full cross-partition scans. This was identified as a primary contributor to the LockManager LWLock saturation on CI replicas in [incident INC-8367](https://gitlab.com/gitlab-com/gl-infra/production/-/work_items/21528) and its [incident review](https://gitlab.com/gitlab-com/gl-infra/production/-/work_items/21534#note_3158996366).

The three highest call-rate offending queries are:

| Query fingerprint | Query | Issue |
|---|---|---|
| `-6756117007090651256` | `SELECT p_ci_pipelines.* WHERE id = $1` | No `partition_id` — Sidekiq workers only enqueue with `pipeline_id` |
| `4087446483174191214` | `SELECT p_ci_builds.* WHERE type IN (...) AND id = $4` | Partition pruning capable but [not wired up in `Ci::Deployable`](https://gitlab.com/gitlab-org/gitlab/-/blob/duo-edit-20260311-163221/app/models/concerns/ci/deployable.rb?plain=1#L18) |
| `7090543644733999668` | `SELECT p_ci_builds.* WHERE token_encrypted IN (...)` | Possibly a fallback path when the partition is missing |

As the number of CI partitions grows, each of these non-pruned queries becomes progressively more expensive — a compounding problem that will worsen over time. Changing all worker signatures to accept a `partition_id` argument is not feasible in the short term: it would require coordinated multi-release changes to every affected worker.

## Proposal

Store `record_id → partition_id` mappings in Redis at record creation time for **all partitioned CI resources** (pipelines, builds, and any future partitioned CI tables), then patch the CI finder methods to transparently hydrate `partition_id` before querying — with no worker signature changes required.

The key insight is that all CI partitioned resources share the same `partition_id` as their parent pipeline (it cascades down via `Ci::Partitionable`).
This means a single pipeline-keyed cache entry covers lookups for all child resources of that pipeline.

### Redis design

Use `Gitlab::Redis::SharedState` (not `Cache`, to survive memory pressure and worker retries).

**For pipelines** — written at pipeline creation time, keyed by `pipeline_id`:

```ruby
# Key pattern — hash-tag on pipeline_id for Redis Cluster slot affinity
key = "ci:partition:pipeline:{#{pipeline_id}}"
Gitlab::Redis::SharedState.with { |r| r.set(key, partition_id, ex: 14.days.to_i) }
```

**For builds and other child resources** — since `partition_id` cascades from the pipeline, child resources can resolve via the pipeline cache. However, for workers that only receive a `build_id` (with no `pipeline_id`), we also store a direct mapping at creation time:

```ruby
# Written at build creation time
key = "ci:partition:build:{#{build_id}}"
Gitlab::Redis::SharedState.with { |r| r.set(key, partition_id, ex: 14.days.to_i) }
```

This pattern generalises to any new partitioned CI resource: at creation time, write `ci:partition:<resource_type>:{<record_id>}` → `partition_id`.

**Key design decisions:**

- **TTL of 14 days** — covers the practical window where Sidekiq jobs referencing a record are still being enqueued. Cold records beyond this window have lower query volume, and the cross-partition scan cost is more acceptable.
- **Hash-tag `{record_id}`** — ensures Redis Cluster slot affinity, consistent with existing patterns (e.g. the Namespace config cache).
- **`SharedState` not `Cache`** — `Cache` can be evicted under memory pressure; a silent miss would degrade to a full scan rather than fail loudly.
- **Value is just the integer `partition_id`** — minimal memory footprint. At ~50 bytes/key, even storing entries for both pipelines and builds is well within bounds.
- **Write on record creation** — single write point per resource type, e.g. an `after_create` callback in `Ci::Partitionable` so all current and future partitioned models get it automatically.
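The creation-time write and key scheme can be sketched standalone. This is a minimal illustration, not the implementation: `FakeRedis` stands in for `Gitlab::Redis::SharedState`, and `partition_cache_key`/`cache_partition_id` are hypothetical helper names.

```ruby
# FakeRedis mimics the subset of the Redis API used here: SET with an expiry, GET.
class FakeRedis
  def initialize
    @store = {}
  end

  def set(key, value, ex:)
    @store[key] = { value: value.to_s, ttl: ex }
  end

  def get(key)
    @store.dig(key, :value)
  end
end

PARTITION_KEY_TTL = 14 * 24 * 60 * 60 # 14.days.to_i, without ActiveSupport

# "ci:partition:<resource_type>:{<record_id>}" — hash-tag for Cluster slot affinity
def partition_cache_key(resource_type, record_id)
  "ci:partition:#{resource_type}:{#{record_id}}"
end

# The single write point per resource type, invoked at record creation time.
def cache_partition_id(redis, resource_type, record_id, partition_id)
  redis.set(partition_cache_key(resource_type, record_id), partition_id, ex: PARTITION_KEY_TTL)
end

redis = FakeRedis.new
cache_partition_id(redis, :pipeline, 42, 107)
cache_partition_id(redis, :build, 900, 107) # same partition_id as the parent pipeline
```

Note that the build entry carries the same `partition_id` as its pipeline, which is what makes the pipeline-keyed lookup path below possible.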
### Finder patch design

Introduce a generic `Ci::PartitionIdResolver` that resolves `partition_id` for any partitioned CI resource type:

```ruby
module Ci
  module PartitionIdResolver
    REDIS_TTL = 14.days.to_i
    KEY_PREFIX = "ci:partition"

    # resource_type: :pipeline, :build, etc.
    def self.resolve(resource_type, record_id)
      key = "#{KEY_PREFIX}:#{resource_type}:{#{record_id}}"
      cached = Gitlab::Redis::SharedState.with { |r| r.get(key) }
      return cached.to_i if cached

      # Cold path: DB fallback — cross-partition scan, but only on cache miss
      model = model_for(resource_type)
      record = model.find_by(id: record_id)
      return nil unless record

      # Backfill the cache on miss
      Gitlab::Redis::SharedState.with { |r| r.set(key, record.partition_id, ex: REDIS_TTL) }
      record.partition_id
    end

    def self.model_for(resource_type)
      { pipeline: Ci::Pipeline, build: Ci::Build }.fetch(resource_type)
    end
  end
end
```

Patch the base finder via a concern included in `Ci::ApplicationRecord` (or `Ci::Partitionable`) so all partitioned models get it automatically:

```ruby
module Ci
  module PartitionAwareFinder
    extend ActiveSupport::Concern

    class_methods do
      # Resolves partition_id from Redis and injects it into the query.
      # resource_type defaults to the model's own type (e.g. :build for Ci::Build),
      # but can be overridden when the caller has a related resource id
      # (e.g. looking up a build by pipeline_id).
      def find_by_id_with_partition(id, resource_type: nil, resource_id: nil)
        resource_type ||= inferred_resource_type
        resource_id ||= id

        partition_id = Ci::PartitionIdResolver.resolve(resource_type, resource_id)

        if partition_id
          find_by(id: id, partition_id: partition_id)
        else
          find_by(id: id) # graceful degradation
        end
      end

      private

      def inferred_resource_type
        name.demodulize.underscore.to_sym # Ci::Build => :build, Ci::Pipeline => :pipeline
      end
    end
  end
end
```

This means:

- `Ci::Pipeline.find_by_id_with_partition(pipeline_id)` — resolves via `ci:partition:pipeline:{pipeline_id}`
- `Ci::Build.find_by_id_with_partition(build_id)` — resolves via `ci:partition:build:{build_id}`
- `Ci::Build.find_by_id_with_partition(build_id, resource_type: :pipeline, resource_id: pipeline_id)` — resolves via the pipeline cache when the worker has a `pipeline_id` available (avoids a separate build cache entry)
- Any future partitioned CI model gets the same behaviour automatically by including `Ci::Partitionable`

### Out of scope for this approach

The `token_encrypted` query (`7090543644733999668`) is a different shape — it is not a lookup by `id` — so the `record_id → partition_id` cache does not directly help. It needs a separate investigation (possibly a `token → partition_id` cache, or fixing the fallback path when a partition is missing). The `Ci::Deployable` wiring for query `4087446483174191214` is a simpler fix that can be done independently.

### Rollout

1. **Phase 1** — Add the Redis write to `Ci::Partitionable` (covers all current and future partitioned models at once) and deploy `PartitionIdResolver`, behind a feature flag. No worker changes.
2. **Phase 2** — Enable the finder patch for the highest call-rate workers first (those driving the `p_ci_pipelines` and `p_ci_builds` scans).
3. **Phase 3 (long term)** — Gradually migrate worker signatures to pass `partition_id` explicitly; the Redis lookup becomes a fast-path confirmation.
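The resolver's hot path, cold-path backfill, and graceful degradation can be exercised without Rails. This is an illustrative sketch under stated assumptions: `FakeRedis` stands in for `Gitlab::Redis::SharedState`, and the `RECORDS` hash stands in for the `model.find_by(id:)` database fallback.

```ruby
# FakeRedis mimics the subset of the Redis API used here (TTL ignored for brevity).
class FakeRedis
  def initialize
    @store = {}
  end

  def set(key, value, ex:)
    @store[key] = value.to_s
  end

  def get(key)
    @store[key]
  end
end

REDIS_TTL = 14 * 24 * 60 * 60

# record_id => partition_id, standing in for the partitioned tables
RECORDS = {
  pipeline: { 42 => 107 },
  build:    { 900 => 107 } # builds share the parent pipeline's partition_id
}.freeze

def resolve(redis, resource_type, record_id)
  key = "ci:partition:#{resource_type}:{#{record_id}}"
  cached = redis.get(key)
  return cached.to_i if cached # hot path: no DB access

  # Cold path: cross-partition DB lookup, then backfill the cache
  partition_id = RECORDS.fetch(resource_type)[record_id]
  return nil unless partition_id # caller degrades to an unpruned find_by

  redis.set(key, partition_id, ex: REDIS_TTL)
  partition_id
end

redis = FakeRedis.new
resolve(redis, :pipeline, 42) # cold path: returns 107 and backfills the cache
resolve(redis, :pipeline, 42) # hot path: served from Redis
```

The `nil` return on an unknown record is what lets `find_by_id_with_partition` fall back to the plain `find_by(id: id)` rather than raise, matching the graceful-degradation branch above.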
## References

- Incident review: https://gitlab.com/gitlab-com/gl-infra/production/-/work_items/21534#note_3158996366
- CI partitioning design doc: https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_data_decay/pipeline_partitioning/
- `Ci::Deployable` unconnected pruning: https://gitlab.com/gitlab-org/gitlab/-/blob/duo-edit-20260311-163221/app/models/concerns/ci/deployable.rb?plain=1#L18
- `Ci::Partitionable` concern: https://gitlab.com/gitlab-org/gitlab/-/blob/920147293ae304639915f66b260dc14e4f629850/app/models/concerns/ci/partitionable.rb