CI partition pruning: cache record_id → partition_id in Redis to fix unpruned Sidekiq queries
## Problem
Several high-frequency Sidekiq workers query CI partitioned tables (`p_ci_pipelines`, `p_ci_builds`) without a `partition_id`, causing full cross-partition scans. This was identified as a primary contributor to the LockManager LWLock saturation on CI replicas in [incident INC-8367](https://gitlab.com/gitlab-com/gl-infra/production/-/work_items/21528) and its [incident review](https://gitlab.com/gitlab-com/gl-infra/production/-/work_items/21534#note_3158996366).
The three highest call-rate offending queries are:
| Query fingerprint | Query | Issue |
|---|---|---|
| `-6756117007090651256` | `SELECT p_ci_pipelines.* WHERE id = $1` | No `partition_id` — Sidekiq workers only enqueue with `pipeline_id` |
| `4087446483174191214` | `SELECT p_ci_builds.* WHERE type IN (...) AND id = $4` | Partition pruning capable but [not wired up in `Ci::Deployable`](https://gitlab.com/gitlab-org/gitlab/-/blob/duo-edit-20260311-163221/app/models/concerns/ci/deployable.rb?plain=1#L18) |
| `7090543644733999668` | `SELECT p_ci_builds.* WHERE token_encrypted IN (...)` | Possibly a fallback path when partition is missing |
As the number of CI partitions grows, each unpruned query must scan every partition, so the cost compounds and will keep worsening over time.
Changing all worker signatures to accept a `partition_id` argument is not feasible in the short term: it requires coordinated multi-release changes to every affected worker.
## Proposal
Store `record_id → partition_id` mappings in Redis at record creation time for **all partitioned CI resources** (pipelines, builds, and any future partitioned CI tables), then patch the CI finder methods to transparently hydrate `partition_id` before querying — with no worker signature changes required.
The key insight is that all CI partitioned resources share the same `partition_id` as their parent pipeline (it cascades down via `Ci::Partitionable`). This means a single pipeline-keyed cache entry covers lookups for all child resources of that pipeline.
### Redis design
Use `Gitlab::Redis::SharedState` rather than `Gitlab::Redis::Cache`, so entries are not evicted under memory pressure and remain available across Sidekiq retries.
**For pipelines** — written at pipeline creation time, keyed by `pipeline_id`:
```ruby
# Key pattern — hash-tag on pipeline_id for Redis Cluster slot affinity
key = "ci:partition:pipeline:{#{pipeline_id}}"
Gitlab::Redis::SharedState.with { |r| r.set(key, partition_id, ex: 14.days.to_i) }
```
**For builds and other child resources** — since `partition_id` cascades from the pipeline, child resources can resolve via the pipeline cache. However, for workers that only receive a `build_id` (with no `pipeline_id`), we also store a direct mapping at creation time:
```ruby
# Written at build creation time
key = "ci:partition:build:{#{build_id}}"
Gitlab::Redis::SharedState.with { |r| r.set(key, partition_id, ex: 14.days.to_i) }
```
This pattern generalises to any new partitioned CI resource: at creation time, write `ci:partition:<resource_type>:{<record_id>}` → `partition_id`.
**Key design decisions:**
- **TTL of 14 days** — covers the practical window where Sidekiq jobs referencing a record are still being enqueued. Cold records beyond this window have lower query volume and the cross-partition scan cost is more acceptable.
- **Hash-tag `{record_id}`** — ensures Redis Cluster slot affinity, consistent with existing patterns (e.g. Namespace config cache).
- **`SharedState` not `Cache`** — `Cache` can be evicted under memory pressure; a silent miss would degrade to a full scan rather than fail loudly.
- **Value is just the integer `partition_id`** — minimal memory footprint. At ~50 bytes/key, even storing entries for both pipelines and builds is well within bounds.
- **Write on record creation** — single write point per resource type, e.g. `after_create` callback in `Ci::Partitionable` so all current and future partitioned models get it automatically.
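The creation-time write in the last bullet can be sketched as follows. The method name and the hash standing in for `Gitlab::Redis::SharedState` are illustrative, not the actual `Ci::Partitionable` API — they just keep the logic runnable outside the Rails app:

```ruby
# Illustrative sketch of the write an `after_create` hook in Ci::Partitionable
# would perform. STORE is a plain-hash stand-in for SharedState.
STORE = {}
TTL_SECONDS = 14 * 24 * 60 * 60 # mirrors the proposed 14-day TTL

def cache_partition_id(resource_type, record_id, partition_id)
  # Hash-tag braces pin every key for a record to one Redis Cluster slot
  key = "ci:partition:#{resource_type}:{#{record_id}}"
  STORE[key] = { value: partition_id.to_s, ttl: TTL_SECONDS }
end

cache_partition_id(:pipeline, 42, 101)
# A child build shares its pipeline's partition_id, so it writes the same value
cache_partition_id(:build, 7, 101)
```

In the real callback, `resource_type` would be derived from the model class name, giving every current and future partitioned model the same key pattern for free.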
### Finder patch design
Introduce a generic `Ci::PartitionIdResolver` that resolves `partition_id` for any partitioned CI resource type:
```ruby
module Ci
  module PartitionIdResolver
    REDIS_TTL = 14.days.to_i
    KEY_PREFIX = "ci:partition"

    # resource_type: :pipeline, :build, etc.
    def self.resolve(resource_type, record_id)
      key = "#{KEY_PREFIX}:#{resource_type}:{#{record_id}}"
      cached = Gitlab::Redis::SharedState.with { |r| r.get(key) }
      return cached.to_i if cached

      # Cold path: DB fallback — a cross-partition scan, but only on cache miss
      model = model_for(resource_type)
      record = model.find_by(id: record_id)
      return nil unless record

      # Backfill the cache so the next lookup is a hit
      Gitlab::Redis::SharedState.with { |r| r.set(key, record.partition_id, ex: REDIS_TTL) }
      record.partition_id
    end

    def self.model_for(resource_type)
      { pipeline: Ci::Pipeline, build: Ci::Build }.fetch(resource_type)
    end
  end
end
```
Patch the base finder via a concern included in `Ci::ApplicationRecord` (or `Ci::Partitionable`) so all partitioned models get it automatically:
```ruby
module Ci
  module PartitionAwareFinder
    extend ActiveSupport::Concern

    class_methods do
      # Resolves partition_id from Redis and injects it into the query.
      # resource_type defaults to the model's own type (e.g. :build for Ci::Build),
      # but can be overridden when the caller has a related resource id
      # (e.g. looking up a build by pipeline_id).
      def find_by_id_with_partition(id, resource_type: nil, resource_id: nil)
        resource_type ||= inferred_resource_type
        resource_id ||= id
        partition_id = Ci::PartitionIdResolver.resolve(resource_type, resource_id)

        if partition_id
          find_by(id: id, partition_id: partition_id)
        else
          find_by(id: id) # graceful degradation to an unpruned lookup
        end
      end

      private

      # Ci::Build => :build, Ci::Pipeline => :pipeline
      def inferred_resource_type
        name.demodulize.underscore.to_sym
      end
    end
  end
end
```
This means:
- `Ci::Pipeline.find_by_id_with_partition(pipeline_id)` — resolves via `ci:partition:pipeline:{pipeline_id}`
- `Ci::Build.find_by_id_with_partition(build_id)` — resolves via `ci:partition:build:{build_id}`
- `Ci::Build.find_by_id_with_partition(build_id, resource_type: :pipeline, resource_id: pipeline_id)` — resolves via the pipeline cache when the worker has a `pipeline_id` available (avoids a separate build cache entry)
- Any future partitioned CI model gets the same behaviour automatically by including `Ci::Partitionable`
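The hot-path/cold-path behaviour described above can be exercised end to end with plain-Ruby stand-ins for Redis and the partitioned table (both hashes below are illustrative, not real GitLab objects):

```ruby
# Runnable simulation of the resolver's two paths. CACHE stands in for
# SharedState; DB stands in for the partitioned table.
CACHE = {}
DB = { [:pipeline, 1] => 100, [:build, 9] => 100 } # record -> partition_id

def resolve(resource_type, record_id)
  key = "ci:partition:#{resource_type}:{#{record_id}}"
  cached = CACHE[key]
  return cached.to_i if cached # hot path: no DB access at all

  partition_id = DB[[resource_type, record_id]] # cold path: cross-partition scan
  return nil unless partition_id

  CACHE[key] = partition_id.to_s # backfill so the next lookup is a hit
  partition_id
end

resolve(:pipeline, 1) # cold: reads DB, backfills CACHE
resolve(:pipeline, 1) # hot: served from CACHE
resolve(:build, 404)  # unknown record: nil, finder degrades gracefully
```

The nil return on an unknown record is what lets `find_by_id_with_partition` fall back to today's unpruned `find_by(id: id)` rather than failing.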
### Out of scope for this approach
The `token_encrypted` query (`7090543644733999668`) is a different shape — it is not a lookup by `id` — so the `record_id → partition_id` cache does not directly help. This needs a separate investigation (possibly a `token → partition_id` cache, or fixing the fallback path when a partition is missing).
The `Ci::Deployable` wiring for query `4087446483174191214` is a simpler fix that can be done independently.
### Rollout
1. **Phase 1** — Add the Redis write to `Ci::Partitionable` (covers all current and future partitioned models at once) + deploy `PartitionIdResolver`, behind a feature flag. No worker changes.
2. **Phase 2** — Enable the finder patch for the highest call-rate workers first (those driving the `p_ci_pipelines` and `p_ci_builds` scans).
3. **Phase 3 (long term)** — Gradually migrate worker signatures to pass `partition_id` explicitly; Redis lookup becomes a fast-path confirmation.
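Phase 1's feature-flag guard would look roughly like this; the flag name `ci_partition_id_redis_cache` and the globals are assumptions used only to keep the sketch runnable:

```ruby
# Minimal simulation of the Phase 1 guard: when the flag is off, resolution
# returns nil and the finder behaves exactly as it does today (unpruned).
$flag_enabled = false
$cache = { "ci:partition:pipeline:{1}" => "100" }

def resolve_with_flag(resource_type, record_id)
  return nil unless $flag_enabled # flag off: no behaviour change
  $cache["ci:partition:#{resource_type}:{#{record_id}}"]&.to_i
end

resolve_with_flag(:pipeline, 1) # => nil while disabled (unpruned fallback)
$flag_enabled = true
resolve_with_flag(:pipeline, 1) # => 100 once enabled
```

Because a nil resolution already degrades gracefully in the finder, disabling the flag is a safe rollback at any point.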
## References
- Incident review: https://gitlab.com/gitlab-com/gl-infra/production/-/work_items/21534#note_3158996366
- CI partitioning design doc: https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_data_decay/pipeline_partitioning/
- `Ci::Deployable` unconnected pruning: https://gitlab.com/gitlab-org/gitlab/-/blob/duo-edit-20260311-163221/app/models/concerns/ci/deployable.rb?plain=1#L18
- `Ci::Partitionable` concern: https://gitlab.com/gitlab-org/gitlab/-/blob/920147293ae304639915f66b260dc14e4f629850/app/models/concerns/ci/partitionable.rb