[FF] Cohort 5 ApplicationRateLimiter to labkit migration flags rollout
## Summary
This issue tracks the rollout of the cohort 5 flag pair introduced in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/237240, which routes the three `IncrementResourceUsagePerAction` keys through the labkit adapter in cost-mode (`check(cost:)` via `INCRBYFLOAT`).
The two flags:
- `rate_limiter_use_labkit_cohort_5` opts the cohort into the labkit path (shadow mode).
- `rate_limiter_use_labkit_cohort_5_enforce` lets labkit's decision win over legacy.
Cohort 5 covers three resource-usage rate-limit keys (all CE, all Sidekiq-side):
| Key | Resource key (SafeRequestStore) | Characteristics |
|---|---|---|
| `main_db_duration_limit_per_worker` | `db_main_duration_s` | `[worker_name]` |
| `ci_db_duration_limit_per_worker` | `db_ci_duration_s` | `[worker_name]` |
| `sec_db_duration_limit_per_worker` | `db_sec_duration_s` | `[worker_name]` |
Unlike cohorts 1 to 4, which count requests, cohort 5 accumulates fractional resource usage (database duration in seconds) per worker per minute via `INCRBYFLOAT`. Threshold and interval are sourced per call from `Gitlab::SidekiqLimits.limits_for` (which resolves the worker's urgency rule and any ApplicationSetting override) and forwarded to the labkit Rule via `rule_context`. A single `cost_mode: true` registry flag marks these entries; no `_from_caller` overrides are needed.
See the cohort issue for design context: https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28812.
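To illustrate the cost-mode accumulation described above, here is a minimal in-memory sketch. `CostWindowCounter` and its `check` signature are hypothetical names, not the labkit API; a plain Hash stands in for Redis `INCRBYFLOAT`, and the threshold, interval, and worker name are illustrative values rather than real `SidekiqLimits` output.

```ruby
# Hypothetical sketch: accumulates fractional cost per (key, worker) in a
# fixed window. A Hash stands in for Redis INCRBYFLOAT; this is not labkit code.
class CostWindowCounter
  def initialize
    @totals = Hash.new(0.0)
  end

  # Adds `cost` (seconds of DB time) to the window that `now` falls in and
  # returns true once the accumulated total exceeds `threshold`.
  def check(key:, worker_name:, cost:, threshold:, interval:, now: Time.now.to_i)
    period = now / interval # integer division selects the window
    bucket = [key, worker_name, period]
    @totals[bucket] += cost
    @totals[bucket] > threshold
  end
end

counter = CostWindowCounter.new
args = { key: :main_db_duration_limit_per_worker, worker_name: 'ExampleWorker',
         threshold: 1.0, interval: 60 } # illustrative values only

first  = counter.check(**args, cost: 0.4, now: 120) # window total 0.4
second = counter.check(**args, cost: 0.7, now: 150) # same window, total ~1.1
third  = counter.check(**args, cost: 0.2, now: 180) # next window, total 0.2
```

In this sketch only the second call reports a throttle, because the first two costs land in the same 60-second window and together exceed the illustrative threshold.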
## Owners
- Most appropriate Slack channel to reach out to: `#proj-ai-to-prod-rate-limits`
- Best individual to reach out to: @mwoolf
## Expectations
### What are we expecting to happen?
In shadow mode (use_labkit on, enforce off) the labkit path runs alongside legacy and increments its own Redis counter by the same DB duration value that legacy applies via `INCRBYFLOAT`. Legacy's decision is what users see. The Prometheus counter `gitlab_rate_limiter_labkit_shadow_total` records per-key agreement so a shadow run can confirm parity before cutover. Target: < 0.5% divergence excluding 1-second window-boundary noise (`boundary="true"`).
In enforce mode (both flags on) the labkit path's decision blocks Sidekiq workers; the legacy path is skipped entirely. End-user-visible behavior is unchanged across both states when the two paths agree, since these throttles only affect Sidekiq job execution, not direct user requests.
Two semantic details are worth highlighting:
1. **Zero-usage short-circuit.** When a worker finishes with zero DB time, the dispatch returns early without creating a labkit counter, matching legacy's `IncrementResourceUsagePerAction#increment` behavior. This keeps the shadow counter clean of synthetic match events from zero-usage ticks.
2. **Caller-supplied threshold and interval.** Cohort 5 keys are not in `Gitlab::ApplicationRateLimiter.rate_limits`. Their threshold and interval flow per call from `SidekiqLimits.limits_for`, which resolves any ApplicationSetting override upstream. `gitlab_rate_limiter_labkit_override_total` MUST stay empty for cohort 5 keys; anything appearing there indicates a dispatch regression (Q3 catches this).
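The shadow, enforce, and zero-usage behaviors described above can be sketched as a single dispatch function. This is a hedged illustration, not the adapter's actual code: `sketch_resource_usage_throttled?`, the `legacy` and `labkit` callables, the `:match` symbol, and the thresholds inside the lambdas are all hypothetical stand-ins.

```ruby
# Hypothetical sketch of the flag-driven dispatch; not the actual adapter code.
# `legacy` and `labkit` are callables that return true when the job is throttled.
def sketch_resource_usage_throttled?(check_cost:, use_labkit:, enforce:, legacy:, labkit:, shadow_log: [])
  # Zero-usage short-circuit: return before any counter is created.
  return false if check_cost == 0

  return legacy.call(check_cost) unless use_labkit

  if enforce
    # Enforce mode: labkit decides and the legacy path is skipped.
    labkit.call(check_cost)
  else
    # Shadow mode: both paths run, agreement is recorded, legacy wins.
    legacy_decision = legacy.call(check_cost)
    labkit_decision = labkit.call(check_cost)
    shadow_log << (legacy_decision == labkit_decision ? :match : :diverge)
    legacy_decision
  end
end

calls = []
legacy = ->(cost) { calls << :legacy; cost > 1.0 } # illustrative thresholds
labkit = ->(cost) { calls << :labkit; cost > 0.5 }
log = []

shadow  = sketch_resource_usage_throttled?(check_cost: 0.8, use_labkit: true, enforce: false,
                                           legacy: legacy, labkit: labkit, shadow_log: log)
enforce = sketch_resource_usage_throttled?(check_cost: 0.8, use_labkit: true, enforce: true,
                                           legacy: legacy, labkit: labkit, shadow_log: log)
zero    = sketch_resource_usage_throttled?(check_cost: 0.0, use_labkit: true, enforce: true,
                                           legacy: legacy, labkit: labkit, shadow_log: log)
```

The lambdas deliberately disagree at a cost of 0.8 so that the shadow call records a divergence while still returning legacy's decision; the zero-cost call touches neither path.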
### What can go wrong and how would we detect it?
- Divergence between legacy and labkit decisions above the 0.5% target. Visible in `gitlab_rate_limiter_labkit_shadow_total{agreement="diverge", boundary="false"}` per cohort 5 key (Q2).
- A cohort 5 key appears in `gitlab_rate_limiter_labkit_override_total`. This would indicate the dispatch is treating the caller-supplied threshold/interval as overrides; the keys would silently route to legacy (Q3).
- Zero-usage workers start throttling. Detect via Q2 divergence rising and Q7 (Sidekiq retry/dead rates on workers that previously ran clean).
- Increased Redis latency from the additional round-trip per check. Watch `gitlab_redis_client_requests_total` and the Redis-client latency histogram on the rate-limiting instance.
- Throttle thresholds firing earlier or later than expected. Visible as anomalies in Q5.
- Sidekiq workers entering retry loops because they hit the limit, then retry into the same limit. Watch Q7 and `:sidekiq_throttled` rate for the worker classes that historically hit these limits.
## Grafana queries
Run on the Mimir/Prometheus datasource the rate-limiting service writes into. Substitute `$env` with the environment under inspection (`gprd`, `gstg`, etc.).
### Q1: Shadow path is alive (post-enable liveness check)
```promql
sum by (key) (rate(gitlab_rate_limiter_labkit_shadow_total{
env="$env",
key=~"(main|ci|sec)_db_duration_limit_per_worker"
}[5m]))
```
Pass: all three keys return non-zero rates within 60 seconds of enabling `rate_limiter_use_labkit_cohort_5`. If a key is missing, either the shadow flag is off, no worker that emits its `db_*_duration_s` SafeRequestStore key has run, or the dispatch isn't wiring the key into labkit.
### Q2: Per-key divergence ratio (shadow window pass criterion)
```promql
sum by (key) (rate(gitlab_rate_limiter_labkit_shadow_total{
env="$env",
agreement="diverge",
boundary="false",
key=~"(main|ci|sec)_db_duration_limit_per_worker"
}[1h]))
/
sum by (key) (rate(gitlab_rate_limiter_labkit_shadow_total{
env="$env",
boundary="false",
key=~"(main|ci|sec)_db_duration_limit_per_worker"
}[1h]))
```
Pass: each key stays below `0.005` (0.5%) across the shadow window. The `boundary="false"` filter excludes the per-second window-edge skew between legacy's `divmod` `period_key` and labkit's TTL-based window.
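As a worked example of the Q2 arithmetic, the sketch below computes the divergence ratio from a handful of made-up counter samples. The `Sample` struct, the counts, and the `"match"` label value are assumptions for illustration; only `agreement="diverge"` and the `boundary` label come from the queries above.

```ruby
# Hypothetical samples mirroring the labels used in Q2; the counts are made up.
Sample = Struct.new(:key, :agreement, :boundary, :count)

# Ratio of diverging events to all events for one key, ignoring boundary noise.
# Returns 0.0 when there is no traffic (PromQL would return no data instead).
def divergence_ratio(samples, key)
  in_scope = samples.select { |s| s.key == key && s.boundary == 'false' }
  total = in_scope.sum(&:count)
  return 0.0 if total.zero?

  diverged = in_scope.select { |s| s.agreement == 'diverge' }.sum(&:count)
  diverged.to_f / total
end

key = 'main_db_duration_limit_per_worker'
samples = [
  Sample.new(key, 'match',   'false', 9_970),
  Sample.new(key, 'diverge', 'false', 30),
  Sample.new(key, 'diverge', 'true',  200) # window-edge noise, excluded
]

ratio  = divergence_ratio(samples, key) # 30 / 10_000
passes = ratio < 0.005
```

With these numbers the key passes at 0.3%. Counting the 200 boundary events as well would give 230 / 10,200, roughly 2.3%, which shows why the `boundary="false"` filter matters for the pass criterion.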
### Q3: Override-bypass check (load-bearing for cohort 5)
```promql
sum by (key) (rate(gitlab_rate_limiter_labkit_override_total{
env="$env",
key=~"(main|ci|sec)_db_duration_limit_per_worker"
}[5m]))
```
Pass: zero or empty for all three keys. Any non-zero rate means the dispatch is treating the caller-supplied threshold/interval as overrides and silently routing cohort 5 back to legacy; Q1 will go to zero in that case.
### Q4: Labkit error rate (shadow window)
```promql
sum by (rate_limiter) (rate(gitlab_labkit_rate_limiter_errors_total{
env="$env",
rate_limiter=~"applimiter_(main|ci|sec)_db_duration_limit_per_worker"
}[5m]))
/
clamp_min(
sum by (rate_limiter) (rate(gitlab_labkit_rate_limiter_calls_total{
env="$env",
rate_limiter=~"applimiter_(main|ci|sec)_db_duration_limit_per_worker"
}[5m])),
0.001
)
```
Pass: < 0.001 (0.1%) sustained over the shadow window.
### Q5: Throttle utilization p99 (pre-enforce baseline, post-enforce check)
```promql
histogram_quantile(0.99,
sum by (le, throttle_key) (
rate(gitlab_application_rate_limiter_throttle_utilization_ratio_bucket{
env="$env",
throttle_key=~"(main|ci|sec)_db_duration_limit_per_worker"
}[5m])
)
)
```
Snapshot the 24-hour pre-flip value as the baseline; after enforce flips, the rolling 24-hour p99 must stay within +/- 10% of that baseline.
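The +/- 10% criterion reduces to a simple band check. The helper below is a hypothetical sketch (the name `within_baseline?` is not from the codebase); the 0.248 baseline is the p99 recorded during the shadow rollout, and the observed values are illustrative.

```ruby
# Hypothetical helper: true when `observed` lies within +/- `tolerance`
# (as a fraction of the baseline) of `baseline`.
def within_baseline?(baseline, observed, tolerance: 0.10)
  (observed - baseline).abs <= baseline * tolerance
end

baseline = 0.248 # pre-flip p99 captured for all three keys

ok_high  = within_baseline?(baseline, 0.27) # inside the band
too_high = within_baseline?(baseline, 0.28) # above baseline * 1.10
too_low  = within_baseline?(baseline, 0.22) # below baseline * 0.90
```

For a 0.248 baseline the acceptable band is approximately 0.223 to 0.273.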
### Q6: Labkit block-decision rate (post-enforce sanity)
```promql
sum by (rate_limiter, action) (rate(gitlab_labkit_rate_limiter_calls_total{
env="$env",
rate_limiter=~"applimiter_(main|ci|sec)_db_duration_limit_per_worker",
action=~"block|allow"
}[5m]))
```
Inspect the `action="block"` series. Post-enforce, its rate should roughly match the right-tail mass of Q5 observed pre-enforce.
### Q7: Sidekiq retry / dead spike for affected worker classes
```promql
# Retried jobs per worker
sum by (worker) (rate(sidekiq_jobs_retried_total{
env="$env",
job_status="fail"
}[5m]))

# Dead jobs per worker (run as a separate query)
sum by (worker) (rate(sidekiq_jobs_dead_total{env="$env"}[5m]))
```
Pass: no sustained increase post-enforce for workers that emit `db_{main,ci,sec}_duration_s` into SafeRequestStore.
## Rollout Steps
Note: Please make sure to run the chatops commands in the Slack channel that gets impacted by the command.
### Operating constraints
- **Percentage-based rollouts.** The adapter uses `Feature.current_request` as the actor. In Sidekiq, `SafeRequestStore` is active per-job so the flag resolves consistently within a single job invocation; the UUID resets between jobs. Percentage rollouts therefore mean "each job invocation independently has an X% chance of hitting the labkit path", which gives proportional coverage and safe incremental ramp-up.
- **No surgical kill-switch within the cohort.** If shadow divergence on a single key exceeds tolerance, the mitigation is to drop that key from `SupportedRateLimits` and reship; toggling the flag affects all three keys at once. Per-key visibility remains in `gitlab_rate_limiter_labkit_shadow_total{key}`.
- **Zero-usage short-circuit is load-bearing.** Any change to the `cost_mode` dispatch path must preserve the `return false if check_cost == 0` guard. Removing it would create empty labkit counters for every zero-usage worker tick and pollute the divergence signal.
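The per-job percentage behavior described above can be sketched with a hash-based gate. This is an assumption-laden illustration, not the `Feature` implementation: `enabled_for_actor?` and `job_uses_labkit?` are hypothetical helpers, and hashing with CRC32 modulo 100 is a simplified stand-in for however the real flag backend buckets actors.

```ruby
require 'zlib'
require 'securerandom'

# Hypothetical percentage gate: hashes the flag name plus actor id into 0..99
# and enables the flag when the bucket is below `percentage`.
def enabled_for_actor?(flag, actor_id, percentage)
  Zlib.crc32("#{flag}:#{actor_id}") % 100 < percentage
end

# Each job gets a fresh id, mimicking a per-job request UUID that resets
# between jobs but stays stable within one invocation.
def job_uses_labkit?(percentage, job_id: SecureRandom.uuid)
  enabled_for_actor?('rate_limiter_use_labkit_cohort_5', job_id, percentage)
end

job_id = SecureRandom.uuid
first_check  = job_uses_labkit?(10, job_id: job_id)
second_check = job_uses_labkit?(10, job_id: job_id) # same job, same answer
```

Because the decision depends only on the flag name and the job's id, repeated checks inside one job agree, while separate jobs are bucketed independently. A job enabled at a lower percentage stays enabled as the percentage ramps up.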
### Rollout on non-production environments
- [x] Verify https://gitlab.com/gitlab-org/gitlab/-/merge_requests/237240 is merged to `master` and deployed to non-production environments with `/chatops gitlab run auto_deploy status <merge-commit>`.
- [x] Enable shadow at 10% on non-production: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_5 10 --dev --pre --staging --staging-ref`.
- [x] Run Q1 against `gstg`; all three keys must be incrementing.
- [x] Run Q3 against `gstg`; result must be zero or empty (confirms the caller-supplied threshold/interval are not being treated as overrides).
- [x] Ramp to 100% on non-production: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_5 100 --dev --pre --staging --staging-ref`.
- [x] Run the shadow window for 24 hours on staging before promoting to production.
### Rollout on production
For visibility, all `/chatops` commands that target production must be executed in the [`#production` Slack channel](https://gitlab.slack.com/archives/C101F3796) and cross-posted to the responsible team channel.
#### Shadow rollout
- [x] Enable shadow at 1%: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_5 1`.
- [x] Wait 60+ seconds, then run Q1 against `gprd` to confirm per-key counters are incrementing.
- [x] Run Q3 to confirm no overrides are being recorded.
- [x] Ramp to 10%: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_5 10`.
- [x] Monitor Q2 for 1 hour. Pass: divergence < 0.5% on all three keys.
- [x] Ramp to 50%: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_5 50`.
- [x] Monitor Q2 + Q4 for 6 hours. Pass: divergence < 0.5%, labkit error rate < 0.1%.
- [x] Ramp to 100%: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_5 100`.
- [x] Capture the 24-hour Q5 baseline. This is the reference for the post-enforce comparison. (p99 ≈ 0.248 for all three keys)
- [x] Run the full shadow window for 24 hours at 100%. Pass criteria:
  - **Q2** per-key divergence < 0.5%.
  - **Q3** cohort 5 keys absent from `gitlab_rate_limiter_labkit_override_total`.
  - **Q4** labkit error rate < 0.1%.
  - **Q7** Sidekiq retry/dead rate unchanged.
#### Enforce rollout
- [x] Enable enforce at 1%: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_5_enforce 1`.
- [x] Wait 60+ seconds. Verify that legacy Redis counter writes for cohort 5 keys have started to decline for the 1% slice while labkit counters continue to increment.
- [x] Monitor Q5 + Q7 for 1 hour. Pass: p99 within +/- 10% of pre-flip baseline, no retry/dead spike.
- [x] Ramp enforce to 10%: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_5_enforce 10`.
- [x] Monitor Q5 + Q6 + Q7 for 6 hours.
- [x] Ramp enforce to 50%: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_5_enforce 50`.
- [x] Monitor Q5 + Q6 + Q7 for 6 hours.
- [x] Ramp enforce to 100%: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_5_enforce 100`.
- [ ] Run the enforce window for 24 hours (started ~2026-06-09). Pass criteria:
  - **Q5** post-flip p99 within +/- 10% of the pre-flip baseline.
  - **Q6** `action="block"` rate matches the right-tail mass of the pre-flip Q5 within boundary noise.
  - **Q7** no Sidekiq retry/dead spike on the worker classes that hit these limits in production.
- [x] If any ramp step fails: drop back to the previous percentage and investigate before retrying.
### Preparation before global rollout
- [ ] Set a milestone on this issue once both flags are stable in production at 100%.
- [ ] No external API consumer impact expected. These flags only change the implementation behind `ApplicationRateLimiter#resource_usage_throttled?`; the public Boolean return is unchanged.
### Release the feature
After both cohort 5 flags have been stable in production at 100% for at least one week, open a follow-up cleanup MR to:
- Remove the cohort 5 entries from `SupportedRateLimits` and the `cost_mode` handling in `LabkitAdapter` once all resource-usage callers route exclusively through labkit and the migration scaffolding can be retired.
- Run `/chatops gitlab run feature delete rate_limiter_use_labkit_cohort_5 --dev --pre --staging --staging-ref --production` and similarly for `_enforce`.
## Rollback Steps
For the whole cohort:
- [ ] Reduce enforce percentage or disable: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_5_enforce 0`. Restores legacy enforcement within seconds; shadow stays on so divergence keeps reporting.
- [ ] If shadow itself is the problem: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_5 0`.
- [ ] Verify legacy counters resume incrementing for cohort 5 keys.
For a single misbehaving key (without rolling back the whole cohort):
- [ ] Drop the key's `SupportedRateLimits` entry and reship as a follow-up MR.
- [ ] Confirm the next deploy removes the key from cohort 5 dispatch.
To delete the flags from all environments after rollback:
```
/chatops gitlab run feature delete rate_limiter_use_labkit_cohort_5 --dev --pre --staging --staging-ref --production
/chatops gitlab run feature delete rate_limiter_use_labkit_cohort_5_enforce --dev --pre --staging --staging-ref --production
```
## References
- https://gitlab.com/gitlab-org/gitlab/-/merge_requests/237240 (introducing MR)
- https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28812 (cohort 5 design)
- https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28808 (overarching migration)
- https://gitlab.com/gitlab-org/gitlab/-/work_items/600841 (cohort 6 rollout)
- https://gitlab.com/groups/gitlab-com/gl-infra/-/work_items/2021 (parent epic)