[FF] Cohort 5 ApplicationRateLimiter to labkit migration flags rollout (#600439) · Issues · GitLab.org / GitLab

## Summary

This issue tracks the rollout of the cohort 5 flag pair introduced in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/237240, which routes the three `IncrementResourceUsagePerAction` keys through the labkit adapter in cost-mode (`check(cost:)` via `INCRBYFLOAT`).

The two flags:

- `rate_limiter_use_labkit_cohort_5` opts the cohort into the labkit path (shadow mode).
- `rate_limiter_use_labkit_cohort_5_enforce` lets labkit's decision win over legacy.

Cohort 5 covers three resource-usage rate-limit keys (all CE, all Sidekiq-side):

| Key | Resource key (SafeRequestStore) | Characteristics |
|---|---|---|
| `main_db_duration_limit_per_worker` | `db_main_duration_s` | `[worker_name]` |
| `ci_db_duration_limit_per_worker` | `db_ci_duration_s` | `[worker_name]` |
| `sec_db_duration_limit_per_worker` | `db_sec_duration_s` | `[worker_name]` |

Unlike cohorts 1 to 4, which count requests, cohort 5 accumulates fractional resource usage (database duration in seconds) per worker per minute via `INCRBYFLOAT`. Threshold and interval are sourced per call from `Gitlab::SidekiqLimits.limits_for` (which resolves the worker's urgency rule and any ApplicationSetting override) and forwarded to the labkit Rule via `rule_context`. A single `cost_mode: true` registry flag marks these entries; no `_from_caller` overrides are needed.

See the cohort issue for design context: https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28812.

## Owners

- Most appropriate Slack channel to reach out to: `#proj-ai-to-prod-rate-limits`
- Best individual to reach out to: @mwoolf

## Expectations

### What are we expecting to happen?

In shadow mode (use_labkit on, enforce off) the labkit path runs alongside legacy and increments its own Redis counter by the same DB duration value that legacy applies via `INCRBYFLOAT`. Legacy's decision is what users see.
The Prometheus counter `gitlab_rate_limiter_labkit_shadow_total` records per-key agreement so a shadow run can confirm parity before cutover. Target: < 0.5% divergence excluding 1-second window-boundary noise (`boundary="true"`).

In enforce mode (both flags on) the labkit path's decision blocks Sidekiq workers; the legacy path is skipped entirely. End-user-visible behavior is unchanged across both states when the two paths agree, since these throttles only affect Sidekiq job execution, not direct user requests.

Two semantic details worth highlighting:

1. **Zero-usage short-circuit.** When a worker finishes with zero DB time, the dispatch returns early without creating a labkit counter, matching legacy's `IncrementResourceUsagePerAction#increment` behavior. This keeps the shadow counter clean of synthetic match events from zero-usage ticks.
2. **Caller-supplied threshold and interval.** Cohort 5 keys are not in `Gitlab::ApplicationRateLimiter.rate_limits`. Their threshold and interval flow per call from `SidekiqLimits.limits_for`, which resolves any ApplicationSetting override upstream. `gitlab_rate_limiter_labkit_override_total` MUST stay empty for cohort 5 keys; anything appearing there indicates a dispatch regression (Q3 catches this).

### What can go wrong and how would we detect it?

- Divergence between legacy and labkit decisions above the 0.5% target. Visible in `gitlab_rate_limiter_labkit_shadow_total{agreement="diverge", boundary="false"}` per cohort 5 key (Q2).
- A cohort 5 key appears in `gitlab_rate_limiter_labkit_override_total`. This would indicate the dispatch is treating the caller-supplied threshold/interval as overrides; the keys would silently route to legacy (Q3).
- Zero-usage workers start throttling. Detect via Q2 divergence rising and Q7 (Sidekiq retry/dead rates on workers that previously ran clean).
- Increased Redis latency from the additional round-trip per check.
  Watch `gitlab_redis_client_requests_total` and the Redis-client latency histogram on the rate-limiting instance.
- Throttle thresholds firing earlier or later than expected. Visible as anomalies in Q5.
- Sidekiq workers entering retry loops because they hit the limit, then retry into the same limit. Watch Q7 and `:sidekiq_throttled` rate for the worker classes that historically hit these limits.

## Grafana queries

Run on the Mimir/Prometheus datasource the rate-limiting service writes into. Substitute `$env` with the environment under inspection (`gprd`, `gstg`, etc.).

### Q1: Shadow path is alive (post-enable liveness check)

```promql
sum by (key) (rate(gitlab_rate_limiter_labkit_shadow_total{
  env="$env",
  key=~"(main|ci|sec)_db_duration_limit_per_worker"
}[5m]))
```

Pass: all three keys return non-zero rates within 60 seconds of enabling `rate_limiter_use_labkit_cohort_5`. If a key is missing, either the shadow flag is off, no worker that emits its `db_*_duration_s` SafeRequestStore key has run, or the dispatch isn't wiring the key into labkit.

### Q2: Per-key divergence ratio (shadow window pass criterion)

```promql
sum by (key) (rate(gitlab_rate_limiter_labkit_shadow_total{
  env="$env",
  agreement="diverge",
  boundary="false",
  key=~"(main|ci|sec)_db_duration_limit_per_worker"
}[1h]))
/
sum by (key) (rate(gitlab_rate_limiter_labkit_shadow_total{
  env="$env",
  boundary="false",
  key=~"(main|ci|sec)_db_duration_limit_per_worker"
}[1h]))
```

Pass: each key stays below `0.005` (0.5%) across the shadow window. The `boundary="false"` filter excludes the per-second window-edge skew between legacy's `divmod` `period_key` and labkit's TTL-based window.

### Q3: Override-bypass check (load-bearing for cohort 5)

```promql
sum by (key) (rate(gitlab_rate_limiter_labkit_override_total{
  env="$env",
  key=~"(main|ci|sec)_db_duration_limit_per_worker"
}[5m]))
```

Pass: zero or empty for all three keys.
Any non-zero rate means the dispatch is treating the caller-supplied threshold/interval as overrides and silently routing cohort 5 back to legacy; Q1 will go to zero in that case.

### Q4: Labkit error rate (shadow window)

```promql
sum by (rate_limiter) (rate(gitlab_labkit_rate_limiter_errors_total{
  env="$env",
  rate_limiter=~"applimiter_(main|ci|sec)_db_duration_limit_per_worker"
}[5m]))
/
clamp_min(
  sum by (rate_limiter) (rate(gitlab_labkit_rate_limiter_calls_total{
    env="$env",
    rate_limiter=~"applimiter_(main|ci|sec)_db_duration_limit_per_worker"
  }[5m])),
  0.001
)
```

Pass: < 0.001 (0.1%) sustained over the shadow window.

### Q5: Throttle utilization p99 (pre-enforce baseline, post-enforce check)

```promql
histogram_quantile(0.99,
  sum by (le, throttle_key) (
    rate(gitlab_application_rate_limiter_throttle_utilization_ratio_bucket{
      env="$env",
      throttle_key=~"(main|ci|sec)_db_duration_limit_per_worker"
    }[5m])
  )
)
```

Snapshot the 24-hour pre-flip value as the baseline; after enforce flips, the rolling 24-hour p99 must stay within +/- 10% of that baseline.

### Q6: Labkit block-decision rate (post-enforce sanity)

```promql
sum by (rate_limiter, action) (rate(gitlab_labkit_rate_limiter_calls_total{
  env="$env",
  rate_limiter=~"applimiter_(main|ci|sec)_db_duration_limit_per_worker",
  action=~"block|allow"
}[5m]))
```

Inspect the `action="block"` series. It should appear at roughly the same rate post-enforce as the right-tail mass of Q5 was pre-enforce.

### Q7: Sidekiq retry / dead spike for affected worker classes

```promql
# Retried jobs (run as its own query)
sum by (worker) (rate(sidekiq_jobs_retried_total{
  env="$env",
  job_status="fail"
}[5m]))

# Dead jobs (run as its own query)
sum by (worker) (rate(sidekiq_jobs_dead_total{env="$env"}[5m]))
```

Pass: no sustained increase post-enforce for workers that emit `db_{main,ci,sec}_duration_s` into SafeRequestStore.

## Rollout Steps

Note: Please make sure to run the chatops commands in the Slack channel that gets impacted by the command.
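
The pass criteria attached to Q2, Q4, and Q5 reduce to three small comparisons. The Ruby sketch below is only illustrative: the constant and method names are hypothetical (not part of GitLab or labkit), and the inputs are assumed to be plain numbers already read off the Grafana panels. Returning `0.0` for an idle key in `divergence_ratio` is a choice made for this sketch; the PromQL in Q2 would yield no usable value in that case.

```ruby
# Hypothetical helpers for checking rollout pass criteria by hand.
DIVERGENCE_LIMIT = 0.005 # Q2: 0.5% divergence
ERROR_RATE_LIMIT = 0.001 # Q4: 0.1% labkit error rate
P99_TOLERANCE    = 0.10  # Q5: +/- 10% of the pre-flip baseline

# Q2: diverging decisions over all decisions (both with boundary="false").
def divergence_ratio(diverge_rate, total_rate)
  return 0.0 if total_rate.zero? # sketch-only choice for idle keys

  diverge_rate / total_rate.to_f
end

# Q4: mirrors clamp_min(calls, 0.001) so an idle limiter never divides by zero.
def error_ratio(error_rate, call_rate)
  error_rate / [call_rate, 0.001].max.to_f
end

# Q5: true when the post-flip p99 is within the tolerance band of the baseline.
def p99_within_baseline?(current, baseline)
  (current - baseline).abs <= baseline * P99_TOLERANCE
end
```

With the recorded baseline of roughly 0.248, `p99_within_baseline?(0.26, 0.248)` holds, while a post-flip p99 of 0.28 falls outside the band.
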
### Operating constraints

- **Percentage-based rollouts.** The adapter uses `Feature.current_request` as the actor. In Sidekiq, `SafeRequestStore` is active per-job so the flag resolves consistently within a single job invocation; the UUID resets between jobs. Percentage rollouts therefore mean "each job invocation independently has an X% chance of hitting the labkit path", which gives proportional coverage and safe incremental ramp-up.
- **No surgical kill-switch within the cohort.** If shadow divergence on a single key exceeds tolerance, the mitigation is to drop that key from `SupportedRateLimits` and reship; toggling the flag affects all three keys at once. Per-key visibility remains in `gitlab_rate_limiter_labkit_shadow_total{key}`.
- **Zero-usage short-circuit is load-bearing.** Any change to the `cost_mode` dispatch path must preserve the `return false if check_cost == 0` guard. Removing it would create empty labkit counters for every zero-usage worker tick and pollute the divergence signal.

### Rollout on non-production environments

- [x] Verify https://gitlab.com/gitlab-org/gitlab/-/merge_requests/237240 is merged to `master` and deployed to non-production environments with `/chatops gitlab run auto_deploy status <merge-commit>`.
- [x] Enable shadow at 10% on non-production: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_5 10 --dev --pre --staging --staging-ref`.
- [x] Run Q1 against `gstg`; all three keys must be incrementing.
- [x] Run Q3 against `gstg`; result must be zero or empty (confirms the caller-supplied threshold/interval are not being treated as overrides).
- [x] Ramp to 100% on non-production: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_5 100 --dev --pre --staging --staging-ref`.
- [x] Run the shadow window for 24 hours on staging before promoting to production.
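
The zero-usage guard listed under Operating constraints can be illustrated with a minimal in-memory sketch. This is not the `LabkitAdapter` code: `CostWindowSketch`, its method names, the fixed-window bucketing, and the strict `>` comparison are all assumptions made for illustration, and a Ruby hash stands in for the Redis counter that the real path increments with `INCRBYFLOAT`.

```ruby
# Illustrative only: a hash-backed stand-in for a cost-mode rate-limit counter.
class CostWindowSketch
  def initialize
    @counters = Hash.new(0.0)
  end

  # Returns true when accumulated cost in the current window exceeds threshold.
  # threshold and interval are supplied by the caller on every call.
  def throttled?(key, worker_name, cost:, threshold:, interval:, now: Time.now.to_i)
    # Zero-usage short-circuit: bail out before any counter entry is created.
    return false if cost == 0

    period = now / interval # integer division picks the fixed-window bucket
    counter_key = [key, worker_name, period]
    @counters[counter_key] += cost # fractional seconds accumulate per window
    @counters[counter_key] > threshold
  end

  # Number of counter entries created so far.
  def counter_count
    @counters.size
  end
end
```

Calling `throttled?` with `cost: 0` leaves `counter_count` unchanged, which is the property the guard protects: zero-usage ticks never produce a counter, so they cannot show up as synthetic agreement or divergence events.
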
### Rollout on production

For visibility, all `/chatops` commands that target production must be executed in the [`#production` Slack channel](https://gitlab.slack.com/archives/C101F3796) and cross-posted to the responsible team channel.

#### Shadow rollout

- [x] Enable shadow at 1%: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_5 1`.
- [x] Wait 60+ seconds, then run Q1 against `gprd` to confirm per-key counters are incrementing.
- [x] Run Q3 to confirm no overrides are being recorded.
- [x] Ramp to 10%: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_5 10`.
- [x] Monitor Q2 for 1 hour. Pass: divergence < 0.5% on all three keys.
- [x] Ramp to 50%: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_5 50`.
- [x] Monitor Q2 + Q4 for 6 hours. Pass: divergence < 0.5%, labkit error rate < 0.1%.
- [x] Ramp to 100%: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_5 100`.
- [x] Capture the 24-hour Q5 baseline. This is the reference for the post-enforce comparison. (p99 ≈ 0.248 for all three keys)
- [x] Run the full shadow window for 24 hours at 100%. Pass criteria:
  - **Q2** per-key divergence < 0.5%.
  - **Q3** cohort 5 keys absent from `gitlab_rate_limiter_labkit_override_total`.
  - **Q4** labkit error rate < 0.1%.
  - **Q7** Sidekiq retry/dead rate unchanged.

#### Enforce rollout

- [x] Enable enforce at 1%: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_5_enforce 1`.
- [x] Wait 60+ seconds. Verify legacy Redis counters for cohort 5 keys have started reducing for the 1% slice while labkit counters continue.
- [x] Monitor Q5 + Q7 for 1 hour. Pass: p99 within +/- 10% of pre-flip baseline, no retry/dead spike.
- [x] Ramp enforce to 10%: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_5_enforce 10`.
- [x] Monitor Q5 + Q6 + Q7 for 6 hours.
- [x] Ramp enforce to 50%: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_5_enforce 50`.
- [x] Monitor Q5 + Q6 + Q7 for 6 hours.
- [x] Ramp enforce to 100%: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_5_enforce 100`.
- [ ] Run the enforce window for 24 hours (started ~2026-06-09). Pass criteria:
  - **Q5** post-flip p99 within +/- 10% of the pre-flip baseline.
  - **Q6** `action="block"` rate matches the right-tail mass of the pre-flip Q5 within boundary noise.
  - **Q7** no Sidekiq retry/dead spike on the worker classes that hit these limits in production.
- [x] If any ramp step fails: drop back to the previous percentage and investigate before retrying.

### Preparation before global rollout

- [ ] Set a milestone on this issue once both flags are stable in production at 100%.
- [ ] No external API consumer impact expected. These flags only change the implementation behind `ApplicationRateLimiter#resource_usage_throttled?`; the public Boolean return is unchanged.

### Release the feature

After both cohort 5 flags have been stable in production at 100% for at least one week, open a follow-up cleanup MR to:

- Remove the cohort 5 entries from `SupportedRateLimits` and the `cost_mode` handling in `LabkitAdapter` once all resource-usage callers route exclusively through labkit and the migration scaffolding can be retired.
- Run `/chatops gitlab run feature delete rate_limiter_use_labkit_cohort_5 --dev --pre --staging --staging-ref --production` and similarly for `_enforce`.

## Rollback Steps

For the whole cohort:

- [ ] Reduce enforce percentage or disable: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_5_enforce 0`. Restores legacy enforcement within seconds; shadow stays on so divergence keeps reporting.
- [ ] If shadow itself is the problem: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_5 0`.
- [ ] Verify legacy counters resume incrementing for cohort 5 keys.

For a single misbehaving key (without rolling back the whole cohort):

- [ ] Drop the key's `SupportedRateLimits` entry and reship as a follow-up MR.
- [ ] Confirm the next deploy removes the key from cohort 5 dispatch.

To delete the flags from all environments after rollback:

```
/chatops gitlab run feature delete rate_limiter_use_labkit_cohort_5 --dev --pre --staging --staging-ref --production
/chatops gitlab run feature delete rate_limiter_use_labkit_cohort_5_enforce --dev --pre --staging --staging-ref --production
```

## References

- https://gitlab.com/gitlab-org/gitlab/-/merge_requests/237240 (introducing MR)
- https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28812 (cohort 5 design)
- https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28808 (overarching migration)
- https://gitlab.com/gitlab-org/gitlab/-/work_items/600841 (cohort 6 rollout)
- https://gitlab.com/groups/gitlab-com/gl-infra/-/work_items/2021 (parent epic)
