[FF] Cohort 3 ApplicationRateLimiter to labkit migration flags rollout
## Summary

This issue tracks the rollout of the cohort 3 flag pair introduced in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/235212, which routes the next set of `ApplicationRateLimiter` keys through `Labkit::RateLimit::Limiter`. Cohort 3 is the first cohort whose call graph contains `.peek` (read-without-increment) callers.

The two flags:

- `rate_limiter_use_labkit_cohort_3` opts the cohort into the labkit path (shadow mode).
- `rate_limiter_use_labkit_cohort_3_enforce` lets labkit's decision win over legacy.

Cohort 3 covers six rate-limit keys (four CE + two EE):

| Key | CE/EE | Characteristics |
|---|---|---|
| `glql` | CE | `[query_sha]` |
| `permanent_email_failure` | CE | `[email]` |
| `temporary_email_failure` | CE | `[email]` |
| `update_namespace_name` | CE | `[namespace]` |
| `hard_phone_verification_transactions_limit` | EE | `[scope]` |
| `soft_phone_verification_transactions_limit` | EE | `[scope]` |

`web_hook_calls{,_low,_mid}` is deliberately excluded: every caller passes a `threshold:` override the labkit path cannot honor; it would always route to legacy via `record_override`.

See the cohort issue for design context: https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28810.

## Owners

- Most appropriate Slack channel to reach out to: `#proj-ai-to-prod-rate-limits`
- Best individual to reach out to: @mwoolf

## Expectations

### What are we expecting to happen?

In shadow mode (use_labkit on, enforce off) the labkit path runs alongside legacy and increments its own Redis counter, but legacy's decision is what users see. The Prometheus counter `gitlab_rate_limiter_labkit_shadow_total` records per-key agreement so a 24-hour shadow run can confirm parity before cutover. Target: < 0.5% divergence excluding 1-second window-boundary noise (`boundary="true"`).

In enforce mode (both flags on) the labkit path's decision blocks users; the legacy path is skipped entirely.
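
As a rough illustration of the shadow-mode parity target, the sketch below computes a per-key divergence ratio from two counts and compares it to the 0.5% threshold. It assumes the counts are the 24-hour increases of `gitlab_rate_limiter_labkit_shadow_total` with `boundary="false"`, split by agreement. The function names and sample numbers are hypothetical and are not taken from the GitLab codebase or the Q1–Q7 dashboard queries.

```python
# Hypothetical helpers: names and sample counts are illustrative assumptions,
# not part of the GitLab codebase or the documented Grafana queries.

DIVERGENCE_TARGET = 0.005  # the 0.5% target stated in this issue


def divergence_ratio(agree: int, diverge: int) -> float:
    """Share of non-boundary shadow checks where legacy and labkit disagreed."""
    total = agree + diverge
    if total == 0:
        return 0.0  # no samples for this key, so there is nothing to compare
    return diverge / total


def shadow_passes(agree: int, diverge: int, target: float = DIVERGENCE_TARGET) -> bool:
    """True when the divergence ratio is strictly below the target."""
    return divergence_ratio(agree, diverge) < target


if __name__ == "__main__":
    # Made-up counts for two imaginary keys over a shadow window.
    print(shadow_passes(agree=9960, diverge=40))   # ratio 0.004
    print(shadow_passes(agree=9900, diverge=100))  # ratio 0.01
```

Note that a key with no samples yields a ratio of 0.0 and therefore passes in this sketch; in practice a key with no shadow traffic has not been validated and probably deserves a separate check.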
End-user-visible behavior is unchanged across both states when the two paths agree.

This is the first cohort with peek callers. Peek dispatches through labkit's `Limiter#peek` (no INCR, no TTL extension), so the labkit counter is only advanced by paired non-peek call sites and the two paths remain comparable for shadow validation.

### What can go wrong and how would we detect it?

- Divergence between legacy and labkit decisions above the 0.5% target: visible in `gitlab_rate_limiter_labkit_shadow_total{agreement="diverge", boundary="false"}` on Grafana.
- A cohort 3 key receives an unexpected per-call `threshold:` or `interval:` override: visible in `gitlab_rate_limiter_labkit_override_total{key=...}`. Cohort 3 should be empty in this metric; if anything appears, a caller is bypassing the labkit path silently and the key needs to be unregistered or the override removed.
- Increased Redis latency from the additional round-trip per check. Watch `gitlab_redis_client_requests_total` and the Redis-client latency histogram on the rate-limiting instance.
- Throttle thresholds firing earlier or later than expected: visible as anomalies in `gitlab_application_rate_limiter_throttle_utilization_ratio` for the affected `throttle_key`.
- Identity-verification regressions on `hard_/soft_phone_verification_transactions_limit`: these gate phone-verification flow decisions in EE. Failure modes appear as users misclassified into the high-risk path or as Arkose data-exchange payload differences.

The Grafana go/no-go queries (Q1–Q7) for shadow and enforce phases are documented at https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28810#note_3328172110.

## Rollout Steps

Note: Please make sure to run the chatops commands in the Slack channel that gets impacted by the command.
### Operating constraints

- **Binary on/off, not percentage.** The adapter uses `Feature.current_request` as the actor; for non-request callers it resolves to a per-call UUID and percentage rollouts behave non-deterministically. Toggle fully on or fully off.
- **No surgical kill-switch within the cohort.** If shadow divergence on a single key exceeds tolerance, the mitigation is to drop that key from the registry and reship; toggling the flag affects all six keys at once. Per-key visibility remains in `gitlab_rate_limiter_labkit_shadow_total{key}`.

### Rollout on non-production environments

- [ ] Verify https://gitlab.com/gitlab-org/gitlab/-/merge_requests/235212 is merged to `master` and deployed to non-production environments with `/chatops gitlab run auto_deploy status <merge-commit>`.
- [ ] Enable shadow on non-production: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_3 true --dev --pre --staging --staging-ref`.
- [ ] Verify shadow counters appear in dashboards for the cohort 3 keys (especially `glql` and the EE phone-verification pair, since those exercise the new EE registration and Symbol-scope path).
- [ ] Run the shadow window for 24 hours total on staging-canary before promoting to production shadow.

### Rollout on production

For visibility, all `/chatops` commands that target production must be executed in the [`#production` Slack channel](https://gitlab.slack.com/archives/C101F3796) and cross-posted to the responsible team channel.

- [ ] Enable shadow in production: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_3 true`.
- [ ] Wait 60+ seconds for Flipper L1 cache to propagate to all puma workers, then verify shadow counters incrementing per-key (Q1).
- [ ] Run the shadow window for 24 hours. Pass criteria from the rollout-plan comment:
  - Q2: per-key divergence < 0.5% from `gitlab_rate_limiter_labkit_shadow_total{boundary="false"}`.
  - Q3: no cohort 3 key in `gitlab_rate_limiter_labkit_override_total`.
  - Q4: labkit error rate small relative to calls.
- [ ] If shadow passes: enable enforce with `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_3_enforce true`.
- [ ] Wait 60+ seconds, then verify legacy Redis counters for cohort 3 keys have stopped incrementing while labkit counters continue.
- [ ] Run the enforce window for 24 hours. Pass criteria:
  - Q5: p99 utilization within 10% of pre-flip baseline.
  - Q6: post-enforce labkit block rate matches Phase 1 legacy block rate within boundary noise.
  - Q7: no 429 / 5xx spike on affected feature-category dashboards (notifications, analytics, identity verification, groups_and_projects).
- [ ] If shadow or enforce fails: disable the relevant flag and investigate before retrying.

### Preparation before global rollout

- [ ] Set a milestone on this issue once both flags are stable in production.
- [ ] No external API consumer impact expected (these flags only change the implementation behind `ApplicationRateLimiter#throttled?`; the public Boolean return is unchanged).

### Release the feature

After both cohort 3 flags have been stable in production for at least one week, open a follow-up cleanup MR to:

- Remove the cohort 3 entries from `lib/gitlab/application_rate_limiter.rb`'s legacy `rate_limits` hash once all cohorts have cut over (alongside cohort 1 and 2).
- Remove the cohort 3 dispatch branch from `_throttled?` once all cohorts have cut over.
- Run `/chatops gitlab run feature delete rate_limiter_use_labkit_cohort_3 --dev --pre --staging --staging-ref --production` and similarly for `_enforce`.

## Rollback Steps

For the whole cohort:

- [ ] Disable enforce: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_3_enforce false` (restores legacy enforcement within seconds; shadow stays on so divergence keeps reporting).
- [ ] If shadow itself is the problem: also `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_3 false`.
- [ ] Verify legacy counters resume incrementing for cohort 3 keys.

For a single misbehaving key (without rolling back the whole cohort):

- [ ] Drop the key's `SupportedRateLimits` entry and reship as a follow-up MR.
- [ ] Confirm the next deploy removes the key from cohort 3 dispatch.

To delete the flags from all environments after rollback:

```
/chatops gitlab run feature delete rate_limiter_use_labkit_cohort_3 --dev --pre --staging --staging-ref --production
/chatops gitlab run feature delete rate_limiter_use_labkit_cohort_3_enforce --dev --pre --staging --staging-ref --production
```

## References

- https://gitlab.com/gitlab-org/gitlab/-/merge_requests/235212
- https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28810
- https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28808
- https://gitlab.com/gitlab-org/gitlab/-/work_items/598560 (cohort 1 & 2 rollout)
- https://gitlab.com/groups/gitlab-com/gl-infra/-/work_items/2021 (parent epic)