[FF] Cohort 3 ApplicationRateLimiter to labkit migration flags rollout
## Summary
This issue tracks the rollout of the cohort 3 flag pair introduced in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/235212, which routes the next set of `ApplicationRateLimiter` keys through `Labkit::RateLimit::Limiter`. Cohort 3 is the first cohort whose call graph contains `.peek` (read-without-increment) callers.
The two flags:
- `rate_limiter_use_labkit_cohort_3` opts the cohort into the labkit path (shadow mode).
- `rate_limiter_use_labkit_cohort_3_enforce` lets labkit's decision win over legacy.
Cohort 3 covers six rate-limit keys (four CE + two EE):
| Key | CE/EE | Characteristics |
|---|---|---|
| `glql` | CE | `[query_sha]` |
| `permanent_email_failure` | CE | `[email]` |
| `temporary_email_failure` | CE | `[email]` |
| `update_namespace_name` | CE | `[namespace]` |
| `hard_phone_verification_transactions_limit` | EE | `[scope]` |
| `soft_phone_verification_transactions_limit` | EE | `[scope]` |
The `web_hook_calls`, `web_hook_calls_low`, and `web_hook_calls_mid` keys are deliberately excluded: every caller passes a `threshold:` override that the labkit path cannot honor, so these keys would always route to legacy via `record_override`.
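The exclusion rule can be pictured with a minimal sketch. This is an illustration under assumptions, not the adapter's code: `route`, `COHORT_3_KEYS`, `OVERRIDES`, and this shape of `record_override` are hypothetical stand-ins. Only the idea that a per-call override forces the legacy path and is recorded comes from the description above.

```ruby
# Hypothetical sketch: decide whether a check goes to :labkit or :legacy.
# The key list mirrors the cohort 3 table; everything else is illustrative.
COHORT_3_KEYS = %i[
  glql permanent_email_failure temporary_email_failure update_namespace_name
  hard_phone_verification_transactions_limit
  soft_phone_verification_transactions_limit
].freeze

# Stand-in for the override metric: counts overrides per key.
OVERRIDES = Hash.new(0)

def record_override(key)
  OVERRIDES[key] += 1
end

def route(key, threshold: nil, interval: nil)
  # Keys outside the cohort never reach the labkit path.
  return :legacy unless COHORT_3_KEYS.include?(key)

  if threshold || interval
    # A per-call override cannot be honored on the labkit path,
    # so record it and fall back to legacy.
    record_override(key)
    return :legacy
  end

  :labkit
end
```

In this sketch a key whose callers always pass `threshold:` would never return `:labkit`, which is why registering such a key would add override noise without migrating any traffic.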
See the cohort issue for design context: https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28810.
## Owners
- Most appropriate Slack channel to reach out to: `#proj-ai-to-prod-rate-limits`
- Best individual to reach out to: @mwoolf
## Expectations
### What are we expecting to happen?
In shadow mode (`rate_limiter_use_labkit_cohort_3` on, `rate_limiter_use_labkit_cohort_3_enforce` off) the labkit path runs alongside legacy and increments its own Redis counter, but legacy's decision is what users see. The Prometheus counter `gitlab_rate_limiter_labkit_shadow_total` records per-key agreement so that a 24-hour shadow run can confirm parity before cutover. Target: < 0.5% divergence per key, excluding 1-second window-boundary noise (`boundary="true"`).
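As a worked example of the divergence target, the following sketch computes a per-key divergence ratio from counter increases over a shadow window. The `Sample` struct, the sample numbers, and the `"agree"` label value are assumptions made for illustration; the real figures come from the Grafana queries, not from code like this.

```ruby
# Hypothetical samples: counter increase over the shadow window, one per label set.
Sample = Struct.new(:key, :agreement, :boundary, :count)

TARGET = 0.005 # the 0.5% divergence target

# Per-key divergence ratio, ignoring window-boundary samples (boundary="true").
def divergence_by_key(samples)
  samples
    .reject { |s| s.boundary == "true" }
    .group_by(&:key)
    .transform_values do |rows|
      total = rows.sum(&:count)
      diverged = rows.select { |s| s.agreement == "diverge" }.sum(&:count)
      total.zero? ? 0.0 : diverged.to_f / total
    end
end

samples = [
  Sample.new("glql", "agree", "false", 9_980),
  Sample.new("glql", "diverge", "false", 20),
  Sample.new("glql", "diverge", "true", 300), # boundary noise, excluded
  Sample.new("update_namespace_name", "agree", "false", 990),
  Sample.new("update_namespace_name", "diverge", "false", 10)
]

divergence_by_key(samples).each do |key, ratio|
  status = ratio < TARGET ? "pass" : "fail"
  puts format("%-24s %.4f %s", key, ratio, status)
end
```

With these made-up numbers `glql` diverges on 20 of 10,000 non-boundary checks (0.2%, inside the target), while `update_namespace_name` diverges on 10 of 1,000 (1%, outside it). The 300 boundary samples do not affect either ratio.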
In enforce mode (both flags on) the labkit path's decision blocks users; the legacy path is skipped entirely. End-user-visible behavior is unchanged across both states when the two paths agree.
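The three flag states can be summarized in a small sketch. It is a simplified model, not the code from the merge request: `decide`, its keyword arguments, and the `on_compare` callback are hypothetical, and it assumes the enforce flag has no effect unless the use-labkit flag is also on.

```ruby
# Hypothetical model of how the two flags combine. `legacy` and `labkit` are
# callables that return true when the request should be throttled.
def decide(use_labkit:, enforce:, legacy:, labkit:, on_compare: nil)
  # Both flags off (or enforce on alone, by assumption): legacy only.
  return legacy.call unless use_labkit

  # Enforce mode: labkit decides and legacy is skipped.
  return labkit.call if enforce

  # Shadow mode: both run, agreement is reported, legacy's decision wins.
  legacy_result = legacy.call
  labkit_result = labkit.call
  on_compare&.call(legacy_result == labkit_result)
  legacy_result
end

# Usage: a disagreement in shadow mode is reported but does not change the outcome.
agreements = []
result = decide(
  use_labkit: true, enforce: false,
  legacy: -> { false }, labkit: -> { true },
  on_compare: ->(agree) { agreements << agree }
)
puts "shadow result=#{result} agreements=#{agreements.inspect}"
```

In the usage example the caller sees legacy's `false` even though the labkit callable returned `true`; the mismatch only surfaces through the comparison callback, which plays the role of the shadow counter.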
This is the first cohort with peek callers. Peek dispatches through labkit's `Limiter#peek` (no INCR, no TTL extension), so the labkit counter is only advanced by paired non-peek call sites and the two paths remain comparable for shadow validation.
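To make the peek semantics concrete, here is a minimal in-memory fixed-window limiter. `SketchLimiter` is a hypothetical stand-in for a Redis-backed limiter and does not reproduce labkit's `Limiter` API; the "count greater than threshold" rule is also an assumption. It only shows the property described above: a peek neither counts the call nor extends the window.

```ruby
# Hypothetical in-memory fixed-window limiter; stands in for a Redis-backed one.
class SketchLimiter
  def initialize(threshold:, interval:, clock: -> { Time.now.to_f })
    @threshold = threshold
    @interval = interval
    @clock = clock
    @counters = {} # key => [count, expires_at]
  end

  # Counts the call and reports whether the key is over its threshold.
  def throttled?(key)
    count, expires_at = live_entry(key)
    expires_at ||= @clock.call + @interval # expiry is set once per window
    count += 1
    @counters[key] = [count, expires_at]
    count > @threshold
  end

  # Reads the current state without counting the call or touching the expiry.
  def peek(key)
    count, = live_entry(key)
    count > @threshold
  end

  private

  # Returns [count, expires_at], treating an expired entry as absent.
  def live_entry(key)
    count, expires_at = @counters[key]
    return [0, nil] if expires_at.nil? || @clock.call >= expires_at

    [count, expires_at]
  end
end
```

Because `peek` leaves the stored entry untouched, any number of peeks between two counted calls leaves the counter where the last counted call put it, which is what keeps the legacy and labkit counters comparable during shadow validation.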
### What can go wrong and how would we detect it?
- Divergence between legacy and labkit decisions above the 0.5% target — visible in `gitlab_rate_limiter_labkit_shadow_total{agreement="diverge", boundary="false"}` on Grafana.
- A cohort 3 key receives an unexpected per-call `threshold:` or `interval:` override — visible in `gitlab_rate_limiter_labkit_override_total{key=...}`. Cohort 3 should be empty in this metric; if anything appears, a caller is bypassing the labkit path silently and the key needs to be unregistered or the override removed.
- Increased Redis latency from the additional round-trip per check. Watch `gitlab_redis_client_requests_total` and the Redis-client latency histogram on the rate-limiting instance.
- Throttle thresholds firing earlier or later than expected — visible as anomalies in `gitlab_application_rate_limiter_throttle_utilization_ratio` for the affected `throttle_key`.
- Identity-verification regressions on `hard_phone_verification_transactions_limit` and `soft_phone_verification_transactions_limit`. These keys gate phone-verification flow decisions in EE. Failure modes appear as users misclassified into the high-risk path or as Arkose data-exchange payload differences.
The Grafana go/no-go queries (Q1–Q7) for shadow and enforce phases are documented at https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28810#note_3328172110.
## Rollout Steps
Note: Please make sure to run the chatops commands in the Slack channel that gets impacted by the command.
### Operating constraints
- **Binary on/off, not percentage.** The adapter uses `Feature.current_request` as the actor. For non-request callers the actor resolves to a per-call UUID, so a percentage rollout would give a different answer on each call. Toggle fully on or fully off.
- **No surgical kill-switch within the cohort.** If shadow divergence on a single key exceeds tolerance, the mitigation is to drop that key from the registry and reship; toggling the flag affects all six keys at once. Per-key visibility remains in `gitlab_rate_limiter_labkit_shadow_total{key}`.
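The binary on/off constraint can be illustrated with a simplified percentage gate. The bucket function below is an assumption made for illustration (a CRC32 of flag name plus actor ID, reduced to 100 buckets) and is not presented as Flipper's exact implementation; `enabled_for?` is a hypothetical helper.

```ruby
require "securerandom"
require "zlib"

# Simplified percentage-of-actors gate: hash the actor ID into a 0..99 bucket.
def enabled_for?(flag, actor_id, percentage)
  Zlib.crc32("#{flag}#{actor_id}") % 100 < percentage
end

flag = "rate_limiter_use_labkit_cohort_3"

# A stable actor always lands in the same bucket, so the answer never changes.
stable = Array.new(1_000) { enabled_for?(flag, "request-123", 50) }.uniq

# A fresh UUID per call lands in an effectively random bucket each time.
per_call = Array.new(1_000) { enabled_for?(flag, SecureRandom.uuid, 50) }.uniq

puts "stable actor answers:  #{stable.inspect}"
puts "per-call UUID answers: #{per_call.inspect}"
```

Under this model a stable actor yields a single distinct answer, while per-call UUIDs yield both `true` and `false` at 50%. At 0% or 100% the bucket no longer matters, which is why a full toggle is the only deterministic setting for non-request callers.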
### Rollout on non-production environments
- [ ] Verify https://gitlab.com/gitlab-org/gitlab/-/merge_requests/235212 is merged to `master` and deployed to non-production environments with `/chatops gitlab run auto_deploy status <merge-commit>`.
- [ ] Enable shadow on non-production: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_3 true --dev --pre --staging --staging-ref`.
- [ ] Verify shadow counters appear in dashboards for the cohort 3 keys (especially `glql` and the EE phone-verification pair, since those exercise the new EE registration and Symbol-scope path).
- [ ] Run the shadow window for 24 hours total on staging-canary before promoting to production shadow.
### Rollout on production
For visibility, all `/chatops` commands that target production must be executed in the [`#production` Slack channel](https://gitlab.slack.com/archives/C101F3796) and cross-posted to the responsible team channel.
- [ ] Enable shadow in production: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_3 true`.
- [ ] Wait 60+ seconds for Flipper L1 cache to propagate to all puma workers, then verify shadow counters incrementing per-key (Q1).
- [ ] Run the shadow window for 24 hours. Pass criteria from the rollout-plan comment:
  - Q2: per-key divergence < 0.5% from `gitlab_rate_limiter_labkit_shadow_total{boundary="false"}`.
  - Q3: no cohort 3 key in `gitlab_rate_limiter_labkit_override_total`.
  - Q4: labkit error rate small relative to calls.
- [ ] If shadow passes: enable enforce with `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_3_enforce true`.
- [ ] Wait 60+ seconds, then verify legacy Redis counters for cohort 3 keys have stopped incrementing while labkit counters continue.
- [ ] Run the enforce window for 24 hours. Pass criteria:
  - Q5: p99 utilization within 10% of pre-flip baseline.
  - Q6: post-enforce labkit block rate matches Phase 1 legacy block rate within boundary noise.
  - Q7: no 429 / 5xx spike on affected feature-category dashboards (notifications, analytics, identity verification, groups_and_projects).
- [ ] If shadow or enforce fails: disable the relevant flag and investigate before retrying.
### Preparation before global rollout
- [ ] Set a milestone on this issue once both flags are stable in production.
- [ ] No external API consumer impact expected (these flags only change the implementation behind `ApplicationRateLimiter#throttled?`; the public Boolean return is unchanged).
### Release the feature
After both cohort 3 flags have been stable in production for at least one week, open a follow-up cleanup MR to:
- Remove the cohort 3 entries from `lib/gitlab/application_rate_limiter.rb`'s legacy `rate_limits` hash once all cohorts have cut over (alongside cohort 1 and 2).
- Remove the cohort 3 dispatch branch from `_throttled?` once all cohorts have cut over.
- Run `/chatops gitlab run feature delete rate_limiter_use_labkit_cohort_3 --dev --pre --staging --staging-ref --production`, then run the same command for `rate_limiter_use_labkit_cohort_3_enforce`.
## Rollback Steps
For the whole cohort:
- [ ] Disable enforce: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_3_enforce false` (restores legacy enforcement within seconds; shadow stays on so divergence keeps reporting).
- [ ] If shadow itself is the problem: also `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_3 false`.
- [ ] Verify legacy counters resume incrementing for cohort 3 keys.
For a single misbehaving key (without rolling back the whole cohort):
- [ ] Drop the key's `SupportedRateLimits` entry and reship as a follow-up MR.
- [ ] Confirm the next deploy removes the key from cohort 3 dispatch.
To delete the flags from all environments after rollback:
```
/chatops gitlab run feature delete rate_limiter_use_labkit_cohort_3 --dev --pre --staging --staging-ref --production
/chatops gitlab run feature delete rate_limiter_use_labkit_cohort_3_enforce --dev --pre --staging --staging-ref --production
```
## References
- https://gitlab.com/gitlab-org/gitlab/-/merge_requests/235212
- https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28810
- https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28808
- https://gitlab.com/gitlab-org/gitlab/-/work_items/598560 (cohort 1 & 2 rollout)
- https://gitlab.com/groups/gitlab-com/gl-infra/-/work_items/2021 (parent epic)