[FF] Cohort 1 & 2 ApplicationRateLimiter to labkit migration flags rollout
## Summary
This issue tracks the rollout of 10 ops feature flags (5 keys × 2 flags) introduced in !233816, which routes a first cohort of `ApplicationRateLimiter` keys through `Labkit::RateLimit::Limiter`. The five rate-limit keys covered are `pipelines_create`, `notes_create`, `search_rate_limit`, `users_get_by_id`, and `user_sign_in`.
For each key, two flags gate the rollout:
- `rate_limiter_use_labkit_<key>` opts the key into the labkit path (shadow mode).
- `rate_limiter_use_labkit_<key>_enforce` lets labkit's decision win over legacy.
See the work item for full design context: https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28803.
## Owners
- Most appropriate Slack channel to reach out to: `#proj-ai-to-prod-rate-limits`
- Best individual to reach out to: @mwoolf @reprazent @swiskow @rnienaber
## Expectations
### What are we expecting to happen?
In shadow mode (`rate_limiter_use_labkit_<key>` on, `rate_limiter_use_labkit_<key>_enforce` off), the labkit path runs alongside legacy and increments its own Redis counter, but users see only the legacy path's decision. The Prometheus counter `gitlab_rate_limiter_labkit_shadow_total` records per-key agreement, so a 24-hour shadow run can confirm parity before cutover. Target: < 0.5% divergence, excluding 1-second window-boundary noise.
In enforce mode (both flags on), the labkit path's decision is the one that blocks users; the legacy path is skipped entirely.
End-user-visible behavior is unchanged across both states when the two paths agree.
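The two flag states described above can be sketched as a small dispatch function. This is an illustration only, not the code from !233816: the function name, the callables, and the `record_shadow` hook are hypothetical, and treating "enforce on, use_labkit off" as legacy-only is an assumption.

```python
from typing import Callable


def throttled(
    use_labkit: bool,
    enforce: bool,
    legacy_check: Callable[[], bool],
    labkit_check: Callable[[], bool],
    record_shadow: Callable[[bool], None] = lambda agreed: None,
) -> bool:
    """Return the decision the user sees for a single rate-limit check."""
    if not use_labkit:
        # Flag off: only the legacy path runs.
        # Assumption: the enforce flag has no effect without use_labkit.
        return legacy_check()
    if enforce:
        # Enforce mode: labkit decides and the legacy path is skipped.
        return labkit_check()
    # Shadow mode: both paths run, agreement is recorded,
    # and the legacy decision is what the user sees.
    legacy = legacy_check()
    labkit = labkit_check()
    record_shadow(legacy == labkit)
    return legacy
```

In shadow mode a disagreement only shows up in the recorded agreement value; the returned decision is still the legacy one.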
### What can go wrong and how would we detect it?
- Divergence between legacy and labkit decisions above the 0.5% target — visible in `gitlab_rate_limiter_labkit_shadow_total{agreement="diverge"}` on Grafana.
- Increased Redis latency from the additional round-trip per check (one `incr`, one `expire`, one recovery `get`). Watch `gitlab_redis_client_requests_total` and the Redis-client latency histogram on the rate-limiting instance.
- Throttle thresholds firing earlier or later than expected — visible as anomalies in `gitlab_application_rate_limiter_throttle_utilization_ratio` for the affected `throttle_key`.
- Sign-in regressions on `user_sign_in` — `verifies_with_email` uses a block-form `check_rate_limit!`; failure modes appear as authentication errors rather than rate-limit errors.
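As a rough aid for reading the divergence metric, the 0.5% target can be expressed as a small calculation. This is a sketch under assumptions: the inputs are counts read off `gitlab_rate_limiter_labkit_shadow_total` over the shadow window, and the function names and the choice to drop boundary events from both numerator and denominator are illustrative rather than taken from the MR.

```python
THRESHOLD = 0.005  # the < 0.5% divergence target


def divergence_rate(agree: int, diverge: int, boundary_diverge: int = 0) -> float:
    """Fraction of checks where legacy and labkit disagreed.

    boundary_diverge is the part of `diverge` attributed to 1-second
    window-boundary events; those events are excluded entirely.
    """
    if boundary_diverge > diverge:
        raise ValueError("boundary_diverge cannot exceed diverge")
    counted_diverge = diverge - boundary_diverge
    total = agree + counted_diverge
    if total == 0:
        return 0.0  # no traffic observed, nothing to compare
    return counted_diverge / total


def shadow_passes(agree: int, diverge: int, boundary_diverge: int = 0) -> bool:
    """True when the shadow window meets the divergence target."""
    return divergence_rate(agree, diverge, boundary_diverge) < THRESHOLD
```

For example, 9,960 agreeing checks and 50 diverging checks, 10 of which are boundary events, give 40 / 10,000 = 0.4%, which is within the target.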
## Rollout Steps
Note: Please make sure to run the chatops commands in the Slack channel that gets impacted by the command.
### Rollout on non-production environments
- [ ] Verify !233816 is merged to `master` and deployed to non-production environments with `/chatops gitlab run auto_deploy status <merge-commit>`.
- [ ] Enable `rate_limiter_use_labkit_users_get_by_id` on non-production with `/chatops gitlab run feature set rate_limiter_use_labkit_users_get_by_id true --dev --pre --staging --staging-ref` and verify shadow counters appear in dashboards.
- [ ] Enable the remaining four `rate_limiter_use_labkit_<key>` flags one at a time on non-production, verifying after each that the shadow counter shows agreement and dashboards stay clean.
- [ ] Allow the shadow window to run for 24 hours total per key on staging-canary before promoting any flag to production shadow.
### Specific rollout on production
For visibility, all `/chatops` commands that target production must be executed in the [`#production` Slack channel](https://gitlab.slack.com/archives/C101F3796) and cross-posted to the responsible team channel.
For each of the five keys, in this order (lowest blast radius first):
1. `users_get_by_id`
2. `notes_create`
3. `search_rate_limit`
4. `pipelines_create`
5. `user_sign_in`
Per key:
- [ ] Enable shadow mode in production: `/chatops gitlab run feature set rate_limiter_use_labkit_<key> true`.
- [ ] Wait 60+ seconds for the Flipper L1 cache to propagate to all Puma workers, then verify that the shadow counters are incrementing.
- [ ] Run the shadow window for 24 hours. Pass criterion: < 0.5% divergence rate from `gitlab_rate_limiter_labkit_shadow_total`, excluding 1-second window-boundary events.
- [ ] If shadow passes: enable enforce with `/chatops gitlab run feature set rate_limiter_use_labkit_<key>_enforce true`.
- [ ] Wait 60+ seconds, then verify the legacy Redis counter for that key has stopped incrementing while the labkit counter continues.
- [ ] If shadow fails: disable `rate_limiter_use_labkit_<key>` and investigate divergence before retrying.
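To avoid typos when substituting `<key>` by hand, the per-key production commands can be generated in rollout order. The sketch below only builds and prints the command strings shown in the steps above; it does not talk to Slack or chatops, and the helper name is hypothetical.

```python
# Rollout order from this issue, lowest blast radius first.
ROLLOUT_ORDER = [
    "users_get_by_id",
    "notes_create",
    "search_rate_limit",
    "pipelines_create",
    "user_sign_in",
]


def production_commands(key: str) -> tuple[str, str]:
    """Return the (shadow, enforce) chatops commands for one key."""
    flag = f"rate_limiter_use_labkit_{key}"
    shadow = f"/chatops gitlab run feature set {flag} true"
    enforce = f"/chatops gitlab run feature set {flag}_enforce true"
    return shadow, enforce


if __name__ == "__main__":
    for step, key in enumerate(ROLLOUT_ORDER, start=1):
        shadow, enforce = production_commands(key)
        print(f"{step}. {key}")
        print(f"   shadow:  {shadow}")
        print(f"   enforce: {enforce}  (only after the 24-hour shadow window passes)")
```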
### Preparation before global rollout
- [ ] Set a milestone on this issue once all five enforce flags are stable in production.
- [ ] Confirm that no external API consumer impact is expected: these flags only change the implementation behind `ApplicationRateLimiter#throttled?`, and the public Boolean return is unchanged.
### Release the feature
After all five enforce flags have been stable for at least one week, open a follow-up cleanup MR to:
- Remove the `LabkitAdapter` dispatch from `_throttled?` and delete the legacy code path for these five keys.
- Remove the 10 YAML definitions.
- Run `/chatops gitlab run feature delete <flag-name> --dev --pre --staging --staging-ref --production` for each of the 10 flags.
## Rollback Steps
For any single key that misbehaves:
- [ ] Disable enforce: `/chatops gitlab run feature set rate_limiter_use_labkit_<key>_enforce false`.
- [ ] Disable shadow: `/chatops gitlab run feature set rate_limiter_use_labkit_<key> false`.
- [ ] Verify legacy counter resumes incrementing.
To roll back the whole cohort at once, repeat the two commands for all five keys.
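A minimal sketch for producing the full set of cohort rollback commands, enforce before shadow for each key. It only builds the command strings already listed in this section; the function name is hypothetical and nothing is sent to chatops.

```python
COHORT_1_KEYS = [
    "pipelines_create",
    "notes_create",
    "search_rate_limit",
    "users_get_by_id",
    "user_sign_in",
]


def rollback_commands(keys: list[str]) -> list[str]:
    """Return chatops commands that disable enforce, then shadow, per key."""
    commands = []
    for key in keys:
        flag = f"rate_limiter_use_labkit_{key}"
        commands.append(f"/chatops gitlab run feature set {flag}_enforce false")
        commands.append(f"/chatops gitlab run feature set {flag} false")
    return commands


if __name__ == "__main__":
    for command in rollback_commands(COHORT_1_KEYS):
        print(command)
```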
To delete from all environments after rollback:
```
/chatops gitlab run feature delete rate_limiter_use_labkit_<key> --dev --pre --staging --staging-ref --production
/chatops gitlab run feature delete rate_limiter_use_labkit_<key>_enforce --dev --pre --staging --staging-ref --production
```
## Cohort 2 flags
This issue also tracks the rollout of the cohort 2 flag pair introduced in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/234565:
- `rate_limiter_use_labkit_cohort_2` opts the cohort into the labkit path (shadow mode).
- `rate_limiter_use_labkit_cohort_2_enforce` lets labkit's decision win over legacy.
Cohort 2 covers ~95 `ApplicationRateLimiter` keys (83 CE + 12 EE). The full list lives in `lib/gitlab/application_rate_limiter/labkit_adapter/supported_rate_limits.rb` and `ee/lib/ee/gitlab/application_rate_limiter/labkit_adapter/supported_rate_limits.rb`. Unlike cohort 1, these keys share a single flag pair rather than each having its own.
### Operating constraints
- **Binary on/off, not percentage.** Many cohort 2 keys fire from Sidekiq workers (`bitbucket_server_import`, `gitea_import`, `github_import`, `project_import`, `project_fork_sync`, `auto_rollback_deployment`, etc.) where `Feature.current_request` resolves to a per-call UUID and percentage rollouts behave non-deterministically. Toggle as fully on or fully off.
- **No surgical kill-switch.** If shadow divergence on a single key exceeds tolerance, the only mitigation is to drop that key from the registry and reship; toggling the flag affects all 95 keys at once. Per-key visibility remains in `gitlab_rate_limiter_labkit_shadow_total{key}`.
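The first constraint can be illustrated with a toy percentage gate. This is not Flipper's or GitLab's actual bucketing code; the CRC32-based bucketing and all names here are assumptions chosen only to show why a fresh UUID per call makes a percentage rollout unstable, while 0% and 100% stay deterministic.

```python
import random
import uuid
import zlib


def percentage_enabled(actor_id: str, percentage: int) -> bool:
    """Toy gate: hash the actor id into a bucket 0..99 and compare."""
    return zlib.crc32(actor_id.encode()) % 100 < percentage


def stable_actor_results(actor_id: str, percentage: int, calls: int) -> set:
    """Outcomes seen when every call uses the same actor id."""
    return {percentage_enabled(actor_id, percentage) for _ in range(calls)}


def per_call_uuid_results(percentage: int, calls: int, seed: int = 0) -> set:
    """Outcomes seen when every call gets a fresh UUID as its actor id."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    outcomes = set()
    for _ in range(calls):
        call_id = str(uuid.UUID(int=rng.getrandbits(128), version=4))
        outcomes.add(percentage_enabled(call_id, percentage))
    return outcomes
```

With a stable actor the gate always returns the same answer. With a per-call UUID at 50%, consecutive checks for the same job can land on either side of the gate, which is why the cohort 2 flags should only be set fully on or fully off.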
### Cohort 2 rollout on non-production environments
- [ ] Verify https://gitlab.com/gitlab-org/gitlab/-/merge_requests/234565 is merged to `master` and deployed to non-production environments with `/chatops gitlab run auto_deploy status <merge-commit>`.
- [ ] Enable shadow on non-production: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_2 true --dev --pre --staging --staging-ref`.
- [ ] Verify shadow counters appear in dashboards for a representative sample of cohort 2 keys (e.g. `bulk_import`, `expanded_diff_files`, `raw_blob`, `web_hook_event_resend`, `gitlab_shell_operation`).
- [ ] Run the shadow window for 24 hours total on staging-canary before promoting to production shadow.
### Cohort 2 rollout on production
For visibility, all `/chatops` commands that target production must be executed in the [`#production` Slack channel](https://gitlab.slack.com/archives/C101F3796) and cross-posted to the responsible team channel.
- [ ] Enable shadow in production: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_2 true`.
- [ ] Wait 60+ seconds for the Flipper L1 cache to propagate, then verify that the shadow counters are incrementing per key.
- [ ] Run the shadow window for 24 hours. Pass criterion: < 0.5% divergence rate per key from `gitlab_rate_limiter_labkit_shadow_total`, excluding 1-second window-boundary events (`boundary="true"` label). Investigate any individual key whose divergence exceeds 0.5%; if a fix isn't feasible in-flight, drop the key from the registry and reship before flipping enforce.
- [ ] If shadow passes: enable enforce with `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_2_enforce true`.
- [ ] Wait 60+ seconds, then verify legacy Redis counters for cohort 2 keys have stopped incrementing while labkit counters continue.
- [ ] If shadow fails: disable `rate_limiter_use_labkit_cohort_2` and investigate divergence before retrying.
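Because cohort 2 shares one flag pair, the pass criterion has to be evaluated per key. The sketch below groups exported counter samples by key and lists the keys that miss the target. The sample tuple layout and the `"agree"` label value are assumptions; only `agreement="diverge"`, the `key` label, and `boundary="true"` are named in this issue.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# (key, agreement, boundary, count): one row per label combination.
Sample = Tuple[str, str, bool, int]


def keys_over_threshold(samples: Iterable[Sample], threshold: float = 0.005) -> List[str]:
    """Return the keys whose non-boundary divergence rate is not below threshold."""
    agree = defaultdict(int)
    diverge = defaultdict(int)
    for key, agreement, boundary, count in samples:
        if boundary:
            continue  # window-boundary events are excluded from the criterion
        if agreement == "diverge":
            diverge[key] += count
        else:
            agree[key] += count

    failing = []
    for key in sorted(set(agree) | set(diverge)):
        total = agree[key] + diverge[key]
        if total and diverge[key] / total >= threshold:
            failing.append(key)
    return failing
```

Any key returned here is a candidate for investigation or for being dropped from the registry before enforce is enabled.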
### Cohort 2 release / cleanup
After both cohort 2 flags have been stable in production for at least one week, open a follow-up cleanup MR to:
- Remove the cohort 2 entries from `lib/gitlab/application_rate_limiter.rb`'s legacy `rate_limits` hash for keys that have been migrated successfully.
- Remove the cohort 2 dispatch branch from `_throttled?` once all cohorts have cut over.
- Run `/chatops gitlab run feature delete rate_limiter_use_labkit_cohort_2 --dev --pre --staging --staging-ref --production`, then run the same command for `rate_limiter_use_labkit_cohort_2_enforce`.
### Cohort 2 rollback
For the whole cohort:
- [ ] Disable enforce: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_2_enforce false`.
- [ ] Disable shadow: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_2 false`.
- [ ] Verify legacy counters resume incrementing for cohort 2 keys.
For a single misbehaving key (without rolling back the whole cohort):
- [ ] Drop the key's registry entry and reship as a follow-up MR.
- [ ] Confirm the next deploy removes the key from the cohort 2 dispatch.
To delete the flags from all environments after rollback:
```
/chatops gitlab run feature delete rate_limiter_use_labkit_cohort_2 --dev --pre --staging --staging-ref --production
/chatops gitlab run feature delete rate_limiter_use_labkit_cohort_2_enforce --dev --pre --staging --staging-ref --production
```
## References
- https://gitlab.com/gitlab-org/gitlab/-/merge_requests/233816
- https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28803
- https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28808
- https://gitlab.com/gitlab-org/gitlab/-/merge_requests/234565
- https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28809