[FF] Cohort 1 & 2 ApplicationRateLimiter to labkit migration flags rollout (#598560)


<details>
<summary>
Everyone can contribute. [Help move this issue forward](https://handbook.gitlab.com/handbook/marketing/developer-relations/contributor-success/community-contributors-workflows/#contributor-links) while earning points, leveling up and collecting rewards.
</summary>

- [Label this issue](https://contributors.gitlab.com/manage-issue?action=label&projectId=278964&issueIid=598560)

</details>

## Summary

This issue tracks the rollout of 10 ops feature flags (5 keys × 2 flags) introduced in !233816, which routes a first cohort of `ApplicationRateLimiter` keys through `Labkit::RateLimit::Limiter`.

The five rate-limit keys covered are `pipelines_create`, `notes_create`, `search_rate_limit`, `users_get_by_id`, and `user_sign_in`. For each key, two flags gate the rollout:

- `rate_limiter_use_labkit_<key>` opts the key into the labkit path (shadow mode).
- `rate_limiter_use_labkit_<key>_enforce` lets labkit's decision win over legacy.

See the work item for full design context: https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28803.

## Owners

- Most appropriate Slack channel to reach out to: `#proj-ai-to-prod-rate-limits`
- Best individual to reach out to: @mwoolf @reprazent @swiskow @rnienaber

## Expectations

### What are we expecting to happen?

In shadow mode (use_labkit on, enforce off) the labkit path runs alongside legacy and increments its own Redis counter, but the legacy path's decision is what users see. The Prometheus counter `gitlab_rate_limiter_labkit_shadow_total` records per-key agreement so a 24-hour shadow run can confirm parity before cutover. Target: < 0.5% divergence excluding 1-second window-boundary noise.

In enforce mode (both flags on) the labkit path's decision blocks users; the legacy path is skipped entirely.

End-user-visible behavior is unchanged across both states when the two paths agree.

### What can go wrong and how would we detect it?
- Divergence between legacy and labkit decisions above the 0.5% target, visible in `gitlab_rate_limiter_labkit_shadow_total{agreement="diverge"}` on Grafana.
- Increased Redis latency from the additional round-trip per check (one `incr`, one `expire`, one recovery `get`). Watch `gitlab_redis_client_requests_total` and the Redis-client latency histogram on the rate-limiting instance.
- Throttle thresholds firing earlier or later than expected, visible as anomalies in `gitlab_application_rate_limiter_throttle_utilization_ratio` for the affected `throttle_key`.
- Sign-in regressions on `user_sign_in`: `verifies_with_email` uses a block-form `check_rate_limit!`; failure modes appear as authentication errors rather than rate-limit errors.

## Rollout Steps

Note: Please make sure to run the chatops commands in the Slack channel that gets impacted by the command.

### Rollout on non-production environments

- Verify !233816 is merged to `master` and deployed to non-production environments with `/chatops gitlab run auto_deploy status <merge-commit>`.
- [ ] Enable `rate_limiter_use_labkit_users_get_by_id` on non-production with `/chatops gitlab run feature set rate_limiter_use_labkit_users_get_by_id true --dev --pre --staging --staging-ref` and verify shadow counters appear in dashboards.
- [ ] Enable the remaining four `rate_limiter_use_labkit_<key>` flags one at a time on non-production, verifying after each that the shadow counter shows agreement and dashboards stay clean.
- [ ] Allow the shadow window to run for 24 hours total per key on staging-canary before promoting any flag to production shadow.

### Specific rollout on production

For visibility, all `/chatops` commands that target production must be executed in the [`#production` Slack channel](https://gitlab.slack.com/archives/C101F3796) and cross-posted to the responsible team channel.

For each of the five keys, in this order (lowest blast radius first):

1. `users_get_by_id`
2. `notes_create`
3. `search_rate_limit`
4. `pipelines_create`
5. `user_sign_in`

Per key:

- [ ] Enable shadow mode in production: `/chatops gitlab run feature set rate_limiter_use_labkit_<key> true`.
- [ ] Wait 60+ seconds for Flipper L1 cache to propagate to all puma workers, then verify shadow counters incrementing.
- [ ] Run the shadow window for 24 hours. Pass criterion: < 0.5% divergence rate from `gitlab_rate_limiter_labkit_shadow_total`, excluding 1-second window-boundary events.
- [ ] If shadow passes: enable enforce with `/chatops gitlab run feature set rate_limiter_use_labkit_<key>_enforce true`.
- [ ] Wait 60+ seconds, then verify the legacy Redis counter for that key has stopped incrementing while the labkit counter continues.
- [ ] If shadow fails: disable `rate_limiter_use_labkit_<key>` and investigate divergence before retrying.

### Preparation before global rollout

- [ ] Set a milestone on this issue once all five enforce flags are stable in production.
- [ ] No external API consumer impact expected (these flags only change the implementation behind `ApplicationRateLimiter#throttled?`; the public Boolean return is unchanged).

### Release the feature

After all five enforce flags have been stable for at least one week, open a follow-up cleanup MR to:

- Remove the `LabkitAdapter` dispatch from `_throttled?` and delete the legacy code path for these five keys.
- Remove the 10 YAML definitions.
- Run `/chatops gitlab run feature delete <flag-name> --dev --pre --staging --staging-ref --production` for each.

## Rollback Steps

For any single key that misbehaves:

- [ ] Disable enforce: `/chatops gitlab run feature set rate_limiter_use_labkit_<key>_enforce false`.
- [ ] Disable shadow: `/chatops gitlab run feature set rate_limiter_use_labkit_<key> false`.
- [ ] Verify legacy counter resumes incrementing.

To roll back the whole cohort at once, repeat the two commands for all five keys.
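Repeating the two rollback commands for five keys by hand is easy to get wrong under pressure, so it may help to generate the ten lines up front. The following is a minimal sketch, not part of the rollout tooling: the key names and flag-name patterns come from this issue, while the function name `rollback_commands` and the idea of printing the commands for copy-paste into Slack are assumptions made for illustration.

```python
# Key names and flag-name patterns are taken from this issue.
# The helper name and the print-then-paste workflow are hypothetical.
COHORT_1_KEYS = [
    "users_get_by_id",
    "notes_create",
    "search_rate_limit",
    "pipelines_create",
    "user_sign_in",
]


def rollback_commands(keys):
    """Return the chatops commands that disable enforce, then shadow, per key."""
    commands = []
    for key in keys:
        flag = f"rate_limiter_use_labkit_{key}"
        # Enforce is disabled first so the legacy decision is the one users
        # see again before the labkit path is switched off entirely.
        commands.append(f"/chatops gitlab run feature set {flag}_enforce false")
        commands.append(f"/chatops gitlab run feature set {flag} false")
    return commands


if __name__ == "__main__":
    for command in rollback_commands(COHORT_1_KEYS):
        print(command)
```

Each printed line still has to be run manually in the `#production` channel; the sketch only removes the risk of a mistyped flag name.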
To delete from all environments after rollback:

```
/chatops gitlab run feature delete rate_limiter_use_labkit_<key> --dev --pre --staging --staging-ref --production
/chatops gitlab run feature delete rate_limiter_use_labkit_<key>_enforce --dev --pre --staging --staging-ref --production
```

## Cohort 2 flags

This issue also tracks the rollout of the cohort 2 flag pair introduced in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/234565:

- `rate_limiter_use_labkit_cohort_2` opts the cohort into the labkit path (shadow mode).
- `rate_limiter_use_labkit_cohort_2_enforce` lets labkit's decision win over legacy.

Cohort 2 covers ~95 `ApplicationRateLimiter` keys (83 CE + 12 EE). The full list lives in `lib/gitlab/application_rate_limiter/labkit_adapter/supported_rate_limits.rb` and `ee/lib/ee/gitlab/application_rate_limiter/labkit_adapter/supported_rate_limits.rb`. Unlike cohort 1, these keys share a single flag pair rather than each having its own.

### Operating constraints

- **Binary on/off, not percentage.** Many cohort 2 keys fire from Sidekiq workers (`bitbucket_server_import`, `gitea_import`, `github_import`, `project_import`, `project_fork_sync`, `auto_rollback_deployment`, etc.) where `Feature.current_request` resolves to a per-call UUID and percentage rollouts behave non-deterministically. Toggle as fully on or fully off.
- **No surgical kill-switch.** If shadow divergence on a single key exceeds tolerance, the only mitigation is to drop that key from the registry and reship; toggling the flag affects all 95 keys at once. Per-key visibility remains in `gitlab_rate_limiter_labkit_shadow_total{key}`.

### Cohort 2 rollout on non-production environments

- [ ] Verify https://gitlab.com/gitlab-org/gitlab/-/merge_requests/234565 is merged to `master` and deployed to non-production environments with `/chatops gitlab run auto_deploy status <merge-commit>`.
- [ ] Enable shadow on non-production: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_2 true --dev --pre --staging --staging-ref`.
- [ ] Verify shadow counters appear in dashboards for a representative sample of cohort 2 keys (e.g. `bulk_import`, `expanded_diff_files`, `raw_blob`, `web_hook_event_resend`, `gitlab_shell_operation`).
- [ ] Run the shadow window for 24 hours total on staging-canary before promoting to production shadow.

### Cohort 2 rollout on production

For visibility, all `/chatops` commands that target production must be executed in the [`#production` Slack channel](https://gitlab.slack.com/archives/C101F3796) and cross-posted to the responsible team channel.

- [ ] Enable shadow in production: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_2 true`.
- [ ] Wait 60+ seconds for Flipper L1 cache to propagate, then verify shadow counters incrementing per-key.
- [ ] Run the shadow window for 24 hours. Pass criterion: < 0.5% divergence rate per key from `gitlab_rate_limiter_labkit_shadow_total`, excluding 1-second window-boundary events (`boundary="true"` label). Investigate any individual key whose divergence exceeds 0.5%; if a fix isn't feasible in-flight, drop the key from the registry and reship before flipping enforce.
- [ ] If shadow passes: enable enforce with `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_2_enforce true`.
- [ ] Wait 60+ seconds, then verify legacy Redis counters for cohort 2 keys have stopped incrementing while labkit counters continue.
- [ ] If shadow fails: disable shadow flag and investigate divergence before retrying.

### Cohort 2 release / cleanup

After both cohort 2 flags have been stable in production for at least one week, open a follow-up cleanup MR to:

- Remove the cohort 2 entries from `lib/gitlab/application_rate_limiter.rb`'s legacy `rate_limits` hash for keys that have been migrated successfully.
- Remove the cohort 2 dispatch branch from `_throttled?` once all cohorts have cut over.
- Run `/chatops gitlab run feature delete rate_limiter_use_labkit_cohort_2 --dev --pre --staging --staging-ref --production` and similarly for `_enforce`.

### Cohort 2 rollback

For the whole cohort:

- [ ] Disable enforce: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_2_enforce false`.
- [ ] Disable shadow: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_2 false`.
- [ ] Verify legacy counters resume incrementing for cohort 2 keys.

For a single misbehaving key (without rolling back the whole cohort):

- [ ] Drop the key's registry entry and reship as a follow-up MR.
- [ ] Confirm the next deploy removes the key from the cohort 2 dispatch.

To delete the flags from all environments after rollback:

```
/chatops gitlab run feature delete rate_limiter_use_labkit_cohort_2 --dev --pre --staging --staging-ref --production
/chatops gitlab run feature delete rate_limiter_use_labkit_cohort_2_enforce --dev --pre --staging --staging-ref --production
```

## References

- https://gitlab.com/gitlab-org/gitlab/-/merge_requests/233816
- https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28803
- https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28808
- https://gitlab.com/gitlab-org/gitlab/-/merge_requests/234565
- https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28809
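
As a closing illustration of the pass criterion used for both cohorts (< 0.5% divergence, excluding 1-second window-boundary events), the arithmetic could look like the sketch below. It assumes the agree, diverge, and boundary-diverge counts for one key have already been read off the `gitlab_rate_limiter_labkit_shadow_total` dashboard by hand, and that boundary events are dropped from both the numerator and the denominator; this issue does not spell out either detail, and the function names are hypothetical.

```python
def divergence_rate(agree, diverge, boundary_diverge=0):
    """Fraction of non-boundary shadow checks where the two paths disagreed.

    Assumption: boundary events are excluded from numerator and denominator.
    """
    if boundary_diverge > diverge:
        raise ValueError("boundary_diverge cannot exceed diverge")
    counted_diverge = diverge - boundary_diverge
    total = agree + counted_diverge
    if total == 0:
        # No shadow samples at all: nothing diverged, but nothing was checked.
        return 0.0
    return counted_diverge / total


def shadow_passes(agree, diverge, boundary_diverge=0, threshold=0.005):
    """Apply the < 0.5% target from this issue to one key's counts."""
    return divergence_rate(agree, diverge, boundary_diverge) < threshold


if __name__ == "__main__":
    # 50 divergences, 10 of them at a window boundary: 40 / 10000 = 0.4%.
    print(shadow_passes(agree=9960, diverge=50, boundary_diverge=10))
```

A key with zero shadow samples returns a rate of 0.0 here, so an empty window should be treated as "not yet verified" rather than as a pass.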
