[FF] Cohort 6 EE rate-limit registry corrections rollout
## Summary

This issue tracks the rollout of the cohort 6 flag pair introduced in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/237215, which routes five EE-only `ApplicationRateLimiter` keys through `Labkit::RateLimit::Limiter`. Cohort 6 is the "EE registry corrections" slice: each of these keys has a live caller in `master` but the EE registry's exclusion comment falsely claimed they had no call site.

The two flags:

- `rate_limiter_use_labkit_cohort_6` opts the cohort into the labkit path (shadow mode).
- `rate_limiter_use_labkit_cohort_6_enforce` lets labkit's decision win over legacy.

Cohort 6 covers five EE rate-limit keys:

| Key | Characteristics | Call site |
|---|---|---|
| `container_scanning_for_registry_scans` | `[project]` | `ee/app/services/app_sec/container_scanning/scan_image_service.rb` |
| `dependency_scanning_sbom_scan_api_download` | `[project]` | `ee/lib/api/security/vulnerability_scanning/sbom_scans.rb` |
| `dependency_scanning_sbom_scan_api_upload` | `[project]` | `ee/lib/api/security/vulnerability_scanning/sbom_scans.rb` |
| `semantic_code_search_ad_hoc_indexing` | `[namespace]` | `ee/lib/ai/active_context/concerns/rate_limiting.rb` |
| `semantic_search_rate_limit` | `[user]` | `ee/lib/ee/api/search/semantic_code_search.rb` |

The `partner_{aws,gcp,postman}_api` keys remain deliberately excluded from this cohort: their intervals are sub-second, pending `Labkit::RateLimit` sub-second support.

See the cohort issue for design context: https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/29077.

## Owners

- Most appropriate Slack channel to reach out to: `#proj-ai-to-prod-rate-limits`
- Best individual to reach out to: @mwoolf

## Expectations

### What are we expecting to happen?

In shadow mode (use_labkit on, enforce off) the labkit path runs alongside legacy and increments its own Redis counter, but legacy's decision is what users see.
The Prometheus counter `gitlab_rate_limiter_labkit_shadow_total` records per-key agreement so a 24-hour shadow run can confirm parity before cutover. Target: < 0.5% divergence excluding 1-second window-boundary noise (`boundary="true"`).

In enforce mode (both flags on) the labkit path's decision blocks users; the legacy path is skipped entirely. End-user-visible behavior is unchanged across both states when the two paths agree.

No `.peek` callers exist for any cohort 6 key, so the peek dispatch path is not exercised by this cohort.

### What can go wrong and how would we detect it?

- Divergence between legacy and labkit decisions above the 0.5% target, visible in `gitlab_rate_limiter_labkit_shadow_total{agreement="diverge", boundary="false"}` on Grafana.
- A cohort 6 key receives an unexpected per-call `threshold:` or `interval:` override, visible in `gitlab_rate_limiter_labkit_override_total{key=...}`. The two `dependency_scanning_sbom_scan_api_*` keys use callable lambda thresholds; the adapter supports these, so this metric should remain zero for cohort 6.
- Increased Redis latency from the additional round-trip per check. Watch `gitlab_redis_client_requests_total` and the Redis-client latency histogram on the rate-limiting instance.
- Throttle thresholds firing earlier or later than expected, visible as anomalies in `gitlab_application_rate_limiter_throttle_utilization_ratio` for the affected `throttle_key`.
- Regressions in container scanning, SBOM-scan API authorization, semantic code search throttling, or ad-hoc indexing fairness, visible in the corresponding feature-category dashboards.

The Grafana go/no-go queries (Q1 to Q7) for shadow and enforce phases follow the same shape as cohort 3: https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28810#note_3328172110.

## Rollout Steps

Note: Please make sure to run the chatops commands in the Slack channel that gets impacted by the command.
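
As a worked example of the 0.5% divergence target described under Expectations, the per-key check can be sketched in Ruby. This is a minimal illustration, not the actual Grafana query: the label names `key`, `agreement`, and `boundary` come from the metric described above, while the `Sample` struct, the `"agree"` label value, the helper names, and all numbers are assumptions made up for the example.

```ruby
# Hypothetical counter increases over a shadow window. Label names follow
# gitlab_rate_limiter_labkit_shadow_total as described in this issue; the
# "agree" value and every number below are illustrative assumptions.
Sample = Struct.new(:key, :agreement, :boundary, :increase, keyword_init: true)

DIVERGENCE_TARGET = 0.005 # 0.5%

# Returns { key => divergence ratio }, ignoring window-boundary samples
# (boundary="true"), which the target explicitly excludes.
def divergence_by_key(samples)
  samples
    .reject { |s| s.boundary == "true" }
    .group_by(&:key)
    .transform_values do |rows|
      total = rows.sum(&:increase)
      diverged = rows.select { |s| s.agreement == "diverge" }.sum(&:increase)
      total.zero? ? 0.0 : diverged.to_f / total
    end
end

# Keys whose divergence is not strictly below the target.
def keys_over_target(samples, target: DIVERGENCE_TARGET)
  divergence_by_key(samples).select { |_key, ratio| ratio >= target }.keys
end

SAMPLES = [
  Sample.new(key: "semantic_search_rate_limit", agreement: "agree", boundary: "false", increase: 9_980),
  Sample.new(key: "semantic_search_rate_limit", agreement: "diverge", boundary: "false", increase: 20),
  Sample.new(key: "semantic_search_rate_limit", agreement: "diverge", boundary: "true", increase: 300),
  Sample.new(key: "container_scanning_for_registry_scans", agreement: "agree", boundary: "false", increase: 990),
  Sample.new(key: "container_scanning_for_registry_scans", agreement: "diverge", boundary: "false", increase: 10)
].freeze

keys_over_target(SAMPLES).each { |key| puts "over target: #{key}" }
```

In this made-up data the boundary-noise row is dropped before the ratio is computed, so a key with many `boundary="true"` divergences can still pass, while a key with 10 non-boundary divergences out of 1,000 calls (1%) would not.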
### Operating constraints

- **Binary on/off, not percentage.** The adapter uses `Feature.current_request` as the actor; for non-request callers it resolves to a per-call UUID and percentage rollouts behave non-deterministically. Toggle fully on or fully off.
- **No surgical kill-switch within the cohort.** If shadow divergence on a single key exceeds tolerance, the mitigation is to drop that key from the registry and reship; toggling the flag affects all five keys at once. Per-key visibility remains in `gitlab_rate_limiter_labkit_shadow_total{key}`.

### Rollout on non-production environments

- [ ] Verify https://gitlab.com/gitlab-org/gitlab/-/merge_requests/237215 is merged to `master` and deployed to non-production environments with `/chatops gitlab run auto_deploy status <merge-commit>`.
- [ ] Enable shadow on non-production: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_6 true --dev --pre --staging --staging-ref`.
- [ ] Verify shadow counters appear in dashboards for the five cohort 6 keys.
- [ ] Run the shadow window for 24 hours total on staging-canary before promoting to production shadow.

### Rollout on production

For visibility, all `/chatops` commands that target production must be executed in the [`#production` Slack channel](https://gitlab.slack.com/archives/C101F3796) and cross-posted to the responsible team channel.

- [ ] Enable shadow in production: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_6 true`.
- [ ] Wait 60+ seconds for Flipper L1 cache to propagate to all puma workers, then verify shadow counters incrementing per-key (Q1).
- [ ] Run the shadow window for 24 hours. Pass criteria:
  - Q2: per-key divergence < 0.5% from `gitlab_rate_limiter_labkit_shadow_total{boundary="false"}`.
  - Q3: no cohort 6 key in `gitlab_rate_limiter_labkit_override_total`.
  - Q4: labkit error rate small relative to calls.
- [ ] If shadow passes: enable enforce with `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_6_enforce true`.
- [ ] Wait 60+ seconds, then verify legacy Redis counters for cohort 6 keys have stopped incrementing while labkit counters continue.
- [ ] Run the enforce window for 24 hours. Pass criteria:
  - Q5: p99 utilization within 10% of pre-flip baseline.
  - Q6: post-enforce labkit block rate matches Phase 1 legacy block rate within boundary noise.
  - Q7: no 429 / 5xx spike on affected feature-category dashboards (container scanning, SBOM ingestion, semantic search, ad-hoc indexing).
- [ ] If shadow or enforce fails: disable the relevant flag and investigate before retrying.

### Preparation before global rollout

- [ ] Set a milestone on this issue once both flags are stable in production.
- [ ] No external API consumer impact expected (these flags only change the implementation behind `ApplicationRateLimiter#throttled?`; the public Boolean return is unchanged).

### Release the feature

After both cohort 6 flags have been stable in production for at least one week, open a follow-up cleanup MR to:

- Remove the cohort 6 entries from `ee/lib/ee/gitlab/application_rate_limiter.rb`'s legacy `rate_limits` hash once all cohorts have cut over.
- Remove the cohort 6 dispatch branch from `_throttled?` once all cohorts have cut over.
- Run `/chatops gitlab run feature delete rate_limiter_use_labkit_cohort_6 --dev --pre --staging --staging-ref --production` and similarly for `_enforce`.

## Rollback Steps

For the whole cohort:

- [ ] Disable enforce: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_6_enforce false` (restores legacy enforcement within seconds; shadow stays on so divergence keeps reporting).
- [ ] If shadow itself is the problem: also `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_6 false`.
- [ ] Verify legacy counters resume incrementing for cohort 6 keys.
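
The whole-cohort rollback steps rely on how the two flags combine, which can be modelled as a small decision function. This is an illustrative sketch only: the real dispatch lives in GitLab's `ApplicationRateLimiter`, and the method names, mode symbols, and callable arguments below are hypothetical. It also assumes that enforce without `use_labkit` behaves as legacy-only, which this issue does not state.

```ruby
# Illustrative model of the cohort 6 flag pair; all names are hypothetical
# and do not mirror the actual GitLab implementation.

# Maps the two flag states to a dispatch mode.
# Assumption: enforce has no effect unless use_labkit is also on.
def dispatch_mode(use_labkit:, enforce:)
  return :legacy_only unless use_labkit

  enforce ? :enforce : :shadow
end

# legacy and labkit are callables returning true when the request is throttled.
def throttled?(use_labkit:, enforce:, legacy:, labkit:)
  case dispatch_mode(use_labkit: use_labkit, enforce: enforce)
  when :legacy_only
    legacy.call
  when :shadow
    labkit.call # runs alongside legacy; its result is not returned to the caller
    legacy.call # legacy's decision is what users see
  when :enforce
    labkit.call # labkit decides; the legacy path is skipped entirely
  end
end
```

Under this model, turning off only the enforce flag moves the cohort from `:enforce` back to `:shadow`, so legacy decides again while the labkit path keeps running for comparison. Turning off `rate_limiter_use_labkit_cohort_6` as well returns to `:legacy_only`.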

For a single misbehaving key (without rolling back the whole cohort):

- [ ] Drop the key's `SupportedRateLimits` entry and reship as a follow-up MR.
- [ ] Confirm the next deploy removes the key from cohort 6 dispatch.

To delete the flags from all environments after rollback:

```
/chatops gitlab run feature delete rate_limiter_use_labkit_cohort_6 --dev --pre --staging --staging-ref --production
/chatops gitlab run feature delete rate_limiter_use_labkit_cohort_6_enforce --dev --pre --staging --staging-ref --production
```

## References

- https://gitlab.com/gitlab-org/gitlab/-/merge_requests/237215
- https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/29077
- https://gitlab.com/gitlab-org/gitlab/-/work_items/599632 (cohort 3 rollout)
- https://gitlab.com/groups/gitlab-com/gl-infra/-/work_items/2021 (parent epic)