[FF] Cohort 6 EE rate-limit registry corrections rollout
## Summary
This issue tracks the rollout of the cohort 6 flag pair introduced in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/237215, which routes five EE-only `ApplicationRateLimiter` keys through `Labkit::RateLimit::Limiter`. Cohort 6 is the "EE registry corrections" slice: each of these keys has a live caller in `master`, but the EE registry's exclusion comment incorrectly claimed it had no call site.
The two flags:
- `rate_limiter_use_labkit_cohort_6` opts the cohort into the labkit path (shadow mode).
- `rate_limiter_use_labkit_cohort_6_enforce` lets labkit's decision win over legacy.
Cohort 6 covers five EE rate-limit keys:
| Key | Characteristics | Call site |
|---|---|---|
| `container_scanning_for_registry_scans` | `[project]` | `ee/app/services/app_sec/container_scanning/scan_image_service.rb` |
| `dependency_scanning_sbom_scan_api_download` | `[project]` | `ee/lib/api/security/vulnerability_scanning/sbom_scans.rb` |
| `dependency_scanning_sbom_scan_api_upload` | `[project]` | `ee/lib/api/security/vulnerability_scanning/sbom_scans.rb` |
| `semantic_code_search_ad_hoc_indexing` | `[namespace]` | `ee/lib/ai/active_context/concerns/rate_limiting.rb` |
| `semantic_search_rate_limit` | `[user]` | `ee/lib/ee/api/search/semantic_code_search.rb` |
The `partner_{aws,gcp,postman}_api` keys remain deliberately excluded from this cohort: their intervals are sub-second, so they stay excluded until `Labkit::RateLimit` supports sub-second intervals.
See the cohort issue for design context: https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/29077.
## Owners
- Most appropriate Slack channel to reach out to: `#proj-ai-to-prod-rate-limits`
- Best individual to reach out to: @mwoolf
## Expectations
### What are we expecting to happen?
In shadow mode (use_labkit on, enforce off) the labkit path runs alongside legacy and increments its own Redis counter, but legacy's decision is what users see. The Prometheus counter `gitlab_rate_limiter_labkit_shadow_total` records per-key agreement so a 24-hour shadow run can confirm parity before cutover. Target: < 0.5% divergence excluding 1-second window-boundary noise (`boundary="true"`).
In enforce mode (both flags on) the labkit path's decision blocks users; the legacy path is skipped entirely. End-user-visible behavior is unchanged across both states when the two paths agree.
No `.peek` callers exist for any cohort 6 key, so the peek dispatch path is not exercised by this cohort.
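The two-flag behavior described above can be sketched as a small dispatch function. This is a minimal illustration, not the adapter's actual code: `Decision`, `throttle_decision`, and the callable arguments are hypothetical names. The handling of "enforce on, use_labkit off" (fall back to legacy) is an assumption, since this issue only describes enforce mode with both flags on.

```ruby
# Hypothetical sketch: these names are illustrative, not the adapter's real API.
Decision = Struct.new(:throttled, :source, :agreement)

# `legacy` and `labkit` are callables that return true when the request
# should be throttled.
def throttle_decision(use_labkit:, enforce:, legacy:, labkit:)
  # Cohort flag off: only the legacy path runs.
  # Assumption: the enforce flag alone has no effect.
  return Decision.new(legacy.call, :legacy, nil) unless use_labkit

  # Enforce mode: labkit decides and the legacy path is skipped entirely.
  return Decision.new(labkit.call, :labkit, nil) if enforce

  # Shadow mode: both paths run, legacy wins, and agreement is recorded.
  legacy_result = legacy.call
  labkit_result = labkit.call
  agreement = legacy_result == labkit_result ? :agree : :diverge
  Decision.new(legacy_result, :legacy, agreement)
end
```

In shadow mode the returned `agreement` value stands in for the label that the shadow counter would record; in the other two modes there is nothing to compare, so it is `nil`.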
### What can go wrong and how would we detect it?
- Divergence between legacy and labkit decisions above the 0.5% target, visible in `gitlab_rate_limiter_labkit_shadow_total{agreement="diverge", boundary="false"}` on Grafana.
- A cohort 6 key receives an unexpected per-call `threshold:` or `interval:` override, visible in `gitlab_rate_limiter_labkit_override_total{key=...}`. The two `dependency_scanning_sbom_scan_api_*` keys use callable lambda thresholds; the adapter supports these, so this metric should remain zero for cohort 6.
- Increased Redis latency from the additional round-trip per check. Watch `gitlab_redis_client_requests_total` and the Redis-client latency histogram on the rate-limiting instance.
- Throttle thresholds firing earlier or later than expected, visible as anomalies in `gitlab_application_rate_limiter_throttle_utilization_ratio` for the affected `throttle_key`.
- Regressions in container scanning, SBOM-scan API authorization, semantic code search throttling, or ad-hoc indexing fairness, visible in the corresponding feature-category dashboards.
The Grafana go/no-go queries (Q1 to Q7) for shadow and enforce phases follow the same shape as cohort 3: https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28810#note_3328172110.
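As a rough illustration of the divergence check, the sketch below computes a per-key divergence ratio from counter increases over the shadow window, excluding boundary samples. The sample shape (an array of hashes carrying `key`, `agreement`, `boundary`, and `value`) and the method names are assumptions made for this example; the real go/no-go queries live in Grafana. Any `agreement` value other than `"diverge"` is treated as agreement.

```ruby
# Hypothetical sample shape: one hash per counter series, mirroring the
# key / agreement / boundary labels of gitlab_rate_limiter_labkit_shadow_total.
# :value is the counter increase over the shadow window.
def divergence_by_key(samples)
  totals = Hash.new { |hash, key| hash[key] = { diverge: 0.0, total: 0.0 } }

  samples.each do |sample|
    next if sample[:boundary] == "true" # exclude window-boundary noise

    bucket = totals[sample[:key]]
    bucket[:total] += sample[:value]
    bucket[:diverge] += sample[:value] if sample[:agreement] == "diverge"
  end

  totals.transform_values do |bucket|
    bucket[:total].zero? ? 0.0 : bucket[:diverge] / bucket[:total]
  end
end

# Keys whose divergence is not below the 0.5% target.
def keys_over_target(samples, target: 0.005)
  divergence_by_key(samples).select { |_key, ratio| ratio >= target }.keys
end
```

Because the flags cannot be toggled per key, a non-empty result from `keys_over_target` points at the key to investigate rather than at something that can be switched off in isolation.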
## Rollout Steps
Note: Please make sure to run the chatops commands in the Slack channel that gets impacted by the command.
### Operating constraints
- **Binary on/off, not percentage.** The adapter uses `Feature.current_request` as the actor; for non-request callers it resolves to a per-call UUID and percentage rollouts behave non-deterministically. Toggle fully on or fully off.
- **No surgical kill-switch within the cohort.** If shadow divergence on a single key exceeds tolerance, the mitigation is to drop that key from the registry and reship; toggling the flag affects all five keys at once. Per-key visibility remains in `gitlab_rate_limiter_labkit_shadow_total{key}`.
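The reason percentage rollouts are unreliable with a per-call UUID actor can be shown with a toy percentage-of-actors gate. This is an illustrative sketch only: `enabled_for_actor?` is a hypothetical helper, and Flipper's actual hashing may differ in detail. The point is that any gate which hashes the actor ID gives a stable answer only when the actor ID itself is stable.

```ruby
require "securerandom"
require "zlib"

# Illustrative percentage-of-actors gate: hash the flag name plus actor ID
# into 0..99 and compare with the rollout percentage.
def enabled_for_actor?(flag_name, actor_id, percentage)
  Zlib.crc32("#{flag_name}#{actor_id}") % 100 < percentage
end

flag = "rate_limiter_use_labkit_cohort_6"

# A stable actor gets the same answer on every check.
stable = Array.new(1_000) { enabled_for_actor?(flag, "request:abc123", 50) }.uniq

# A fresh UUID per call (the non-request case) flips between true and false,
# so the same caller would alternate between the legacy and labkit paths.
per_call = Array.new(1_000) { enabled_for_actor?(flag, SecureRandom.uuid, 50) }.uniq
```

At 0% and 100% the gate returns the same answer for every actor, which is why a fully-on or fully-off toggle stays deterministic even for non-request callers.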
### Rollout on non-production environments
- [ ] Verify https://gitlab.com/gitlab-org/gitlab/-/merge_requests/237215 is merged to `master` and deployed to non-production environments with `/chatops gitlab run auto_deploy status <merge-commit>`.
- [ ] Enable shadow on non-production: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_6 true --dev --pre --staging --staging-ref`.
- [ ] Verify shadow counters appear in dashboards for the five cohort 6 keys.
- [ ] Run the shadow window for 24 hours total on staging-canary before promoting to production shadow.
### Rollout on production
For visibility, all `/chatops` commands that target production must be executed in the [`#production` Slack channel](https://gitlab.slack.com/archives/C101F3796) and cross-posted to the responsible team channel.
- [ ] Enable shadow in production: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_6 true`.
- [ ] Wait 60+ seconds for Flipper L1 cache to propagate to all puma workers, then verify shadow counters incrementing per-key (Q1).
- [ ] Run the shadow window for 24 hours. Pass criteria:
  - Q2: per-key divergence < 0.5% from `gitlab_rate_limiter_labkit_shadow_total{boundary="false"}`.
  - Q3: no cohort 6 key in `gitlab_rate_limiter_labkit_override_total`.
  - Q4: labkit error rate small relative to calls.
- [ ] If shadow passes: enable enforce with `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_6_enforce true`.
- [ ] Wait 60+ seconds, then verify legacy Redis counters for cohort 6 keys have stopped incrementing while labkit counters continue.
- [ ] Run the enforce window for 24 hours. Pass criteria:
  - Q5: p99 utilization within 10% of pre-flip baseline.
  - Q6: post-enforce labkit block rate matches Phase 1 legacy block rate within boundary noise.
  - Q7: no 429 / 5xx spike on affected feature-category dashboards (container scanning, SBOM ingestion, semantic search, ad-hoc indexing).
- [ ] If shadow or enforce fails: disable the relevant flag and investigate before retrying.
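The Q5 criterion in the enforce window can be expressed as a simple tolerance check. The sketch below assumes that "within 10%" means a relative deviation from the pre-flip baseline; that reading, the method name, and the zero-baseline handling are assumptions for illustration and not part of the Grafana queries.

```ruby
# Assumption: "within 10%" is read as a relative deviation from the baseline.
def within_baseline?(current_p99, baseline_p99, tolerance: 0.10)
  # With a zero baseline a relative deviation is undefined, so only an
  # unchanged (zero) value passes.
  return current_p99.zero? if baseline_p99.zero?

  (current_p99 - baseline_p99).abs / baseline_p99.to_f <= tolerance
end

# Example: a baseline p99 utilization of 0.40 tolerates roughly 0.36 to 0.44.
within_baseline?(0.43, 0.40) # passes
within_baseline?(0.50, 0.40) # fails
```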
### Preparation before global rollout
- [ ] Set a milestone on this issue once both flags are stable in production.
- [ ] No external API consumer impact expected (these flags only change the implementation behind `ApplicationRateLimiter#throttled?`; the public Boolean return is unchanged).
### Release the feature
After both cohort 6 flags have been stable in production for at least one week, open a follow-up cleanup MR to:
- Remove the cohort 6 entries from `ee/lib/ee/gitlab/application_rate_limiter.rb`'s legacy `rate_limits` hash once all cohorts have cut over.
- Remove the cohort 6 dispatch branch from `_throttled?` once all cohorts have cut over.
- Run `/chatops gitlab run feature delete rate_limiter_use_labkit_cohort_6 --dev --pre --staging --staging-ref --production` and similarly for `_enforce`.
## Rollback Steps
For the whole cohort:
- [ ] Disable enforce: `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_6_enforce false` (restores legacy enforcement within seconds; shadow stays on so divergence keeps reporting).
- [ ] If shadow itself is the problem: also `/chatops gitlab run feature set rate_limiter_use_labkit_cohort_6 false`.
- [ ] Verify legacy counters resume incrementing for cohort 6 keys.
For a single misbehaving key (without rolling back the whole cohort):
- [ ] Drop the key's `SupportedRateLimits` entry and reship as a follow-up MR.
- [ ] Confirm the next deploy removes the key from cohort 6 dispatch.
To delete the flags from all environments after rollback:
```
/chatops gitlab run feature delete rate_limiter_use_labkit_cohort_6 --dev --pre --staging --staging-ref --production
/chatops gitlab run feature delete rate_limiter_use_labkit_cohort_6_enforce --dev --pre --staging --staging-ref --production
```
## References
- https://gitlab.com/gitlab-org/gitlab/-/merge_requests/237215
- https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/29077
- https://gitlab.com/gitlab-org/gitlab/-/work_items/599632 (cohort 3 rollout)
- https://gitlab.com/groups/gitlab-com/gl-infra/-/work_items/2021 (parent epic)