Stage 2a: Migrate all ApplicationRateLimiter call sites to labkit
## Summary

This issue tracks the full migration of `Gitlab::ApplicationRateLimiter` call sites to `Labkit::RateLimit`. The migration is done in cohorts, each following the repeatable process below.

Parent epic: https://gitlab.com/groups/gitlab-com/gl-infra/-/work_items/2021

## Cohorts

| Cohort | Issue | Keys | Strategy | Labkit prerequisite | Status |
|---|---|---|---|---|---|
| 1 | https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28803 | `pipelines_create`, `notes_create`, `search_rate_limit`, `users_get_by_id`, `user_sign_in` | IncrementPerAction | None (v1.14.0) | In progress |
| 2 | https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28809 | Remaining ~64 IncrementPerAction keys (non-peek) | IncrementPerAction | None | Draft |
| 3 | https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28810 | 10 `.peek` callers | IncrementPerAction (peek) | `Limiter#peek` in labkit | Draft |
| 4 | https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28811 | 1 `IncrementPerActionedResource` caller | Set-based (SADD/SCARD) | Set strategy in labkit | Draft |
| 5 | https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28812 | 2 `IncrementResourceUsagePerAction` callers | Float-cost (INCRBYFLOAT) | Float-cost strategy in labkit | Draft |

_The cohort table will be updated with issue links as draft issues are created and refined._

## Repeatable migration process per cohort

Extracted from https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28803 (Cohort 1). Each cohort follows this process:

### 1. Key selection

Select 5-10 rate limit keys for the cohort. Consider:

- Traffic volume diversity (mix of high and low traffic keys)
- Scope shape diversity (different numbers of scope elements)
- Entry point diversity (REST API, GraphQL, controller concern, direct call)
- Redis headroom (see https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28807)

### 2. Feature flags

Two ops flags per key:

- `rate_limiter_use_labkit_<key>`: enables the labkit adapter for this key (default: off)
- `rate_limiter_use_labkit_<key>_enforce`: switches the labkit rule action from `:log` to `:block` (default: off)

Flags are independent per key. No global kill switch is needed: labkit fails open on Redis errors.

### 3. Adapter implementation

Inside `ApplicationRateLimiter._throttled?`, when the use-labkit flag is on for a key:

- Construct the call site name from the rate limit key name
- Construct the identifier from the scope objects (serialized as key-value pairs)
- Build a single-element rules array with characteristics, limit, period, and action derived from existing config
- Call `Labkit::RateLimit::Limiter#check(identifier)` and use the result
- Preserve: allowlist short-circuit, bypass header check, utilization-ratio histogram

### 4. Shadow validation (per key)

1. Enable `_use_labkit_<key>` with `_<key>_enforce` off
2. Labkit counts and logs with `action: :log` but does not block; legacy path still enforces
3. Soak for minimum 24 hours of production traffic
4. Compare labkit decisions against legacy decisions; divergence must be < 0.5% (excluding window-boundary noise within 1 second of period rollover)
5. Post screenshot of divergence query result to the cohort issue

### 5. Enforcement flip (per key)

1. Enable `_<key>_enforce`; labkit's decision is now authoritative
2. Soak for minimum 24 hours
3. Monitor utilization-ratio histogram; p99 should not shift more than 10% from pre-flip baseline

### 6. Rollback

- Flip `_<key>_enforce` off → enforcement returns to legacy within seconds
- Flip `_use_labkit_<key>` off → stops labkit Redis writes entirely
- Both flags independent, can be flipped separately

### 7. Sequence

Roll out lowest-traffic key first, one key per deploy cycle. Sequence within each cohort is determined by the cohort issue.

## Requirements for cohort completion

A cohort is complete when:

- All keys in the cohort have `_use_labkit_<key>` and `_<key>_enforce` both enabled in production
- Shadow validation passed for each key (< 0.5% divergence)
- Enforcement soak passed for each key (24h, histogram stable)
- Evidence posted to the cohort issue for each key
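
For reference, the flag naming and the flag-state combinations described in the feature flag, shadow validation, enforcement flip, and rollback steps can be sketched roughly as follows. This is a minimal illustration, not the adapter code: `labkit_flag_names` and `rollout_mode` are hypothetical helper names, and the behaviour when the enforce flag is on while the use-labkit flag is off (legacy enforces, no labkit writes) is an assumption inferred from the rollback notes.

```ruby
# Hypothetical helpers for illustration only; they do not exist in GitLab or labkit.
FLAG_PREFIX = 'rate_limiter_use_labkit_'

# Derives the two per-key ops flag names from a rate limit key.
def labkit_flag_names(key)
  use_flag = "#{FLAG_PREFIX}#{key}"
  { use: use_flag, enforce: "#{use_flag}_enforce" }
end

# Maps the two flag states to the path whose decision is authoritative
# and to the action the labkit rule would carry.
def rollout_mode(use_labkit:, enforce:)
  # Assumption: with the use-labkit flag off, the adapter is skipped entirely,
  # so the enforce flag has no effect and labkit performs no Redis writes.
  return { authoritative: :legacy, labkit_action: nil } unless use_labkit

  if enforce
    # Enforcement flip: labkit blocks and its decision is authoritative.
    { authoritative: :labkit, labkit_action: :block }
  else
    # Shadow validation: labkit counts and logs, legacy still enforces.
    { authoritative: :legacy, labkit_action: :log }
  end
end
```

For example, `rollout_mode(use_labkit: true, enforce: false)` describes the shadow validation state, and flipping only `enforce` back to `false` after an enforcement flip returns to that same state without stopping labkit counting.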
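
The two numeric acceptance checks (divergence below 0.5% during shadow validation, p99 shift of at most 10% after the enforcement flip) could be evaluated along these lines. This is a sketch under stated assumptions: the `Sample` record shape, the helper names, and the choice to treat "within 1 second" as inclusive are all hypothetical, and the real comparison is expected to come from the divergence query posted to the cohort issue.

```ruby
# Hypothetical record: one request observed while a key is in shadow mode.
Sample = Struct.new(:legacy_throttled, :labkit_throttled, :seconds_from_rollover,
                    keyword_init: true)

DIVERGENCE_THRESHOLD    = 0.005 # divergence must be < 0.5%
BOUNDARY_WINDOW_SECONDS = 1     # assumption: "within 1 second" is inclusive
MAX_P99_SHIFT           = 0.10  # p99 should not shift more than 10%

# Fraction of samples where the two limiters disagree, ignoring samples
# close to a period rollover. Returns nil when nothing is left to compare.
def divergence_ratio(samples)
  considered = samples.reject { |s| s.seconds_from_rollover.abs <= BOUNDARY_WINDOW_SECONDS }
  return nil if considered.empty?

  diverged = considered.count { |s| s.legacy_throttled != s.labkit_throttled }
  diverged.to_f / considered.size
end

# Shadow validation passes only with data and a ratio under the threshold.
def shadow_validation_passed?(samples)
  ratio = divergence_ratio(samples)
  !ratio.nil? && ratio < DIVERGENCE_THRESHOLD
end

# Relative shift of the utilization-ratio p99 against the pre-flip baseline.
def p99_stable?(baseline_p99, current_p99)
  raise ArgumentError, 'baseline must be positive' unless baseline_p99.positive?

  ((current_p99 - baseline_p99).abs / baseline_p99.to_f) <= MAX_P99_SHIFT
end
```

Returning `nil` for an empty comparison set is a deliberate choice here, so that a key with no usable traffic during the soak does not pass validation by default.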