Stage 2a Cohort 1: migrate 5 rate limits to Labkit::RateLimit
Parent epic: gitlab-com/gl-infra#2021 (Phase 2: Rate Limiting Simplification)
## Summary
Migrate 5 high-profile rate limits in `Gitlab::ApplicationRateLimiter` to delegate counting to `Labkit::RateLimit::Limiter` (gem v1.14.0, post labkit-ruby!272 / Spec 8). Public API stays unchanged; the swap is internal to `_throttled?` and gated by 10 ops feature flags (one `_use_labkit_<key>` and one `_<key>_enforce` per cohort key). No upstream labkit work is required to ship this iteration.
The 5 cohort keys are `pipelines_create`, `notes_create`, `search_rate_limit`, `users_get_by_id`, `user_sign_in`. Each is independently shadow-validated for at least 24 hours before its enforcement flag is flipped. Subsequent iterations in this epic will pick up additional `IncrementPerAction` keys 5 to 10 at a time. Later cohorts pick up the `.peek`, `IncrementPerActionedResource`, and `IncrementResourceUsagePerAction` callers once labkit grows the corresponding upstream primitives.
## Problem Statement
`Gitlab::ApplicationRateLimiter` (lib/gitlab/application_rate_limiter.rb) is GitLab's application-layer rate limiter, called from controllers, API endpoints, and services in ~89 places (58 direct + 31 indirect via `CheckRateLimit`). It implements a fixed-window counter with three Redis strategies (`INCR`, `SADD`/`SCARD`, `INCRBYFLOAT`) and its own metrics, allowlist, scope-mapping, and bypass-header logic.
Phase 2 of the rate-limiting epic replaces the per-application internal counting with `Labkit::RateLimit` so that the monolith, GATE, and future services share one configuration contract and one observability surface. Without this migration, every new GitLab service has to reimplement rate-limiter primitives, and changes to throttle policy require coordinated edits across multiple codebases.
This issue is the first iteration of Stage 2a: replace internal counting with labkit for 5 carefully selected rate limits while preserving the existing `Gitlab::ApplicationRateLimiter.throttled?` public API. Success here proves the adapter pattern works against production traffic and unblocks subsequent iterations covering the remaining ~64 in-scope `IncrementPerAction` keys.
## Goals (in scope)
This issue migrates 5 high-profile rate limits to `Labkit::RateLimit::Limiter`. The migration uses the labkit API as it already stands (labkit-ruby!272 / Spec 8, released as gem v1.14.0). No upstream labkit additions are required.
The 5 keys, chosen for visibility, traffic diversity, and coverage of every shape the adapter must handle (Proc and literal thresholds; 1-minute and 10-minute windows; 1, 2, and 3-element scopes; REST API, GraphQL, controller-concern, and direct-call entry points):
| Rate-limit key | Threshold | Interval | Scope shape | Entry points |
|---|---|---|---|---|
| `pipelines_create` | Proc (settings) | 1 min | `[project, user, sha]` | direct `.throttled?` in `lib/gitlab/ci/pipeline/chain/limit/rate_limit.rb:13` |
| `notes_create` | Proc (settings) | 1 min | `[current_user]` (with allowlist) | `lib/api/notes.rb:143`, `app/graphql/mutations/notes/create/base.rb:66`, `app/controllers/concerns/notes_actions.rb:21` |
| `search_rate_limit` | Proc (settings) | 1 min | `[current_user, safe_search_scope]` | `lib/api/helpers.rb:902`, `app/controllers/concerns/search_rate_limitable.rb:15` |
| `users_get_by_id` | Proc (settings) | 10 min | `current_user`/lookup | `lib/api/users.rb:261` |
| `user_sign_in` | literal `5` | 10 min | `[user]` | `app/controllers/concerns/verifies_with_email.rb:31` (block-form `check_rate_limit! { true }`) |
These 5 keys are a proposal; the cohort owner may swap any of them, with infra-reliability sign-off, before opening MRs.
Goals:
- Migrate `_throttled?` (lib/gitlab/application_rate_limiter.rb:283) to delegate to a `Labkit::RateLimit::Limiter` for these 5 keys when the relevant per-key feature flag is on.
- Preserve the public API surface: `.throttled?`, `.throttled_request?`, `.peek`, `.resource_usage_throttled?` continue to accept the same kwargs and return the same Booleans.
- Preserve the `users_allowlist` short-circuit, the bypass-header check, and the scope→cache-key serialization.
- Preserve the `gitlab_application_rate_limiter_throttle_utilization_ratio` histogram (via a follow-up `redis.get` per migrated check; see "Technical details" below).
- Roll out behind one ops feature flag per key, with shadow validation (`:log` action) before flipping each flag to enforcement.
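The goals above imply a small branch inside `_throttled?`. A minimal self-contained sketch of the intended dual-flag delegation, with `Feature`, the legacy strategy, and `Limiter#check` replaced by hypothetical stand-ins (`FLAGS`, `legacy_check`, `labkit_check` — all names illustrative, and legacy side effects such as `log_request` are elided):

```ruby
# Stand-in for Feature.enabled? state; real code would consult ops flags.
FLAGS = Hash.new(false)

def throttled?(key, legacy_check:, labkit_check:)
  # FF off: legacy path only, labkit is never touched.
  return legacy_check.call unless FLAGS["rate_limiter_use_labkit_#{key}"]

  exceeded = labkit_check.call # labkit counts (and logs) whenever the FF is on

  if FLAGS["rate_limiter_use_labkit_#{key}_enforce"]
    exceeded            # enforce on: labkit's decision is authoritative
  else
    legacy_check.call   # shadow mode: legacy still enforces
  end
end
```

The shape matches acceptance criterion I: with `_enforce` off the return value always comes from the legacy path, so flipping `_use_labkit_<key>` on cannot change observable behavior.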
## Non-Goals
This issue does NOT cover the following:
| Concern | Call sites | Reason |
|---|---|---|
| Other `IncrementPerAction` keys (~64 call sites) | all keys not in the cohort table above | Deferred to subsequent iterations of cohort 1. The adapter will support them; we are simply not flipping their flags here. |
| `.peek` (read without increment) | 10 (notification_recipient, identity_verifiable, namespace, glql, user_risk_profile) | Labkit `Limiter#check` always increments. Future cohort once upstream `Limiter#peek` exists. |
| `IncrementPerActionedResource` (`SADD`/`SCARD`) | 1 (ee/app/services/users/abuse/git_abuse/base_throttle_service.rb) | Labkit has no Set-based strategy. Future cohort once upstream lands. |
| `IncrementResourceUsagePerAction` (`INCRBYFLOAT`) | 2 (lib/gitlab/resource_usage_limiter.rb) | Labkit has no float-cost strategy. Future cohort once upstream lands. |
| Stage 2b new Rack middleware | n/a | Tracked separately in the same epic. |
| Stage 2c response headers | n/a | Blocked on Result shape reconciliation (see Open Questions). |
| Stage 2d default rate limits | n/a | Tracked separately in the same epic. |
| Removal of legacy `Strategy` classes | n/a | Still required to serve deferred and out-of-cohort callers; removal happens only after cohort 3 ships. |
| Schema changes to `ApplicationSetting` | n/a | The adapter reads existing settings; no new columns. |
| Changes to public API signatures (`.throttled?`, `.throttled_request?`, `.peek`, `.resource_usage_throttled?`) | n/a | Behavior preservation is a requirement. |
This issue ships 5 of the 89 known call sites under labkit. The remaining ~64 `IncrementPerAction` non-peek call sites are technically unblocked (the adapter handles them), but their feature flags are not flipped here. Subsequent issues in the same epic will flip additional cohorts of 5 to 10 keys per iteration, reusing the same adapter.
## Dependencies
### Upstream (labkit-ruby) state
The base labkit-ruby API for this iteration is the Spec 8 design merged in labkit-ruby!272 on 2026-04-28 and released as gem v1.14.0 the same day. This iteration targets v1.14.0 (or later compatible 1.x release). **No upstream additions are required as blockers.**
The on-master surface this issue depends on:
- `Labkit::RateLimit::Limiter.new(name:, rules:, redis: nil, logger: nil)` and `#check(identifier) -> Result`
- `Labkit::RateLimit::Configuration` plus `Labkit::RateLimit.configure { |c| c.redis = ...; c.logger = ... }`
- `Labkit::RateLimit::Result` with `matched?`, `exceeded?`, `action`, `rule`, `error?`
- `Labkit::RateLimit::Rule.new(name:, limit:, period:, characteristics:, match: {}, action: :block)` (callable `limit`/`period` resolved at check time)
- `Labkit::RateLimit::Identifier`, with arbitrary characteristic keys accepted (no fixed `KNOWN_CHARACTERISTICS` gate)
- Compound Redis key shape: `labkit:rl:{name}:{rule_name}:{char1}:{val1}:{char2}:{val2}...`
- Internal fail-open on `StandardError` returning `Result.new(matched: false, error: true)`
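The "callable `limit`/`period` resolved at check time" behavior matters for the cohort keys, since four of the five use Proc thresholds backed by admin settings. A toy model of that resolution rule (this `Struct` is a stand-in for illustration, not the gem's `Rule` class):

```ruby
# Models only the documented resolve-at-check-time behavior of a rule's
# limit/period; not the real Labkit::RateLimit::Rule.
Rule = Struct.new(:name, :limit, :period, :characteristics, :action, keyword_init: true) do
  # A Proc limit (e.g. reading application settings) is re-evaluated on every
  # check, so admin settings changes take effect without rebuilding the Limiter.
  def resolved_limit  = limit.respond_to?(:call)  ? limit.call  : limit
  def resolved_period = period.respond_to?(:call) ? period.call : period
end

settings = { pipelines_create_limit: 25 } # stand-in for ApplicationSettings
rule = Rule.new(
  name: "pipelines_create",
  limit: -> { settings[:pipelines_create_limit] }, # Proc threshold, as in the cohort
  period: 60,
  characteristics: %w[project user sha],
  action: :log
)

rule.resolved_limit # => 25
settings[:pipelines_create_limit] = 50
rule.resolved_limit # => 50, picked up without rebuilding the rule
```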
The following items are noted as future work that would unlock subsequent cohorts but explicitly do not gate this issue:
1. `Result#current_count`. Adding the integer counter value to `Labkit::RateLimit::Result` would let the adapter populate the utilization-ratio histogram directly without a follow-up `redis.get`. Tracked as a future improvement.
2. `Labkit::RateLimit::Limiter#peek(identifier)`. A read-without-increment API, required to migrate the 10 `.peek` callers. Gates cohort 2.
3. Pluggable Redis strategies (Set, float). Required to migrate the 3 callers using `IncrementPerActionedResource` and `IncrementResourceUsagePerAction`. Gates cohort 3.
4. Reconcile `Result` shape between Spec 6 (gitlab-com/gl-infra/production-engineering#28785) and Spec 8 (labkit-ruby!272). Spec 6 defined `Result#to_response_headers` and `RuleState`; Spec 8 removed both. Stage 2c (response headers) cannot proceed until this is settled. Should be raised as a follow-up issue, but does not block this iteration.
### Pre-flight
- Bump the `gitlab-labkit` gem in the monolith's `Gemfile` and `Gemfile.lock` to v1.14.0. Routine dependency bump.
- Verify that labkit-ruby!271 (Stage 1b rule names, still open) does not conflict. Most of !271's surface (the `name:` field on `Rule`) appears to have been absorbed into !272 on master, so !271 may need a rebase or should be closed.
## Acceptance Criteria
Each scenario maps to at least one RSpec test in this issue's MR, or to a production observation in the shadow-validation phase.
### A: API surface is unchanged
**Given** an in-tree caller `Gitlab::ApplicationRateLimiter.throttled?(:k, scope: s)` for any key `:k` and scope `s`,
**When** the caller is invoked with the labkit FF for that key in any state (on or off, enforce or shadow),
**Then** the return value is a Boolean and the kwarg signature accepts the same keys it accepted before this MR.
Test: existing controller and API spec suites pass without modification; `application_rate_limiter_spec.rb` assertions on call signatures continue to pass.
### B: Cache-key compatibility within a window (in-cohort key, FF on)
**Given** `_use_labkit_pipelines_create` is on,
**When** two `.throttled?(:pipelines_create, scope: s)` calls arrive within the same `divmod`-derived period bucket,
**Then** both calls increment the same labkit Redis key, and the second call returns the count incremented by one.
**And given** the same flag state,
**When** two calls arrive on either side of a period boundary,
**Then** they increment distinct labkit Redis keys (the count resets at the boundary).
Test: spec asserting the labkit Redis key shape and using `Timecop` to step through the window boundary.
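The `divmod`-derived bucketing referenced in criterion B can be sketched in a few lines (function name illustrative): two checks share a counter only when their timestamps land in the same period bucket.

```ruby
# Fixed-window bucketing: the window index is the integer quotient of the
# timestamp by the interval, so the counter resets exactly at each boundary.
def period_bucket(epoch_seconds, interval_seconds)
  epoch_seconds.divmod(interval_seconds).first
end

interval = 60
t1 = 1_700_000_040  # exactly on a minute boundary
t2 = t1 + 55        # same window: the counter keeps accumulating
t3 = t1 + 61        # next window: the counter starts from zero

period_bucket(t1, interval) == period_bucket(t2, interval) # => true
period_bucket(t2, interval) == period_bucket(t3, interval) # => false
```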
### C: Allowlist short-circuit is preserved
**Given** a user `u` is in `users_allowlist` for key `:k`,
**When** `.throttled?(:k, scope: [u], users_allowlist: [u.username])` is called for any cohort key, with FF on or off,
**Then** no Redis writes occur (no `incr`, no `expire`, no `set`, no `sadd`) and the method returns `false`.
Test: spec wrapping a Redis double that fails any write call.
### D: Bypass header is preserved
**Given** `Gitlab::Throttle.bypass_header` is set and the request carries it as `1`,
**When** `.throttled_request?(request, current_user, :k, scope: s)` is called for any cohort key, with FF on or off,
**Then** no Redis writes occur and the method returns `false`.
Test: spec asserting no Redis writes occur and the method returns `false` when the request carries the bypass header.
### E: Utilization-ratio metric is preserved
**Given** the labkit FF for cohort key `:k` is on,
**When** `.throttled?(:k, scope: s)` is called and `Limiter#check` returns a non-error Result,
**Then** `gitlab_application_rate_limiter_throttle_utilization_ratio` is observed exactly once with `count/threshold` for the matching label set, where `count` is recovered from a follow-up `redis.get`.
**And given** the FF is off,
**When** the same call occurs,
**Then** the histogram is observed with `count/threshold` from the legacy strategy's return value.
Test: spec asserting on `Gitlab::Metrics` registry with both flag states.
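The labkit-path metric flow in criterion E (increment, recover the count with a follow-up GET, observe `count/threshold`) can be sketched with a `Hash` standing in for the Redis client; the key string and helper name are illustrative:

```ruby
# fake_redis stands in for the Redis client; a real check would INCR and then
# GET the same labkit key.
fake_redis = Hash.new(0)
observed   = []

def check_and_observe(redis, key, threshold, observed)
  redis[key] += 1                    # what Limiter#check's INCR would do
  count = redis[key]                 # the follow-up redis.get
  observed << count.to_f / threshold # histogram observation: utilization ratio
  count > threshold                  # exceeded?
end

5.times { check_and_observe(fake_redis, "labkit:rl:user_sign_in:...", 5, observed) }
observed.last # => 1.0 (the 6th check would observe 1.2 and report exceeded)
```

The follow-up GET is the cost of preserving the histogram until a hypothetical `Result#current_count` lands upstream (future-work item 1 under Dependencies).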
### F: Out-of-cohort callers are untouched
**Given** any call where the key is NOT in the 5-key cohort table, OR `peek: true` is set, OR `resource:` is passed, OR the entry point is `.resource_usage_throttled?`,
**When** the call is made with all 10 cohort flags on,
**Then** the legacy strategy is invoked, not the labkit adapter, and labkit emits no log entries for that call.
Test: spec asserting the legacy strategy is invoked and the labkit adapter is not.
### G: Fail-open semantics are preserved
**Given** Redis is unavailable (e.g. labkit's internal `redis.incr` raises `Redis::CannotConnectError`),
**When** `.throttled?(:k, scope: s)` is called for a cohort key with FF on,
**Then** the method returns `false` (allow), the labkit `Result#error?` is true, and a structured warning is logged via `Gitlab::AuthLogger`.
Test: spec stubbing labkit's internal Redis client to raise.
### H: Shadow validation passes per key (production)
**Given** `_use_labkit_<key>` is on and `_<key>_enforce` is off for a cohort key,
**When** at least 24 hours of production traffic flow,
**Then** decision divergence between labkit (`:log` action) and legacy (authoritative) is less than 0.5%, excluding window-boundary noise (defined as identical-state checks within 1 second of a period rollover).
Verification: Logs Explorer query comparing `rate_limit_check` entries (`message: "rate_limit_check"`, `name: "<key>"`) against `Application_Rate_Limiter_Request` entries (`env: "<key>_request_limit"`) over the validation window.
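The criterion H computation, once the two log streams are joined, reduces to a disagreement ratio over non-boundary checks. A sketch (field names illustrative; the real query operates on log entries, not Ruby hashes):

```ruby
# Each pair joins labkit's shadow decision with the legacy decision for the
# same request. Checks within 1 second of a period rollover are dropped as
# window-boundary noise before computing the ratio.
def divergence(pairs, interval:)
  considered = pairs.reject do |p|
    offset = p[:ts] % interval
    offset < 1 || offset >= interval - 1 # just after / just before a rollover
  end
  return 0.0 if considered.empty?

  diverged = considered.count { |p| p[:labkit_exceeded] != p[:legacy_exceeded] }
  diverged.fdiv(considered.size)
end

pairs = [
  { ts: 100, labkit_exceeded: false, legacy_exceeded: false },
  { ts: 130, labkit_exceeded: true,  legacy_exceeded: true  },
  { ts: 119, labkit_exceeded: true,  legacy_exceeded: false }, # 119 % 60 == 59: boundary, excluded
  { ts: 150, labkit_exceeded: true,  legacy_exceeded: false }, # genuine divergence
]
divergence(pairs, interval: 60) # => 1/3, far above the 0.5% bar
```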
### I: Dual-flag semantics work correctly
**Given** `_use_labkit_<key>` is on but `_<key>_enforce` is off,
**When** `.throttled?(:k, scope: s)` is called and the labkit count exceeds the threshold,
**Then** the method returns `false` (legacy enforces, labkit does not block); the labkit log entry contains `action: "log"` and `exceeded: true`.
**And given** `_<key>_enforce` is then turned on,
**When** the call is repeated above threshold,
**Then** the method returns `true` (labkit blocks); the labkit log entry contains `action: "block"` and `exceeded: true`.
Test: spec exercising both flag states.
## Security Considerations
- **Bypass surface unchanged.** The `Gitlab::Throttle.bypass_header` check stays in `throttled_request?` above the labkit branch (lib/gitlab/application_rate_limiter.rb:202). The allowlist check stays in `_throttled?` above the labkit branch (lib/gitlab/application_rate_limiter.rb:286). No new bypass path is introduced. Verified by acceptance criteria C and D.
- **Config injection.** Threshold and interval values come from `Gitlab::CurrentSettings.current_application_settings` (admin-controlled) or static literals in the `rate_limits` registry. The labkit adapter reads from the same sources; no new injection surface. Rate-limit keys themselves are static `Symbol` values compiled into the `rate_limits` hash, not user-controlled.
- **Limit-value exposure.** The labkit `Result` carries the matched `Rule` (including `limit`, `period`, `characteristics`). This Result is consumed inside `_throttled?` and only its `exceeded?` Boolean is propagated to callers. The `Rule` object is never serialized into a response body, error message, or log entry visible to unauthenticated users. The adapter spec must assert that `Result#rule` is not referenced from any `render` or response-building path.
- **Key-collision safety.** The labkit Redis key shape `labkit:rl:{name}:{rule_name}:{char1}:{val1}` is disjoint from the legacy `application_rate_limiter:{key}:{scope}:{period_key}` shape (different prefix, different format). Even with both code paths running on the same Redis instance during shadow validation, neither can read or overwrite the other's keys. There is no "ghost increment" risk where one path's writes leak into the other path's counter.
- **Privilege.** All 10 ops feature flags require the standard `Feature.enable` / `Feature.disable` privilege (admin-controlled). No new privilege-escalation surface.
- **Cardinality.** The adapter's `period_key` is bounded by interval (1-min interval means ~1440 buckets per day per scope tuple). Redis keys naturally expire after the rule period. The Prometheus utilization-ratio histogram label set (`throttle_key`, `peek`, `feature_category`) is unchanged from today, so no new label cardinality risk.
- **Logging hygiene.** Labkit's structured logging includes the `identifier` hash, which contains the serialized scope (e.g. `"user:42:project:99"`). User IDs and project IDs are already logged by `log_request` today, so this is not a new exposure. Labkit must not log raw user input (e.g. unsanitized search queries from `:search_rate_limit` scope). The scope serializer must convert ActiveRecord models to `model_name:id` and skip non-ID attributes. Verified by spec.
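The scope-serialization contract in the last bullet (models become `model_name:id`, never raw attributes) can be sketched as follows. `Project` and `User` here are hypothetical stubs, not ActiveRecord models, and `serialize_scope` is a stand-in for the real serializer:

```ruby
Project = Struct.new(:id)
User    = Struct.new(:id)

# Model-like objects collapse to "model_name:id"; plain values (e.g. a SHA)
# pass through as strings. No other attributes ever reach the identifier,
# which keeps unsanitized user input out of labkit's structured logs.
def serialize_scope(scope)
  Array(scope).map do |item|
    if item.respond_to?(:id)
      "#{item.class.name.downcase}:#{item.id}"
    else
      item.to_s
    end
  end.join(":")
end

serialize_scope([Project.new(99), User.new(42), "deadbeef"])
# => "project:99:user:42:deadbeef"
```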
## Rollout & Backwards Compatibility
### Self-managed and Dedicated
The 10 ops flags default to off. Self-managed installations (including Dedicated) get unchanged behavior on upgrade. The Gemfile bump to gitlab-labkit v1.14.0 is the only change visible without flag flips, and v1.14.0 only adds new features (no breaking changes relative to v1.13). Self-managed admins can opt into the new path by enabling `_use_labkit_<key>` for any cohort key; if a bug surfaces, flipping the flag off restores the legacy path within seconds.
If `Gitlab::Redis::RateLimiting` is misconfigured on a self-managed install (e.g. wrong host), the labkit branch fails open per criterion G, matching today's `with_suppressed_errors` behavior.
### Feature flags
Two ops flags per cohort key (10 total):
| Flag | Default | Effect when on |
|---|---|---|
| `:rate_limiter_use_labkit_pipelines_create` | off | Adapter calls `Limiter#check` for `:pipelines_create`. The legacy path continues to run and enforce until `_enforce` is also on. |
| `:rate_limiter_use_labkit_pipelines_create_enforce` | off | The labkit `Rule` is constructed with `action: :block` instead of `:log`. |
| (same pattern for `notes_create`, `search_rate_limit`, `users_get_by_id`, `user_sign_in`) | off | (same) |
No global kill-switch. Flags are independent: turning one off does not affect the others. Rationale: labkit's internal fail-open already handles Redis unavailability, and a global flag adds operational ambiguity (which flag wins?) without buying extra safety. Each key is rolled out and rolled back on its own.
### Dark launch sequence
1. **Per-key adapter introduction:** Open MR adding `Gitlab::ApplicationRateLimiter::LabkitAdapter` plus the 10 flags, all defaulted off. No behavior change.
2. **Per-key shadow:** For one key at a time, enable `_use_labkit_<key>` (with `_<key>_enforce` off). Labkit increments and logs but does not block; legacy path still enforces. Soak for 24 hours minimum. Confirm criterion H is met.
3. **Per-key enforcement flip:** Enable `_<key>_enforce`. Labkit's decision is now authoritative. Soak for 24 hours minimum.
4. **Sequence:** Lowest-traffic key first (`user_sign_in`), then `users_get_by_id`, `pipelines_create`, `search_rate_limit`, `notes_create` last. One key per deploy cycle.
5. **Rollback:** For any key, flipping `_<key>_enforce` off returns enforcement to legacy within seconds (next request after FF re-evaluation). Flipping `_use_labkit_<key>` off stops labkit Redis writes entirely. Both flags can be flipped independently.
### Public API stability
`Gitlab::ApplicationRateLimiter.throttled?`, `.throttled_request?`, `.peek`, `.resource_usage_throttled?` keep their existing signatures and return types. No GraphQL schema changes. No REST API changes. No migration. No `ApplicationSetting` schema changes.
## Validation Loop / Verification Process
Pre-MR validation by the implementing engineer or agent:
1. **Spec suite.** Run `bundle exec rspec spec/lib/gitlab/application_rate_limiter_spec.rb spec/lib/gitlab/application_rate_limiter/labkit_adapter_spec.rb spec/controllers/concerns/check_rate_limit_spec.rb` and confirm all pass.
2. **All-criteria coverage.** Confirm acceptance criteria A through G and I each have at least one passing spec example with the criterion ID quoted in the example name (e.g. `it "[Criterion B] increments the same labkit Redis key within a window"`).
3. **Local smoke test.** In GDK with a cohort flag on:
- `Feature.enable(:rate_limiter_use_labkit_user_sign_in)`
- Attempt 6 sign-ins from one user/IP within 10 minutes via the web UI.
- Confirm the 6th gets a 429 (or the controller's redirect equivalent).
- Inspect Redis: `redis-cli --scan --pattern 'labkit:rl:user_sign_in:*'` should show one or more keys.
- `Feature.disable(...)` and repeat: legacy path enforces; `application_rate_limiter:user_sign_in:*` keys appear instead.
4. **Rubocop and Danger.** Routine.
5. **No removed APIs.** Confirm `git diff master -- lib/gitlab/application_rate_limiter.rb` shows no removed public methods or kwarg changes.
Post-deploy validation (per cohort key, sequenced):
6. **Shadow window.** After enabling `_use_labkit_<key>` (with `_<key>_enforce` off), observe at least 24 hours of production traffic. Compute decision divergence per criterion H. If > 0.5%, disable `_use_labkit_<key>` and investigate before retry.
7. **Enforcement flip rehearsal in staging.** Flip `_<key>_enforce` on in staging during normal traffic. Confirm enforcement comes from labkit (check log entries for `action: "block"` on exceeded counters). Flip off after 30 minutes; confirm enforcement returns to legacy.
8. **Production enforcement flip.** Repeat in production. Soak 24 hours.
9. **Histogram sanity.** After each enforcement flip, query the utilization-ratio histogram for the cohort key. Confirm values appear in non-zero buckets and that p99 utilization for the key has not shifted more than 10% from the pre-flip baseline.
## Observability
- **Metrics.** `gitlab_application_rate_limiter_throttle_utilization_ratio` continues to emit on every check. Labels (`throttle_key`, `peek`, `feature_category`) unchanged. Validate via `/metrics` scrape and the existing SRE rate-limit saturation dashboard.
- **Logs.** Two streams during the migration:
- `Application_Rate_Limiter_Request`: emitted by the legacy path. Continues to fire for FF-off keys and for over-limit requests on FF-on keys (since `log_request` runs after `throttled?` returns true, regardless of which branch produced the decision).
- `rate_limit_check`: emitted by labkit's `Evaluator` for every `Limiter#check`. Routed to `Gitlab::AuthLogger` via the Rails initializer (`Labkit::RateLimit.configure { |c| c.logger = Gitlab::AuthLogger }`). Carries `name`, `rule_name`, `current_count`, `limit`, `period`, `action`, `exceeded`, `identifier`.
- **Dashboards.** No SRE dashboard changes required for this iteration. Existing saturation alerts on `gitlab_application_rate_limiter_throttle_utilization_ratio >= 0.75` continue to fire.
- **Shadow-window query.** During each per-key shadow validation phase, run a Logs Explorer query joining `rate_limit_check` and `Application_Rate_Limiter_Request` entries by approximate timestamp and identifier scope, computing the divergence ratio. Owner of cohort: paste a screenshot of the query result and the divergence number into this issue before flipping the corresponding `_enforce` flag.
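The logger wiring described above would live in a Rails initializer. A sketch using only the `Labkit::RateLimit.configure` surface listed under Dependencies — the file path is illustrative, and whether `Gitlab::Redis::RateLimiting` can be handed over directly or needs a wrapper is an assumption to confirm during implementation:

```ruby
# config/initializers/labkit_rate_limit.rb (path illustrative)
Labkit::RateLimit.configure do |c|
  c.redis  = Gitlab::Redis::RateLimiting # assumption: same Redis as the legacy limiter
  c.logger = Gitlab::AuthLogger          # routes rate_limit_check entries to auth logs
end
```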
## References
- Epic: gitlab-com/gl-infra#2021
- Current implementation: lib/gitlab/application_rate_limiter.rb
- Strategies: lib/gitlab/application_rate_limiter/{base_strategy,increment_per_action,increment_per_actioned_resource,increment_resource_usage_per_action}.rb
- Controller concern: app/controllers/concerns/check_rate_limit.rb
- Labkit Limiter API redesign: labkit-ruby!272 (Spec 8, merged 2026-04-28, released as v1.14.0)
- Labkit rule-name keys: labkit-ruby!271 (Stage 1b, still open; likely subsumed by !272)
- Spec 6 (Result/RuleState/headers): gitlab-com/gl-infra/production-engineering#28785 (stale relative to merged Spec 8; needs reconciliation)