Add Prometheus metrics for labkit rate limit checks
Add Prometheus metrics to `Labkit::RateLimit` to observe rate limiting behavior without additional log volume.

Parent epic: https://gitlab.com/groups/gitlab-com/gl-infra/-/work_items/2021

Context: https://gitlab.com/gitlab-org/ruby/gems/labkit-ruby/-/merge_requests/272#note_3294183874

## Existing metrics inventory

The new labkit metrics must provide equivalent observability to the existing rate limiting metrics, so dashboards can be updated to show labkit-based rate limiting alongside or instead of the legacy metrics.

### ApplicationRateLimiter

| Metric | Type | Labels | What it measures |
|---|---|---|---|
| `gitlab_application_rate_limiter_throttle_utilization_ratio` | Histogram (buckets: 0.25, 0.5, 0.75, 1.0) | `throttle_key`, `peek`, `feature_category` | Ratio of the current count to the threshold. Used in the [Rate Limiting Overview dashboard](https://dashboards.gitlab.net/d/rate-limiting-rate-limiting_overview) via bucket subtraction to compute the throttled request rate: `rate(bucket{le="+Inf"}) - rate(bucket{le="1"})`. |

### RackAttack

| Metric | Type | Labels | What it measures |
|---|---|---|---|
| `gitlab_rack_attack_events_total` | Counter | `event_type`, `event_name` | Total RackAttack events (throttle/blocklist/track); rate of events per throttle name. |
| `gitlab_rack_attack_throttle_limit` | Gauge | `event_name` | Configured limit per throttle. |
| `gitlab_rack_attack_throttle_period_seconds` | Gauge | `event_name` | Configured period per throttle. |

### Dashboard usage

The [Rate Limiting Overview dashboard](https://dashboards.gitlab.net/d/rate-limiting-rate-limiting_overview) (source: `runbooks/dashboards/rate-limiting/main.dashboard.jsonnet`) uses these metrics in its RackAttack and ApplicationRateLimiter sections. No alerts are currently configured on these metrics; they are dashboard-only for observability.
## Proposed labkit metrics

### Counters

| Metric | Labels | Purpose |
|---|---|---|
| `gitlab_labkit_rate_limiter_calls_total` | `rate_limiter`, `rule`, `action` | Incremented on every **successful** `check` call (no Redis errors). The `action` label distinguishes outcomes (see below). Equivalent of `gitlab_rack_attack_events_total`. |
| `gitlab_labkit_rate_limiter_errors_total` | `rate_limiter` | Incremented when a `check` call fails (Redis unavailable, etc.). Kept separate from `calls_total` so the successful-check metrics stay clean. |

**`action` label values on `calls_total`:**

| `action` | `rule` | Meaning |
|---|---|---|
| `"block"` | rule name | Rule matched with `action: :block` and the count exceeded the limit; request blocked |
| `"log"` | rule name | Rule matched with `action: :log` and the count exceeded the limit; would have blocked, only logged (shadow mode) |
| `"allow"` | rule name | Rule matched but the count is within the limit; request allowed |
| `"allow"` | `"unmatched"` | No rule matched; request allowed (no rate limit applied) |

**Error handling:** When Redis fails, only `errors_total` is incremented. `calls_total` is NOT incremented for failed checks; it reflects only successful checks where we have a definitive outcome. The error is also visible via `result.error?` and the structured warning log.
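The label mapping above can be sketched as a small pure function. The rule hash shape and helper names here are illustrative assumptions for this issue, not the final `Evaluator` API:

```ruby
# Sketch of the action/rule label derivation for calls_total.
# `rule` is the matched rule (nil when no rule matched);
# `over_limit` is whether the count exceeded the resolved limit.
def action_label(rule, over_limit)
  return "allow" if rule.nil? || !over_limit   # unmatched, or matched but within limit
  rule[:action] == :block ? "block" : "log"    # exceeded: real block vs shadow mode
end

def rule_label(rule)
  rule.nil? ? "unmatched" : rule[:name].to_s
end
```

Keeping this as a single mapping makes it easy to spec every row of the table above in one place.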
**Useful PromQL queries:**

- Total successful calls: `sum(rate(gitlab_labkit_rate_limiter_calls_total[5m]))`
- Blocked rate: `sum(rate(gitlab_labkit_rate_limiter_calls_total{action="block"}[5m]))`
- Would-have-blocked (shadow): `sum(rate(gitlab_labkit_rate_limiter_calls_total{action="log"}[5m]))`
- Unmatched rate: `sum(rate(gitlab_labkit_rate_limiter_calls_total{rule="unmatched"}[5m]))`
- Error rate: `sum(rate(gitlab_labkit_rate_limiter_errors_total[5m]))`
- Error ratio: `rate(errors_total[5m]) / (rate(calls_total[5m]) + rate(errors_total[5m]))`

### Gauges

| Metric | Labels | Multiprocess mode | Purpose |
|---|---|---|---|
| `gitlab_labkit_rate_limiter_limit` | `rate_limiter`, `rule` | `:max` | The configured limit value per rule (resolved from a callable if applicable). Equivalent of `gitlab_rack_attack_throttle_limit`. |
| `gitlab_labkit_rate_limiter_period_seconds` | `rate_limiter`, `rule` | `:max` | The configured period per rule (resolved from a callable if applicable). Equivalent of `gitlab_rack_attack_throttle_period_seconds`. |

Gauges are only set on successful matched checks (when we have the resolved values).

### Multiprocess mode for gauges

GitLab runs Puma with multiple workers. Each worker sets the same gauge value, since the configured limit/period is identical across workers. Using `multiprocess_mode: :max` ensures only **one value per label set** is emitted when Prometheus scrapes, avoiding N duplicate per-worker copies.

```ruby
Labkit::Metrics::Client.gauge(
  :gitlab_labkit_rate_limiter_limit,
  'The configured rate limit threshold',
  { rate_limiter: nil, rule: nil },
  :max
)
```

The existing RackAttack gauges use the default `:all` mode (per-worker duplicates); the new labkit gauges improve on this.

Reference: the `prometheus-client-mmap` gem. Gauge multiprocess modes are `:all`, `:liveall`, `:livesum`, `:max`, `:min`; see `lib/prometheus/client/helper/metrics_processing.rb` for the merge behavior.
## Implementation plan

### Files to create/modify

| File | Action | Purpose |
|---|---|---|
| `lib/labkit/rate_limit/metrics.rb` | Create | Module with 4 memoized metric accessors (2 counters, 2 gauges) using `Labkit::Metrics::Client` |
| `lib/labkit/rate_limit/evaluator.rb` | Modify | Emit metrics after evaluation: `calls_total` on success, `errors_total` on failure, gauges on match |
| `lib/labkit/rate_limit.rb` | Modify | Add `autoload :Metrics` |
| `spec/labkit/rate_limit/metrics_spec.rb` | Create | Verify metric definitions, types, labels, multiprocess mode |
| `spec/labkit/rate_limit/evaluator_spec.rb` | Modify | Tests for metrics emission: all `action` label values, the error counter, gauges with resolved callable values |

### Metrics emitted in Evaluator

The `Evaluator` has direct access to the matched rule, the resolved limit/period, and the result. Metrics are emitted at three points:

1. **Successful match:** After `evaluate_rule` succeeds, increment `calls_total` (with the `action` label) and set both gauges with the resolved limit/period.
2. **No match:** After the rules loop finds no match, increment `calls_total` with `rule: "unmatched"`, `action: "allow"`.
3. **Error:** In the `rescue` block, increment `errors_total` only. `calls_total` is NOT incremented.

All metric calls are wrapped in their own `rescue StandardError` so that metrics emission never breaks the rate limit check.

## Future investigation: utilization ratio histogram

The ApplicationRateLimiter emits `gitlab_application_rate_limiter_throttle_utilization_ratio`, a histogram showing how close each rate limit is to its threshold (buckets at 25%, 50%, 75%, 100%). The dashboard uses bucket subtraction to derive the throttled request rate.

With the `gitlab_labkit_rate_limiter_calls_total{action="block"}` counter, we get the throttled rate directly. The open question is whether the utilization *distribution* (seeing limits at 75% before they fire) adds enough value to justify the cardinality cost of a histogram.
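The "never breaks the check" guarantee can be sketched as below. `FailingCounter` is a self-contained stand-in for a metric whose write fails, and `safe_increment` is a hypothetical helper name, not existing labkit code:

```ruby
# Stand-in metric whose increment always fails (e.g. an mmap write error).
class FailingCounter
  def increment(labels: {})
    raise IOError, "mmap write failed"
  end
end

# Emission wrapper: swallow any StandardError so the rate limit
# check's outcome is never affected by a metrics failure.
def safe_increment(metric, labels)
  metric.increment(labels: labels)
  true
rescue StandardError
  false
end
```

At each of the three emission points the `Evaluator` would go through such a wrapper (or an inline `rescue`) rather than calling the metric directly, so a broken metrics backend degrades to lost data points instead of failed checks.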
This requires `current_count` on the result object, which is tracked in https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28785. Once that ships, we can evaluate whether to add:

| Metric | Type | Labels | Buckets |
|---|---|---|---|
| `gitlab_labkit_rate_limiter_utilization_ratio` | Histogram | `rate_limiter`, `rule` | `[0.25, 0.5, 0.75, 1.0]` |

Decision deferred until #28785 is complete and we can assess the need based on dashboard usage.
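If the histogram is adopted, registration could mirror the gauge example above. Note this is only a sketch: whether `Labkit::Metrics::Client` exposes a `histogram` call with this signature (buckets as the last argument) is an assumption to confirm against the client API:

```ruby
# Hypothetical registration, mirroring the gauge call earlier in this issue.
Labkit::Metrics::Client.histogram(
  :gitlab_labkit_rate_limiter_utilization_ratio,
  'Ratio of the current count to the configured rate limit',
  { rate_limiter: nil, rule: nil },
  [0.25, 0.5, 0.75, 1.0]
)
```

The observed value per check would be `current_count.to_f / limit`, which is why this is blocked on #28785.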