Implement aggregation for metrics with endpoint_id

Problem: High Cardinality from endpoint_id

Within our GitLab application metrics, several labels contribute to high-cardinality metrics, with endpoint_id being one of the most impactful. Other areas have labels with more unique values, such as queryid from Postgres, but endpoint_id contributes the most to cardinality overall. This is due to the scale at which we run Rails pods, which have much higher turnover because of scaling events, deployments, and frequent rollout operations.

There have been some discussions about this previously, but because of how this label is currently used, it’s non-trivial to remove or modify at the source.

Ambiguity in endpoint_id

The concept of an “endpoint” varies depending on context:

An API route with a method (e.g., GET /api/:version/projects/:id)
A Rails controller
A GraphQL operation

The endpoint_id is built from endpoint_id_for_route, which uses the route.origin.

One discussed approach was to reduce cardinality by grouping endpoints more logically. For example:

GET /api/:version/packages/conan/v1/conans/:package_name/:package_version/:package_username/:package_channel
↓
GET /api/:version/packages

This would retain enough high-level insight for triage, while logs could be used for deeper inspection (and ideally traces in the future).

However:

This only applies to route-based endpoints
Rails controller and GraphQL endpoints would still pose challenges
Truncating or grouping routes intelligently would likely require a static map or override system to maintain
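
To make the trade-off concrete, here is a minimal sketch of prefix-based grouping with a static override map. The names group_endpoint and OVERRIDES, the default depth of 3, and the ProjectsController#show value in the usage note are all hypothetical and do not reflect how endpoint_id_for_route works today.

```python
# Hypothetical sketch: collapse a route-based endpoint_id into a coarser group.
# OVERRIDES and group_endpoint are illustrative names, not existing GitLab code.

OVERRIDES = {
    # Routes that need a hand-picked group instead of the default prefix rule.
    "GET /api/:version/projects/:id": "GET /api/:version/projects",
}


def group_endpoint(endpoint_id: str, depth: int = 3) -> str:
    """Keep the HTTP method plus the first `depth` path segments."""
    if endpoint_id in OVERRIDES:
        return OVERRIDES[endpoint_id]
    method, sep, path = endpoint_id.partition(" ")
    if not sep or not path.startswith("/"):
        # Not route-shaped (for example a controller or GraphQL operation):
        # return it unchanged, since the prefix rule does not apply.
        return endpoint_id
    segments = [s for s in path.split("/") if s]
    return f"{method} /" + "/".join(segments[:depth])
```

With these assumptions, the Conan route above collapses to GET /api/:version/packages, while a value that is not route-shaped, such as a hypothetical ProjectsController#show, passes through unchanged. That pass-through is exactly the gap described in the caveats above.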

While we should continue to explore long-term solutions for reducing the granularity of emitted metrics, this is not a quick fix. We should still address this in the future, to ensure we don't continue with unbounded label growth as the product evolves.

Proposal: Recording Rules to Aggregate High-Cardinality Metrics

In the shorter term, we can mitigate the impact of endpoint_id by using recording rules to drop volatile labels like pod, instance, and node from metrics where endpoint_id is present.

Why this matters

Consider the metric: gitlab_sli_rails_request_apdex_success_total

At the time of inspection, its labels include (but are not limited to):

endpoint_id: 2335 unique values
pod: 2533 unique values
instance: 2533 unique values
node: 382 unique values

Additionally:

endpoint_id appears in 20.6% of all series
Due to pod churn, the combination of these labels can cause a rapid explosion of unique series being pushed to Mimir.

You can explore this further via the label cardinality dashboard. Values will vary between peak and off-peak usage, as they are snapshotted at the time of query.

While endpoint_id on its own has a large number of unique values, the dimensions added by churn on less useful labels like pod have a significant effect on the number of series generated.
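
As a rough illustration of that multiplication, the snippet below computes a worst-case bound from the snapshot values quoted above. It rests on two assumptions that are not measured here: instance and node are fully determined by pod (so they add no extra combinations), and every pod eventually serves every endpoint. Real series counts will be lower, but the ratio shows why the pod dimension dominates.

```python
# Snapshot values quoted above for gitlab_sli_rails_request_apdex_success_total.
endpoint_ids = 2335
pods = 2533

# Assumption: instance and node are determined by pod, so they contribute
# no combinations beyond what pod already does.
# Assumption (worst case): every pod eventually serves every endpoint.
worst_case_with_pod = endpoint_ids * pods
worst_case_without_pod = endpoint_ids

# Upper bound per combination of the remaining labels.
print(worst_case_with_pod)                            # 5914555
print(worst_case_with_pod // worst_case_without_pod)  # 2533
```

Pod churn makes this worse over time: each replaced pod brings a new pod value, and with it a fresh set of series for every endpoint it serves.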

Suggested Action

Introduce recording rules to:

Aggregate metrics that include endpoint_id
Drop per-pod and per-instance dimensions (pod, instance, node) where they aren’t adding meaningful differentiation

This will:

Reduce series churn and ingestion cost
Improve query performance
Preserve meaningful insights at the endpoint level
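
In PromQL terms, one possible shape for such a rule expression is sum without (pod, instance, node) (rate(gitlab_sli_rails_request_apdex_success_total[5m])). The 5m window and the use of rate are assumptions, not an agreed design. The sketch below emulates what that aggregation does to a handful of made-up samples; sum_without and all label values are hypothetical.

```python
from collections import defaultdict

# Volatile labels the proposal suggests dropping.
DROP = {"pod", "instance", "node"}


def sum_without(samples, drop=DROP):
    """Emulate PromQL `sum without (<drop>) (...)` over (labels, value) pairs."""
    totals = defaultdict(float)
    for labels, value in samples:
        # Series that differ only in dropped labels share the same key.
        key = frozenset((k, v) for k, v in labels.items() if k not in drop)
        totals[key] += value
    return dict(totals)


# Made-up samples: two pods serving the same endpoint, plus one other endpoint.
samples = [
    ({"endpoint_id": "GET /api/:version/projects/:id",
      "pod": "web-a", "instance": "10.0.0.1:8080", "node": "n1"}, 3.0),
    ({"endpoint_id": "GET /api/:version/projects/:id",
      "pod": "web-b", "instance": "10.0.0.2:8080", "node": "n2"}, 4.0),
    ({"endpoint_id": "GET /api/:version/packages",
      "pod": "web-a", "instance": "10.0.0.1:8080", "node": "n1"}, 1.0),
]

aggregated = sum_without(samples)
for key, total in aggregated.items():
    print(dict(key), total)
```

Three input series collapse to two, one per endpoint_id, and the per-endpoint totals are preserved. The per-pod breakdown is what gets lost, which is the intended trade-off.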

Long-Term

Ultimately, a proper distributed tracing system will allow us to:

Correlate metrics with traces
Navigate from high-level SLI failures to specific endpoints and requests
Eliminate the need to encode this much granularity into metric labels

Until then, recording rules provide a practical and incremental step toward managing cardinality more effectively.