Feature category & stage group metrics take a long time to evaluate in Thanos

Coming from gitlab-com/runbooks!4125 (comment 744157549). Thanos consistently can't evaluate the feature category & stage group aggregations within its evaluation interval. This is visible in these metrics.

The main cause of this is likely the high cardinality of the feature category aggregation in Prometheus, mostly from the puma (old) and rails_requests (new) components.

| label | values | count |
|---|---|---|
| __name__ | ops, error, apdex:weight, apdex:success | 4 |
| type | web, api, git | 3 |
| feature_category | source_code_management, project_management, ... | 69 |
| stage | main, cny | 2 |
| env | gprd, gstg, pre | 3 |
| region | us-east1-d, ... | 4 |
| cluster | gprd-gitlab-gke | 9 |
| total (product of counts) | | 178848 |

Scratchpad with queries where I got the information

This is very much a worst-case scenario. In reality, the rails_requests components result in at most 1800 combinations (times 4 metrics), which is still high.
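
For reference, the arithmetic behind these numbers can be reproduced with a few lines of Python. This is only a sketch: the per-label counts are copied from the table above, the variable names are made up for illustration, and reading "1800 combinations (times 4 metrics)" as 1800 × 4 series is my interpretation of the sentence above.

```python
from math import prod

# Number of distinct values per label, copied from the table above.
label_cardinalities = {
    "__name__": 4,
    "type": 3,
    "feature_category": 69,
    "stage": 2,
    "env": 3,
    "region": 4,
    "cluster": 9,
}

# Worst case: every label value occurs together with every other label value.
worst_case_series = prod(label_cardinalities.values())

# Figure quoted above for the rails_requests components: at most 1800
# label combinations, each of which exists once per metric name.
realistic_combinations = 1800
realistic_series = realistic_combinations * label_cardinalities["__name__"]

print(worst_case_series)  # 178848
print(realistic_series)   # 7200
```

The worst-case figure is simply the product of the counts, so each additional label multiplies the number of series Thanos may have to aggregate.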

Goal

Make it possible for these aggregations to be aggregated globally so that they can be queried in dashboards.

Edited by Bob Van Landuyt