Feature category & stage group metrics take a long time to evaluate in Thanos

Coming from gitlab-com/runbooks!4125 (comment 744157549). Thanos consistently can't evaluate the feature category & stage group aggregations within its interval. This is visible in these metrics.

The main cause of this is likely the high cardinality of the feature category aggregation in Prometheus, mostly from the puma (old) and rails_requests (new) components.

| label | values | count |
|---|---|---|
| `__name__` | `ops`, `error`, `apdex:weight`, `apdex:success` | 4 |
| `type` | `web`, `api`, `git` | 3 |
| `feature_category` | `source_code_management`, `project_management`, ... | 69 |
| `stage` | `main`, `cny` | 2 |
| `env` | `gprd`, `gstg`, `pre` | 3 |
| `region` | `us-east1-d`, ... | 4 |
| `cluster` | `gprd-gitlab-gke` | 9 |
| total (product of counts) | | 178848 |
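
The total in the last row is the product of the per-label value counts. A minimal Python sketch that reproduces it; the dictionary name is only illustrative, and the counts are copied from the table:

```python
from math import prod

# Number of distinct values per label, copied from the table above.
label_cardinality = {
    "__name__": 4,
    "type": 3,
    "feature_category": 69,
    "stage": 2,
    "env": 3,
    "region": 4,
    "cluster": 9,
}

# Worst case: every combination of label values exists as its own series.
worst_case_series = prod(label_cardinality.values())
print(worst_case_series)  # 178848
```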

Scratchpad with queries where I got the information

This is very much a worst-case scenario. In reality, the rails_requests components result in at most 1800 combinations (times 4 metrics), which is still high.
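
To put the two figures side by side, the realistic upper bound can be compared with the worst case from the table. A rough sketch, assuming the 1800 combinations apply equally to each of the 4 metrics; the variable names are made up for illustration:

```python
worst_case_series = 178848  # total from the table
combinations = 1800         # upper bound quoted for rails_requests
metrics = 4                 # ops, error, apdex:weight, apdex:success

# Realistic upper bound on the number of series.
realistic_series = combinations * metrics
share = realistic_series / worst_case_series

print(realistic_series)  # 7200
print(f"{share:.1%}")    # 4.0%
```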

Goal

Make it possible for these aggregations to be aggregated globally so that they can be queried in dashboards.

Edited Nov 26, 2021 by Bob Van Landuyt