Feature category & stage group metrics take a long time to evaluate in Thanos
Coming from gitlab-com/runbooks!4125 (comment 744157549). Thanos consistently can't evaluate the feature category & stage group aggregations within its interval. This is visible in these metrics.
The main cause of this is likely the high cardinality of the feature category aggregation in Prometheus, mostly from the puma (old) and rails_requests (new) components.
| label | values | count |
|---|---|---|
| `__name__` | ops, error, apdex:weight, apdex:success | 4 |
| type | web, api, git | 3 |
| feature_category | source_code_management, project_management, ... | 69 |
| stage | main, cny | 2 |
| env | gprd, gstg, pre | 3 |
| region | us-east1-d, ... | 4 |
| cluster | gprd-gitlab-gke | 9 |
| | total (product of all counts) | 178848 |
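
The total in the last row is the product of the per-label counts. A minimal sketch to check that arithmetic, using only the counts from the table (the dictionary and variable names are illustrative, not taken from the runbooks):

```python
from math import prod

# Distinct values per label, copied from the table above.
label_cardinality = {
    "__name__": 4,
    "type": 3,
    "feature_category": 69,
    "stage": 2,
    "env": 3,
    "region": 4,
    "cluster": 9,
}

# Worst case: every label value combines with every other label value.
worst_case_series = prod(label_cardinality.values())
print(worst_case_series)  # 178848
```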
Scratchpad with queries where I got the information
This is very much a worst-case scenario. In reality, the rails_requests components result in at most 1800 combinations (times 4 metrics), which is still high.
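
For a sense of scale, the observed upper bound can be set against the worst case. This is a rough sketch using only the figures quoted here, and it assumes the 1800 combinations exclude `__name__`, as "times 4 metrics" suggests (variable names are illustrative):

```python
# Figures quoted in this issue.
observed_combinations = 1800   # label combinations seen for rails_requests
metrics = 4                    # ops, error, apdex:weight, apdex:success
worst_case_series = 178848     # product of all label counts in the table

# Series that actually have to be aggregated, at most.
observed_series = observed_combinations * metrics
print(observed_series)  # 7200

# Share of the theoretical worst case.
print(f"{observed_series / worst_case_series:.1%}")  # 4.0%
```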
Goal
Make it possible for these aggregations to be evaluated globally so that they can be queried in dashboards.
Edited by Bob Van Landuyt