Update Prometheus rules using kubelet and kube-state-metrics metrics to be evaluated by Thanos instead
The partitioned Prometheus instances do not scrape the kubelet and kube-state-metrics endpoints (and some others), and rightly so, because doing so would result in duplicated metrics in Thanos. But because of this, rules that combine those metrics with metrics from the targeted ServiceMonitors (for example gitlab_component_saturation:ratio for the component kube_go_memory) fail to evaluate, resulting in empty dashboard panels and missed alerts.
A possible solution would be to move those rules to thanos-rule instead. One concern about this is the added load on the thanos-rule instances, which could possibly be solved by migrating them to Kubernetes with some autoscaling: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/10969
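
As a rough illustration of the kind of rule that would need to move, here is a minimal sketch of a rule group in the standard Prometheus rule file format, which thanos-rule also loads. The group name, record name, and expression are hypothetical and are not the actual definition of gitlab_component_saturation:ratio. The sketch also assumes that the Go process metrics scraped via ServiceMonitors and the kubelet/cAdvisor metrics share the namespace, pod, and container labels.

```yaml
# Hypothetical rule group for thanos-rule; names and expression are
# illustrative assumptions, not the real saturation rule.
groups:
  - name: example-kube-saturation
    interval: 1m
    rules:
      - record: example:kube_go_memory_saturation:ratio
        expr: |
          # Numerator: scraped by a partitioned Prometheus via a ServiceMonitor.
          max by (namespace, pod, container) (
            go_memstats_alloc_bytes
          )
          /
          # Denominator: scraped only by the Prometheus that targets the kubelet.
          # "> 0" drops containers that have no memory limit set.
          max by (namespace, pod, container) (
            container_spec_memory_limit_bytes > 0
          )
```

Evaluated on a single partitioned Prometheus, the denominator has no series and the division returns nothing. Evaluated by thanos-rule against the Thanos query layer, both sides should be visible, so the join can succeed.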
Edited by Pierre Guinoiseau