fix: add k8s labels back to kube-state-metrics
Background
When rolling out the prometheus helm chart upgrade in gprd we started
seeing some metrics disappear.
As part of the upgrade we are updating kube-state-metrics from
v1.9.7 to v2.2.0.
The metrics disappeared because we have a recording rule,
kube_ingress_labels:labeled, that depends on kube_ingress_labels having
kubernetes labels as part of the metric labels. As pointed out by Ahmad,
this was changed in
https://github.com/kubernetes/kube-state-metrics/pull/1125, which ended
up removing some labels from our metrics that we depend on, such as
label_stage and label_tier.
Solution
Define kube-state-metrics.metricLabelsAllowlist, where you specify each
resource that you want and which of its kubernetes labels to expose.
For example, if you define deployments=[INeedThisLabel] it will add
INeedThisLabel to the labels of the deployment metric. Using [*] means
it will add every label.
The full list of resources can be found in https://github.com/kubernetes/kube-state-metrics/blob/b730cb415234509e6a1425c79e826f2e7688d27b/internal/store/builder.go#L222-L252.
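A minimal sketch of what such a values override might look like, written to a scratch file so it can be inspected. The kube-state-metrics key and metricLabelsAllowlist come from the description above; the plural resource names and the use of [*] for every resource are assumptions for illustration, not necessarily the exact values in this change.

```shell
# Hypothetical values fragment: the resource names and the [*] wildcard
# below are assumptions, chosen to match the resources named in this change.
values_file="$(mktemp)"
cat > "$values_file" <<'EOF'
kube-state-metrics:
  metricLabelsAllowlist:
    - "pods=[*]"
    - "deployments=[*]"
    - "ingresses=[*]"
    - "nodes=[*]"
    - "horizontalpodautoscalers=[*]"
EOF

# Print the fragment for review.
cat "$values_file"
```

Listing specific label names in place of [*] would expose fewer labels per metric, at the cost of having to keep the list in sync with what the recording rules use.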
The list of resources was picked by looking at the usage of these
metrics inside of our runbooks. Using ripgrep we can grep for
kube_.*_labels, where .* stands for the resource. Once we have added the
resources that we want, we can filter them out to see if we missed
anything (run from the runbooks repository on master):
$ rg 'kube_.*_labels' | rg -v -e 'gitlab:kube_node_pool_label' -e 'pod' -e 'deployment' -e 'ingress' -e 'node' -e 'hpa'
Note that gitlab:kube_node_pool_label is a recording rule and not
something kube-state-metrics exposes.
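The filtering step can be tried on a few made-up lines to see what is left over. This sketch uses grep -E in place of rg so it runs without ripgrep installed; the file names and the kube_service_labels hit are invented for illustration and do not come from the runbooks.

```shell
# Made-up search hits; only the metric names matter here.
sample='rules/a.yml: kube_pod_labels{env="gprd"}
rules/b.yml: kube_service_labels{env="gprd"}
rules/c.yml: kube_ingress_labels{env="gprd"}
rules/d.yml: kube_hpa_labels{env="gprd"}'

# Keep every *_labels usage, then drop the resources already covered.
# Whatever remains is a resource that still needs an allowlist entry.
leftover="$(printf '%s\n' "$sample" \
  | grep -E 'kube_.*_labels' \
  | grep -v -e 'gitlab:kube_node_pool_label' -e 'pod' -e 'deployment' \
            -e 'ingress' -e 'node' -e 'hpa')"

echo "$leftover"
```

Because the -e patterns are plain substrings, a line is dropped if it mentions any of those words anywhere, so an empty result is a good sign rather than proof that nothing was missed.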
Testing
You can test these locally in a minikube cluster with helmfile -e minikube apply.
Find the IP of kube-state-metrics with kubectl -n monitoring get svc gitlab-monitoring-kube-state-metrics, then run the following curl
requests and make sure the label_* labels are present.
$ curl -s 10.101.180.95:8080/metrics | grep 'kube_ingress_labels'
$ curl -s 10.101.180.95:8080/metrics | grep 'kube_node_labels'
$ curl -s 10.101.180.95:8080/metrics | grep 'kube_pod_labels'
$ curl -s 10.101.180.95:8080/metrics | grep 'kube_deployment_labels'
$ curl -s 10.101.180.95:8080/metrics | grep 'kube_horizontalpodautoscaler_labels'
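The eyeball check above could also be scripted. The following is a sketch only: has_k8s_labels is a hypothetical helper, not part of this change, and the two sample lines are made up to stand in for real /metrics output.

```shell
# Succeeds when stdin (Prometheus text format) contains at least one sample
# of the given metric that carries a label_* label.
has_k8s_labels() {
  grep -E "^$1\{" | grep -Eq '[{,]label_[A-Za-z0-9_]+='
}

# Made-up sample lines standing in for the /metrics output.
with_labels='kube_pod_labels{namespace="default",pod="web-0",label_tier="sv"} 1'
without_labels='kube_pod_labels{namespace="default",pod="web-0"} 1'

if printf '%s\n' "$with_labels" | has_k8s_labels kube_pod_labels; then
  echo "with allowlist: label_* present"
fi
if ! printf '%s\n' "$without_labels" | has_k8s_labels kube_pod_labels; then
  echo "without allowlist: label_* missing"
fi
```

Against the minikube service it would be used as curl -s 10.101.180.95:8080/metrics | has_k8s_labels kube_ingress_labels, repeated for each of the five metrics.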
Thanos links to check on pre when this is deployed:
- ingress: https://thanos.gitlab.net/graph?g0.expr=kube_ingress_labels%7Benv%3D%22pre%22%7D&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- pod: https://thanos.gitlab.net/graph?g0.expr=kube_pod_labels%7Benv%3D%22pre%22%7D&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- hpa: https://thanos.gitlab.net/graph?g0.expr=kube_horizontalpodautoscaler_labels%7Benv%3D%22pre%22%7D&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- deployment: https://thanos.gitlab.net/graph?g0.expr=kube_deployment_labels%7Benv%3D%22pre%22%7D&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- node: https://thanos.gitlab.net/graph?g0.expr=kube_node_labels%7Benv%3D%22pre%22%7D&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
Another dashboard that we should look at:
- We should see this metric coming back: https://dashboards.gitlab.net/d/api-main/api-overview?viewPanel=2358192786&orgId=1&var-PROMETHEUS_DS=Global&var-environment=pre&var-stage=main https://thanos.gitlab.net/graph?g0.expr=avg_over_time(gitlab_component_ops%3Arate_5m%7Bcomponent%3D%22nginx_ingress%22%2Cenv%3D%22pre%22%2Cenvironment%3D%22pre%22%2Cmonitor%3D%22global%22%2Cstage%3D%22main%22%2Ctype%3D%22api%22%7D%5B1m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=8w&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D&g0.end_input=2021-10-21%2008%3A50%3A29&g0.moment_input=2021-10-21%2008%3A50%3A29
To compare, you can update the query to {env="gprd"} and check the
result against what we have in gprd (known good).
reference https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13973