fix: add k8s labels back to kube-state-metrics
Background
When rolling out the Prometheus helm chart upgrade in gprd we started
seeing some metrics disappear. As part of the upgrade we are updating
kube-state-metrics from v1.9.7 to v2.2.0.
The metrics disappeared because we have a recording rule,
kube_ingress_labels:labeled, that depends on kube_ingress_labels carrying
the Kubernetes labels as metric labels. As pointed out by Ahmad, this
behaviour was changed in
https://github.com/kubernetes/kube-state-metrics/pull/1125, which removed
some labels that we depend on from our metrics, such as label_stage and
label_tier.
Solution
Define kube-state-metrics.metricLabelsAllowlist, where you specify each
resource you want and which of its Kubernetes labels to expose. For
example, deployments=[INeedThisLabel] adds INeedThisLabel to the metric's
labels. Using [*] adds every label.
The full list of resources can be found in https://github.com/kubernetes/kube-state-metrics/blob/b730cb415234509e6a1425c79e826f2e7688d27b/internal/store/builder.go#L222-L252.
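As a rough illustration of the value's shape, the sketch below builds a comma-separated allowlist that exposes every label for a handful of resources. Only deployments=[...] comes from the text above; the other plural resource names, and the assumption that the chart value ends up as the kube-state-metrics --metric-labels-allowlist flag, should be checked against the linked builder.go and the chart itself.

```shell
# Resources whose kube_<resource>_labels metrics we rely on.
# Assumption: these are the plural names kube-state-metrics expects.
resources="deployments ingresses nodes pods horizontalpodautoscalers"

# Join them as "<resource>=[*]" entries separated by commas.
allowlist=""
for r in $resources; do
  allowlist="${allowlist:+$allowlist,}$r=[*]"
done

# Assumption: the chart passes the value through to this flag.
echo "--metric-labels-allowlist=$allowlist"
```

Replacing [*] with an explicit list such as [stage,tier] would keep the number of exposed labels smaller, at the cost of having to update the list whenever a rule starts depending on a new label.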
The list of resources was picked by looking at the usage of these
metrics inside of our runbooks. Using ripgrep we can grep for
kube_.*_labels, where .* stands for the resource. Once we have added the
resources that we want, we can filter them out to see if we missed
anything (run from the runbooks repository, master branch):
$ rg 'kube_.*_labels' | rg -v -e 'gitlab:kube_node_pool_label' -e 'pod' -e 'deployment' -e 'ingress' -e 'node' -e 'hpa'
Note that gitlab:kube_node_pool_label is a recording rule and not
something kube-state-metrics exposes.
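The filtering step can be tried on a small sample to see how it behaves. The sketch below uses plain grep instead of ripgrep so it runs anywhere, and the list of metric names is hypothetical rather than taken from the runbooks.

```shell
# Hypothetical sample of names a search for kube_.*_labels might turn up.
found='kube_pod_labels
kube_deployment_labels
kube_namespace_labels
gitlab:kube_node_pool_label'

# Drop everything already covered by the allowlist (and the recording rule).
# Whatever is left still needs an allowlist entry.
missing=$(printf '%s\n' "$found" | grep -v \
  -e 'gitlab:kube_node_pool_label' -e 'pod' -e 'deployment' \
  -e 'ingress' -e 'node' -e 'hpa')

echo "$missing"
```

The patterns are plain substrings, so a short one like pod would also hide any unrelated line that happens to contain it. An empty result is therefore worth a quick manual check.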
Testing
You can test this locally in a minikube cluster:
$ helmfile -e minikube apply
Find the IP of kube-state-metrics with:
$ kubectl -n monitoring get svc gitlab-monitoring-kube-state-metrics
Then run the following curl requests and make sure the label_* labels
are present.
$ curl -s 10.101.180.95:8080/metrics | grep 'kube_ingress_labels'
$ curl -s 10.101.180.95:8080/metrics | grep 'kube_node_labels'
$ curl -s 10.101.180.95:8080/metrics | grep 'kube_pod_labels'
$ curl -s 10.101.180.95:8080/metrics | grep 'kube_deployment_labels'
$ curl -s 10.101.180.95:8080/metrics | grep 'kube_horizontalpodautoscaler_labels'
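The same checks can be scripted so that a missing label shows up as a failure rather than something to spot by eye. This is a minimal sketch: check_labels is a hypothetical helper, and it only looks for the substring label_ on a sample line of the given metric.

```shell
# check_labels METRIC: read Prometheus exposition text on stdin and succeed
# only if at least one sample of METRIC contains "label_".
check_labels() {
  grep "^$1{" | grep -q 'label_'
}

# Usage against a live endpoint (the address is the example one used above):
# for m in kube_ingress_labels kube_node_labels kube_pod_labels \
#          kube_deployment_labels kube_horizontalpodautoscaler_labels; do
#   if curl -s 10.101.180.95:8080/metrics | check_labels "$m"; then
#     echo "ok: $m"
#   else
#     echo "MISSING label_*: $m"
#   fi
# done
```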
Thanos links to check on pre when this is deployed:
- ingress: https://thanos.gitlab.net/graph?g0.expr=kube_ingress_labels%7Benv%3D%22pre%22%7D&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- pod: https://thanos.gitlab.net/graph?g0.expr=kube_pod_labels%7Benv%3D%22pre%22%7D&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- hpa: https://thanos.gitlab.net/graph?g0.expr=kube_horizontalpodautoscaler_labels%7Benv%3D%22pre%22%7D&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- deployment: https://thanos.gitlab.net/graph?g0.expr=kube_deployment_labels%7Benv%3D%22pre%22%7D&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- node: https://thanos.gitlab.net/graph?g0.expr=kube_node_labels%7Benv%3D%22pre%22%7D&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
Another dashboard that we should look at:
- We should see this metric coming back: https://dashboards.gitlab.net/d/api-main/api-overview?viewPanel=2358192786&orgId=1&var-PROMETHEUS_DS=Global&var-environment=pre&var-stage=main https://thanos.gitlab.net/graph?g0.expr=avg_over_time(gitlab_component_ops%3Arate_5m%7Bcomponent%3D%22nginx_ingress%22%2Cenv%3D%22pre%22%2Cenvironment%3D%22pre%22%2Cmonitor%3D%22global%22%2Cstage%3D%22main%22%2Ctype%3D%22api%22%7D%5B1m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=8w&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D&g0.end_input=2021-10-21%2008%3A50%3A29&g0.moment_input=2021-10-21%2008%3A50%3A29
To compare, you can update the query to {env="gprd"} to see what we
have in gprd (known good).
Reference: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13973