All Kubernetes resources should have labels matching our Prometheus label taxonomy
- Introduction: what is a label taxonomy: https://gitlab.com/gitlab-com/runbooks/-/tree/master/libsonnet/label-taxonomy
- Related to gitlab-com/runbooks!4335 (merged)
At present on GitLab.com, we loosely follow a convention whereby Kubernetes resources (e.g. pods, deployments, ingresses, services) have metadata labels matching our Prometheus label taxonomy: type (service identifier), tier, stage (cny, main, blue, green, etc.) and shard (marquee, hdd, urgent-cpu-bound, etc.).
Nothing enforces this convention, so labels have drifted over time, which leads to exceptions and complexity in monitoring resources.
Another side effect is that we have to configure kube-state-metrics to export all resource labels (*). This is expensive, and kube-state-metrics specifically warns against doing it. If we moved to a well-defined set of labels, we could export only those labels instead of *.
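As a rough sketch of what a narrower export could look like: kube-state-metrics takes a --metric-labels-allowlist flag whose value maps plural resource names to the label keys to export. The helper name and the choice of resources below are hypothetical, not taken from our current configuration.

```python
# The four taxonomy labels named in this issue.
TAXONOMY_LABELS = ("type", "tier", "stage", "shard")


def metric_labels_allowlist(resources, labels=TAXONOMY_LABELS):
    """Build a value for the kube-state-metrics --metric-labels-allowlist flag.

    `resources` are plural Kubernetes resource names, e.g. "pods".
    """
    label_list = ",".join(labels)
    return ",".join(f"{resource}=[{label_list}]" for resource in resources)


# Hypothetical resource selection, for illustration only.
flag = "--metric-labels-allowlist=" + metric_labels_allowlist(["pods", "deployments"])
print(flag)
```

With a fixed set like this, each listed resource exports only the taxonomy labels rather than every label it carries.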
Proposal
- Add CI jobs to prevent incorrect configuration being deployed.
- Add Prometheus alerts (straight to issues) to alert when label requirements are not correctly configured.
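The CI check could be sketched roughly as follows, assuming the manifests have already been rendered and parsed into dictionaries. The function names are hypothetical, and the required set (type, tier, stage) is an assumption based on the note below that those three are mandatory.

```python
# Assumption: type, tier and stage are mandatory; shard is treated as optional.
REQUIRED_LABELS = ("type", "tier", "stage")


def missing_labels(resource, required=REQUIRED_LABELS):
    """Return the required label keys that are absent or empty on a resource."""
    labels = (resource.get("metadata") or {}).get("labels") or {}
    return [key for key in required if not labels.get(key)]


def check_resources(resources):
    """Return one error message per resource that violates the taxonomy."""
    errors = []
    for resource in resources:
        missing = missing_labels(resource)
        if missing:
            kind = resource.get("kind", "<unknown kind>")
            name = (resource.get("metadata") or {}).get("name", "<unnamed>")
            errors.append(f"{kind}/{name}: missing labels {', '.join(missing)}")
    return errors


# Hypothetical manifest, for illustration only.
example = {
    "kind": "Deployment",
    "metadata": {"name": "example", "labels": {"type": "api", "tier": "sv"}},
}
print(check_resources([example]))
```

A CI job would fail the pipeline whenever check_resources returns a non-empty list.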
Examples of incorrect labelling at present
- api node pools do not have stage labels (see gitlab-com/runbooks!4335 (diffs, comment 839851398))
- git node pools have the incorrect type label (should be git). The current type label should be the shard label. Currently git node pools do not publish a stage label. See gitlab-com/runbooks!4335 (diffs, comment 839895611)
- default and highmem node pools should have stage=main (stage, type and tier are mandatory)
- logging pods (deployed via a daemonset) do not have a stage label. Should be stage=main
- monitoring services (thanos, prometheus, alertmanager, etc.) do not have labels. Should have type, stage, tier, etc. See gitlab-com/runbooks!4335 (diffs, comment 839923617)
- nginx pods, ingress, deployments do not have correct type or stage labels. See gitlab-com/runbooks!4335 (diffs, comment 839926299)
- registry nodes do not have a stage label: gitlab-com/runbooks!4335 (diffs, comment 839951839)
- sidekiq nodes have an incorrect type label consisting of the shard. type should be sidekiq, and the existing type label should be moved to the shard label. See gitlab-com/runbooks!4335 (diffs, comment 839959454)
- web-page nodes should have a stage label: gitlab-com/runbooks!4335 (diffs, comment 840018931)
- websocket nodes should have a stage label: gitlab-com/runbooks!4335 (diffs, comment 840027408)
- woodhouse resources should be correctly labelled: gitlab-com/runbooks!4335 (diffs, comment 840029422)
Side note: some services don't use stages (e.g. sidekiq), so why is it important to label them? Having an absent field, instead of a default value for that field, makes everything more complicated. For example, when aggregating health across multiple services, how do we handle absent values? Another example is our dashboards, which automatically include the stage selector in a query. If the stage label is missing from the source data, the graph will not render correctly. Using a default value keeps things simple compared to missing values. On the Prometheus side, we follow this policy too: all services have at least a main stage, even if they have no others.
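To illustrate the point about absent values, here is a minimal sketch that counts label sets per stage with and without a default. The sample label sets and function names are hypothetical, and the grouping only approximates how an aggregation by stage treats a missing label.

```python
from collections import defaultdict

DEFAULT_STAGE = "main"


def with_default_stage(labels):
    """Return a copy of the labels with stage filled in when absent or empty."""
    normalized = dict(labels)
    if not normalized.get("stage"):
        normalized["stage"] = DEFAULT_STAGE
    return normalized


def count_by_stage(series):
    """Count label sets per stage; a missing stage falls into the "" group."""
    counts = defaultdict(int)
    for labels in series:
        counts[labels.get("stage", "")] += 1
    return dict(counts)


# Hypothetical label sets: one with a stage, one without.
series = [{"type": "web", "stage": "main"}, {"type": "sidekiq"}]

print(count_by_stage(series))
print(count_by_stage([with_default_stage(labels) for labels in series]))
```

Without the default, the unlabelled series ends up in a separate empty-stage group that a stage=main selector would not match. With the default, both series fall under main.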