Deciding on environment/tier/type/stage/shard labels for our Kubernetes infrastructure
Currently our k8s metrics do not include our standard environment/tier/type/stage/shard labelling taxonomy.
Additionally, this infrastructure is not represented in the metrics catalog.
This results in several shortcomings. For example, our service-level metrics and saturation monitoring tools do not monitor the k8s infrastructure, because they rely on strong labelling.
See gitlab-com/runbooks!2242 (merged) for more details.
Proposal
- All Kubernetes infrastructure adheres to our environment, type, tier, shard and stage label taxonomy. I propose: type=kube,tier=inf,shard=main,stage=main|cny.
- However, since our kube infrastructure is running other infrastructure, we should build a parallel taxonomy for describing the workloads being executed. As a starting point, I propose kube_workload_type, kube_workload_shard and kube_workload_stage. For example: kube_workload_type=sidekiq,kube_workload_stage=main,kube_workload_shard=urgent-other.
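To make the two taxonomies concrete, here is a minimal sketch of the label set a single series would carry under this proposal. The helper names (`series_labels`, `format_selector`) and the environment value `gprd` are hypothetical and used only for illustration; the label names and the other values come from the proposal above.

```python
# Infrastructure taxonomy proposed above; fixed for every Kubernetes series.
INFRA_LABELS = {"type": "kube", "tier": "inf", "shard": "main"}
ALLOWED_STAGES = ("main", "cny")


def series_labels(environment, infra_stage, workload_type, workload_stage, workload_shard):
    """Combine the infrastructure labels with the parallel workload labels."""
    if infra_stage not in ALLOWED_STAGES:
        raise ValueError(f"stage must be one of {ALLOWED_STAGES}, got {infra_stage!r}")
    labels = {"environment": environment, "stage": infra_stage, **INFRA_LABELS}
    labels.update(
        kube_workload_type=workload_type,
        kube_workload_stage=workload_stage,
        kube_workload_shard=workload_shard,
    )
    return labels


def format_selector(labels):
    """Render labels as a Prometheus-style selector, sorted by label name."""
    return "{" + ",".join(f'{k}="{v}"' for k, v in sorted(labels.items())) + "}"


# "gprd" is an assumed environment name, not taken from this issue.
labels = series_labels("gprd", "main", "sidekiq", "main", "urgent-other")
print(format_selector(labels))
```

The infrastructure labels stay constant across workloads, while the kube_workload_* labels vary per workload, so existing tooling that keys on type/tier/shard/stage keeps working unchanged.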
Implementation
How do we implement this taxonomy?
At present, we're using some pretty funky regexp replacements on the pod name for labelling. This will not scale and will quickly accumulate as technical debt, or "Regular Expressions are a Smell": https://www.robustperception.io/regex-selectors-are-a-smell
I propose an approach based on k8s meta-data labelling:
- name: monitored_pods
  namespaceSelector:
    any: true
  selector:
    matchLabels:
      monitored: "true" # Global config: no need to repeat this per service
  podMetricsEndpoints:
    - port: metrics
      relabelings:
        # Labels set on the pod are exposed as __meta_kubernetes_pod_label_<name>
        - sourceLabels: ["__meta_kubernetes_pod_label_shard"]
          regex: ".*"
          replacement: '$0'
          targetLabel: kube_workload_shard
        - sourceLabels: ["__meta_kubernetes_pod_label_stage"]
          regex: ".*"
          replacement: '$0'
          targetLabel: kube_workload_stage
        - sourceLabels: ["__meta_kubernetes_pod_label_type"]
          regex: ".*"
          replacement: '$0'
          targetLabel: kube_workload_type
Then, in the pod definition, we include the following metadata labels, which will automatically be mapped to the kube_workload_* labels by the relabelling rules above:
- metadata:
    labels:
      type: sidekiq
      stage: main
      shard: urgent-other
      monitored: "true" # allows matching for monitored_pods above; label values must be strings
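The end-to-end effect can be sketched with a small simulation of the relabelling step. This is an illustration, not Prometheus code: `pod_meta_labels` and `relabel` are hypothetical helpers, and the sketch assumes the labels are read from the pod's own metadata, which Prometheus service discovery exposes as `__meta_kubernetes_pod_label_<name>`. It models three relabelling behaviours: the regex is anchored to the whole value, an empty result leaves the target label unset, and labels starting with `__` are dropped afterwards.

```python
import re

# (source label, regex, replacement, target label), mirroring the three rules above.
RULES = [
    ("__meta_kubernetes_pod_label_shard", ".*", "$0", "kube_workload_shard"),
    ("__meta_kubernetes_pod_label_stage", ".*", "$0", "kube_workload_stage"),
    ("__meta_kubernetes_pod_label_type", ".*", "$0", "kube_workload_type"),
]


def pod_meta_labels(pod_labels):
    """Expose pod labels under the service-discovery meta-label prefix (assumed)."""
    return {f"__meta_kubernetes_pod_label_{k}": v for k, v in pod_labels.items()}


def relabel(labels, rules):
    """Apply 'replace' relabelling rules and drop internal '__' labels."""
    out = dict(labels)
    for source, regex, replacement, target in rules:
        value = out.get(source, "")  # a missing source label reads as empty
        match = re.fullmatch(regex, value)  # relabel regexes are fully anchored
        if match is None:
            continue
        new_value = replacement.replace("$0", match.group(0))
        if new_value:
            out[target] = new_value
        else:
            out.pop(target, None)  # an empty value means the label is not set
    return {k: v for k, v in out.items() if not k.startswith("__")}


pod_labels = {
    "type": "sidekiq",
    "stage": "main",
    "shard": "urgent-other",
    "monitored": "true",
}
print(relabel(pod_meta_labels(pod_labels), RULES))
```

Because the rules copy values verbatim, adding a new workload only requires setting the three labels on its pods; no per-service regular expression on the pod name is needed.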