2021-10-20: Upgrade Prometheus Operator chart v19.1.0

Production Change

Change Summary

For gprd, gprd-us-east1-b, gprd-us-east1-c, and gprd-us-east1-d, update the following:

Upgrade the Custom Resource Definitions for the Prometheus stack following https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack#from-17x-to-18x, but using commit 0a38647379a5e93f639bf8e634deabcc32e01fb6 instead, since we depend on a commit on master.
Upgrade the helm chart version to 19.1.0 (from 10.3.5)
Upgrade the Prometheus operator to 0.51.2:0a38647379a5e93f639bf8e634deabcc32e01fb6 (from 0.42.1), which includes a fix that we require for alertmanager (read the merge request description)

This was tested in pre in #5731 (closed) and in ops and gstg in #5731 (closed)

Change Details

Services Impacted - Service::Prometheus
Change Technician - @steveazz
Change Reviewer - @mwasilewski-gitlab
Time tracking - 30
Downtime Component - none

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1

Make sure gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!505 (merged) is updated
Set up kubectl to access the clusters: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/kube/k8s-oncall-setup.md#accessing-clusters-via-console-servers
Set label change::in-progress on this issue
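
Because the same commands are run against four clusters, it can help to confirm the active kubectl context before each step. The following is a minimal sketch and not part of the original procedure: the helper names `context_matches` and `require_cluster` are hypothetical, and it assumes the context name ends with the cluster name, as GKE-generated `gke_<project>_<location>_<cluster>` contexts do.

```shell
#!/bin/sh
# Hypothetical helpers: guard against running a step on the wrong cluster.

# context_matches CONTEXT CLUSTER
# Succeeds when CONTEXT ends with CLUSTER (assumed naming convention).
context_matches() {
  case "$1" in
    *"$2") return 0 ;;
    *) return 1 ;;
  esac
}

# require_cluster CLUSTER
# Reads the active context from kubectl and compares it with CLUSTER.
require_cluster() {
  current="$(kubectl config current-context)"
  if context_matches "$current" "$1"; then
    echo "kubectl context OK: ${current}"
  else
    echo "kubectl context is '${current}', expected cluster '$1'" >&2
    return 1
  fi
}
```

For example, running `require_cluster gprd-us-east1-b` before applying the CRDs to that cluster would stop early if kubectl still points elsewhere.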

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 20

Upgrade CRDs

Upgrade gprd

Set up kubectl to talk to the gprd-gitlab-gke cluster

Apply CRD config

kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagerconfigs.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagers.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_podmonitors.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_probes.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_prometheuses.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_prometheusrules.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_thanosrulers.yaml

Upgrade gprd-us-east1-b

Set up kubectl to talk to the gprd-us-east1-b cluster

Apply CRD config

kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagerconfigs.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagers.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_podmonitors.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_probes.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_prometheuses.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_prometheusrules.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_thanosrulers.yaml

Upgrade gprd-us-east1-c

Set up kubectl to talk to the gprd-us-east1-c cluster

Apply CRD config

kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagerconfigs.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagers.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_podmonitors.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_probes.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_prometheuses.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_prometheusrules.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_thanosrulers.yaml

Upgrade gprd-us-east1-d

Set up kubectl to talk to the gprd-us-east1-d cluster

Apply CRD config

kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagerconfigs.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagers.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_podmonitors.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_probes.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_prometheuses.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_prometheusrules.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_thanosrulers.yaml
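
The eight `kubectl apply` commands are identical for every cluster, so they could also be generated from the pinned commit and the list of CRD names. This is a sketch under assumptions rather than the procedure that was executed: `apply_crds` and the `DRY_RUN` variable are hypothetical names, and the function acts on whichever cluster the current kubectl context points at.

```shell
#!/bin/sh
# Sketch: apply the eight CRD manifests from the pinned operator commit.

COMMIT="0a38647379a5e93f639bf8e634deabcc32e01fb6"
BASE_URL="https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/${COMMIT}/example/prometheus-operator-crd"
CRDS="alertmanagerconfigs alertmanagers podmonitors probes prometheuses prometheusrules servicemonitors thanosrulers"

# apply_crds: with DRY_RUN=1 (the default here) only print the commands;
# with DRY_RUN=0 run them against the current kubectl context.
apply_crds() {
  for crd in $CRDS; do
    url="${BASE_URL}/monitoring.coreos.com_${crd}.yaml"
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "kubectl apply -f ${url}"
    else
      kubectl apply -f "${url}" || return 1
    fi
  done
}
```

After sourcing the file, `apply_crds` prints the commands for review, and `DRY_RUN=0 apply_crds` would run them. Defaulting to a dry run is a deliberate choice for a production change, since it makes the destructive path opt-in.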

Re-run the ops pipeline for gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!505 (merged) and make sure the gprd-* jobs are now green and that it reports the same changes we saw in https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/jobs/5197138 (from gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!504 (merged))
Merge gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!505 (merged)
Manually run the apply jobs from https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/pipelines/849919
- Manually run gprd-us-east1-b, then go through the post-change steps
- Manually run gprd-us-east1-c, then go through the post-change steps
- Manually run gprd-us-east1-d, then go through the post-change steps
- Manually run gprd, then go through the post-change steps

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 10

Verify that all pods, apart from thanos, memcached, and gitaly-exporter, have restarted: kubectl -n monitoring get po
Verify that the new operator version is running: kubectl -n monitoring get po gitlab-monitoring-promethe-operator-7dc8f7b879-4dk88 -o json | jq '.spec.containers[0].image' (expected value: "ghcr.io/prometheus-operator/prometheus-operator:master@sha256:bb79240165868c7d73d3db2b45bd065bf2b3050729aa4809f6de79cace232feb")
Check the operator logs for error-level entries: kubectl -n monitoring logs gitlab-monitoring-promethe-operator-7dc8f7b879-4dk88 --since=5m. If there are a lot of logs, filter for errors: kubectl -n monitoring logs gitlab-monitoring-promethe-operator-7dc8f7b879-4dk88 --since=5m | grep 'err'
Verify that service discovery is working: curl -s -L $(kubectl -n monitoring get svc prometheus-headless -o json | jq '.metadata.annotations["external-dns.alpha.kubernetes.io/hostname"]' -r):9090/metrics | grep 'scrape_pool_targets'
Check that the ingress is working as expected: https://console.cloud.google.com/kubernetes/ingresses?project=gitlab-production&pageState=(%22savedViews%22:(%22i%22:%226c0e9c818063462585995d31405639f5%22,%22c%22:%5B%5D,%22n%22:%5B%5D),%22ingress_list_table%22:(%22f%22:%22%255B%255D%22)) (investigate if any backends report unhealthy)
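
The operator pod name in the steps above ends in a ReplicaSet hash and a random suffix, both of which change when the Deployment rolls out, so the exact name will likely differ per cluster and after the upgrade. A possible way to avoid hard-coding it is sketched below. The function names are hypothetical, and the sketch assumes that `promethe-operator` matches only the operator pod in the monitoring namespace.

```shell
#!/bin/sh
# Sketch: look up the operator pod by name pattern, then check image and logs.

NAMESPACE="monitoring"
EXPECTED_IMAGE="ghcr.io/prometheus-operator/prometheus-operator:master@sha256:bb79240165868c7d73d3db2b45bd065bf2b3050729aa4809f6de79cace232feb"

# find_operator_pod: reads `kubectl get po -o name` output on stdin and
# prints the first entry whose name contains "promethe-operator".
find_operator_pod() {
  grep 'promethe-operator' | head -n 1
}

# image_matches IMAGE: succeeds when IMAGE equals the expected image.
image_matches() {
  [ "$1" = "$EXPECTED_IMAGE" ]
}

# verify_operator: resolves the pod, compares its first container image
# with EXPECTED_IMAGE, then prints recent log lines containing "err".
verify_operator() {
  pod="$(kubectl -n "$NAMESPACE" get po -o name | find_operator_pod)"
  if [ -z "$pod" ]; then
    echo "no operator pod found in ${NAMESPACE}" >&2
    return 1
  fi
  image="$(kubectl -n "$NAMESPACE" get "$pod" -o jsonpath='{.spec.containers[0].image}')"
  if image_matches "$image"; then
    echo "OK: ${pod} runs ${image}"
  else
    echo "MISMATCH: ${pod} runs ${image}" >&2
    return 1
  fi
  # Same filter as the manual step; grep exits non-zero when nothing matches.
  kubectl -n "$NAMESPACE" logs "$pod" --since=5m | grep 'err' || echo "no 'err' lines in the last 5m"
}
```

Calling `verify_operator` once per cluster, after switching the kubectl context, would cover the image and log checks. The pod restart, service discovery, and ingress checks remain manual.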

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 5

Revert gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!505 (merged)

Monitoring

Key metrics to observe

Metric: Operator build info
- Location: https://thanos.gitlab.net/graph?g0.expr=prometheus_operator_build_info&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- What changes to this metric should prompt a rollback: The new environments do not appear (this is a new metric)
Metric: Apdex
- Location: https://dashboards.gitlab.net/d/monitoring-main/monitoring-overview?viewPanel=712482646&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main&from=1634095320000&to=1634116979999
- What changes to this metric should prompt a rollback: A dip in apdex
Metric: Alert sender SLI
- Location: https://dashboards.gitlab.net/d/monitoring-main/monitoring-overview?viewPanel=3098809023&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
- What changes to this metric should prompt a rollback: A spike in apdex scope
Logs: Error logs
- Location: https://log.gprd.gitlab.net/goto/d97c71d4c18a34a4cbcb18eb0ee238d7
- What changes to this metric should prompt a rollback: A spike in error rates

Summary of infrastructure changes

~~Does this change introduce new compute instances?~~
~~Does this change re-size any existing compute instances?~~
~~Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?~~

Changes checklist

Edited Oct 20, 2021 by Steve Xuereb