2021-10-12: Upgrade Prometheus Operator in pre
Production Change
Change Summary
With gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!488 (merged) we need to upgrade the Prometheus Operator CRDs to v0.50.0. These need to be handled by hand for the same reason as #3223 (closed). https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack#from-17x-to-18x also states that Helm does not automatically upgrade or install new CRDs on a chart upgrade, so the CRDs have to be installed manually before upgrading.
A follow-up change management issue will be created to upgrade the rest of the clusters.
Change Details
- Services Impacted - Service::Prometheus
- Change Technician - @steveazz
- Change Reviewer - @mwasilewski-gitlab
- Time tracking - 15 minutes
- Downtime Component - none
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 1
- Set label change::in-progress on this issue
- Configure `kubectl` to connect to the cluster in `gitlab-pre` (a quick context check follows this list):

  ```shell
  gcloud container clusters get-credentials pre-gitlab-gke --region us-east1 --project gitlab-pre
  ```
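As a sanity check (not part of the documented steps), it can help to confirm the active context before touching any CRDs; a minimal sketch:

```shell
# Confirm kubectl is pointing at the pre cluster before changing anything.
kubectl config current-context   # expected to reference pre-gitlab-gke
kubectl get nodes                # should list nodes from the pre cluster
```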
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 5
- Upgrade the operator CRDs inside of a shell with `kubectl` configured:

  ```shell
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.50.0/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagerconfigs.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.50.0/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagers.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.50.0/example/prometheus-operator-crd/monitoring.coreos.com_podmonitors.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.50.0/example/prometheus-operator-crd/monitoring.coreos.com_probes.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.50.0/example/prometheus-operator-crd/monitoring.coreos.com_prometheuses.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.50.0/example/prometheus-operator-crd/monitoring.coreos.com_prometheusrules.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.50.0/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.50.0/example/prometheus-operator-crd/monitoring.coreos.com_thanosrulers.yaml
  ```
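For convenience, the eight apply commands above can also be run as a loop; this is just a sketch over the same v0.50.0 URLs, not a different procedure:

```shell
# Apply all eight Prometheus Operator CRD manifests for v0.50.0.
BASE="https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.50.0/example/prometheus-operator-crd"
for crd in alertmanagerconfigs alertmanagers podmonitors probes prometheuses prometheusrules servicemonitors thanosrulers; do
  kubectl apply -f "${BASE}/monitoring.coreos.com_${crd}.yaml"
done
```

If `kubectl apply` rejects one of the larger CRDs with an annotation-size error (the `prometheuses` CRD is the usual suspect in some operator releases), `kubectl replace -f` against the same URL is a common workaround; that is not part of the documented steps here, so treat it as a fallback assumption.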
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5
- Check that each CRD was updated, verifying that the `Time` values reflect the current timestamp:

  ```shell
  kubectl get crd/alertmanagerconfigs.monitoring.coreos.com -o yaml | grep 'Time'
  kubectl get crd/alertmanagers.monitoring.coreos.com -o yaml | grep 'Time'
  kubectl get crd/podmonitors.monitoring.coreos.com -o yaml | grep 'Time'
  kubectl get crd/probes.monitoring.coreos.com -o yaml | grep 'Time'
  kubectl get crd/prometheuses.monitoring.coreos.com -o yaml | grep 'Time'
  kubectl get crd/prometheusrules.monitoring.coreos.com -o yaml | grep 'Time'
  kubectl get crd/servicemonitors.monitoring.coreos.com -o yaml | grep 'Time'
  kubectl get crd/thanosrulers.monitoring.coreos.com -o yaml | grep 'Time'
  ```
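As a stricter optional check (an assumption, not part of the documented steps): after a successful apply the live CRDs should match the v0.50.0 manifests, so `kubectl diff` should print nothing and exit 0 for each file:

```shell
# kubectl diff exits 0 when the live object already matches the manifest.
BASE="https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.50.0/example/prometheus-operator-crd"
for crd in alertmanagerconfigs alertmanagers podmonitors probes prometheuses prometheusrules servicemonitors thanosrulers; do
  kubectl diff -f "${BASE}/monitoring.coreos.com_${crd}.yaml" && echo "OK: ${crd}"
done
```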
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 5
- Roll back to the previous operator CRDs:

  ```shell
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/release-0.42/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagerconfigs.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/release-0.42/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagers.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/release-0.42/example/prometheus-operator-crd/monitoring.coreos.com_podmonitors.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/release-0.42/example/prometheus-operator-crd/monitoring.coreos.com_probes.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/release-0.42/example/prometheus-operator-crd/monitoring.coreos.com_prometheuses.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/release-0.42/example/prometheus-operator-crd/monitoring.coreos.com_prometheusrules.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/release-0.42/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/release-0.42/example/prometheus-operator-crd/monitoring.coreos.com_thanosrulers.yaml
  ```
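The same loop form works for the rollback, pinned to the `release-0.42` branch instead of the v0.50.0 tag; a convenience sketch mirroring the commands above:

```shell
# Re-apply the CRD manifests from the release-0.42 branch to roll back.
BASE="https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/release-0.42/example/prometheus-operator-crd"
for crd in alertmanagerconfigs alertmanagers podmonitors probes prometheuses prometheusrules servicemonitors thanosrulers; do
  kubectl apply -f "${BASE}/monitoring.coreos.com_${crd}.yaml"
done
```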
Monitoring
Key metrics to observe
- Metric: the gitlab-monitoring-promethe-operator deployment
- Location: https://console.cloud.google.com/kubernetes/deployment/us-east1/pre-gitlab-gke/monitoring/gitlab-monitoring-promethe-operator/overview?project=gitlab-pre&pageState=(%22savedViews%22:(%22i%22:%22a41cc54e84014ee2a23c9bdd66e7383a%22,%22c%22:%5B%5D,%22n%22:%5B%5D))
- What changes to this metric should prompt a rollback: any problem that causes pods to crash.
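The same deployment can also be watched from the shell that is already pointed at the pre cluster; the namespace and deployment name below are taken from the console URL above:

```shell
# Watch the operator deployment and its pods after the CRD update.
kubectl -n monitoring rollout status deployment/gitlab-monitoring-promethe-operator
kubectl -n monitoring get pods --watch
kubectl -n monitoring logs deployment/gitlab-monitoring-promethe-operator --tail=50
```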
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (if needed; cases include DB changes) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents.