2021-10-12: Upgrade Prometheus Operator in pre
Production Change
Change Summary
With gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!488 (merged) we need to upgrade the Prometheus Operator CRDs to v0.50.0. These need to be handled by hand for the same reason as #3223 (closed). https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack#from-17x-to-18x also states that Helm does not automatically upgrade or install new CRDs on a chart upgrade, so the CRDs have to be installed manually before upgrading.
A follow-up change management issue will be created to upgrade the rest of the clusters.
Change Details
- Services Impacted - Service::Prometheus
- Change Technician - @steveazz
- Change Reviewer - @mwasilewski-gitlab
- Time tracking - 15 minutes
- Downtime Component - none
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 1
- Set label change::in-progress on this issue
- Configure `kubectl` to connect to the cluster in `gitlab-pre` (a quick context check follows this list):

  ```shell
  gcloud container clusters get-credentials pre-gitlab-gke --region us-east1 --project gitlab-pre
  ```
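As a sanity check (not part of the documented steps), it can help to confirm the active context before touching any CRDs; a minimal sketch:

```shell
# Confirm kubectl is pointing at the pre cluster before changing anything.
kubectl config current-context   # expected to reference pre-gitlab-gke
kubectl get nodes                # should list nodes from the pre cluster
```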
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 5
- Upgrade the operator CRDs inside of a shell with `kubectl` configured:

  ```shell
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.50.0/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagerconfigs.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.50.0/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagers.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.50.0/example/prometheus-operator-crd/monitoring.coreos.com_podmonitors.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.50.0/example/prometheus-operator-crd/monitoring.coreos.com_probes.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.50.0/example/prometheus-operator-crd/monitoring.coreos.com_prometheuses.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.50.0/example/prometheus-operator-crd/monitoring.coreos.com_prometheusrules.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.50.0/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.50.0/example/prometheus-operator-crd/monitoring.coreos.com_thanosrulers.yaml
  ```
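For convenience, the eight apply commands above can also be run as a loop; this is just a sketch over the same v0.50.0 URLs, not a different procedure:

```shell
# Apply all eight Prometheus Operator CRD manifests for v0.50.0.
BASE="https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.50.0/example/prometheus-operator-crd"
for crd in alertmanagerconfigs alertmanagers podmonitors probes prometheuses prometheusrules servicemonitors thanosrulers; do
  kubectl apply -f "${BASE}/monitoring.coreos.com_${crd}.yaml"
done
```

If `kubectl apply` rejects one of the larger CRDs with an annotation-size error (the `prometheuses` CRD is the usual suspect in some operator releases), `kubectl replace -f` against the same URL is a common workaround; that is not part of the documented steps here, so treat it as a fallback assumption.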
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5
- Check that each CRD was updated, verifying that the `Time` values reflect the current timestamp:

  ```shell
  kubectl get crd/alertmanagerconfigs.monitoring.coreos.com -o yaml | grep 'Time'
  kubectl get crd/alertmanagers.monitoring.coreos.com -o yaml | grep 'Time'
  kubectl get crd/podmonitors.monitoring.coreos.com -o yaml | grep 'Time'
  kubectl get crd/probes.monitoring.coreos.com -o yaml | grep 'Time'
  kubectl get crd/prometheuses.monitoring.coreos.com -o yaml | grep 'Time'
  kubectl get crd/prometheusrules.monitoring.coreos.com -o yaml | grep 'Time'
  kubectl get crd/servicemonitors.monitoring.coreos.com -o yaml | grep 'Time'
  kubectl get crd/thanosrulers.monitoring.coreos.com -o yaml | grep 'Time'
  ```
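As a stricter optional check (an assumption, not part of the documented steps): after a successful apply the live CRDs should match the v0.50.0 manifests, so `kubectl diff` should print nothing and exit 0 for each file:

```shell
# kubectl diff exits 0 when the live object already matches the manifest.
BASE="https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.50.0/example/prometheus-operator-crd"
for crd in alertmanagerconfigs alertmanagers podmonitors probes prometheuses prometheusrules servicemonitors thanosrulers; do
  kubectl diff -f "${BASE}/monitoring.coreos.com_${crd}.yaml" && echo "OK: ${crd}"
done
```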
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 5
- Roll back to the previous operator CRDs:

  ```shell
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/release-0.42/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagerconfigs.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/release-0.42/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagers.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/release-0.42/example/prometheus-operator-crd/monitoring.coreos.com_podmonitors.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/release-0.42/example/prometheus-operator-crd/monitoring.coreos.com_probes.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/release-0.42/example/prometheus-operator-crd/monitoring.coreos.com_prometheuses.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/release-0.42/example/prometheus-operator-crd/monitoring.coreos.com_prometheusrules.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/release-0.42/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
  kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/release-0.42/example/prometheus-operator-crd/monitoring.coreos.com_thanosrulers.yaml
  ```
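The same loop form works for the rollback, pinned to the `release-0.42` branch instead of the v0.50.0 tag; a convenience sketch mirroring the commands above:

```shell
# Re-apply the CRD manifests from the release-0.42 branch to roll back.
BASE="https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/release-0.42/example/prometheus-operator-crd"
for crd in alertmanagerconfigs alertmanagers podmonitors probes prometheuses prometheusrules servicemonitors thanosrulers; do
  kubectl apply -f "${BASE}/monitoring.coreos.com_${crd}.yaml"
done
```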
Monitoring
Key metrics to observe
- Metric: the gitlab-monitoring-promethe-operator deployment
- Location: https://console.cloud.google.com/kubernetes/deployment/us-east1/pre-gitlab-gke/monitoring/gitlab-monitoring-promethe-operator/overview?project=gitlab-pre&pageState=(%22savedViews%22:(%22i%22:%22a41cc54e84014ee2a23c9bdd66e7383a%22,%22c%22:%5B%5D,%22n%22:%5B%5D))
- What changes to this metric should prompt a rollback: any problem that causes pods to crash.
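The same deployment can also be watched from the shell that is already pointed at the pre cluster; the namespace and deployment name below are taken from the console URL above:

```shell
# Watch the operator deployment and its pods after the CRD update.
kubectl -n monitoring rollout status deployment/gitlab-monitoring-promethe-operator
kubectl -n monitoring get pods --watch
kubectl -n monitoring logs deployment/gitlab-monitoring-promethe-operator --tail=50
```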
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (if needed; cases include DB changes) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents.