Production Change

Change Summary

Upgrade Prometheus from v2.30.0 to v2.34.0 on gprd GKE clusters

Change Details

Services Impacted - Service::Prometheus
Change Technician - @steveazz
Change Reviewer - @rehab
Time tracking - 60 minutes
Downtime Component - none

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1

Make sure gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!649 is approved.
Set label change::in-progress on this issue

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 2

Merge gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!649

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 25

Check that the gprd-us-east1-b cluster was updated and both pods are running

# Window 1
$ glsh kube use-cluster gprd-us-east1-b
# Window 2
$ watch kubectl -n monitoring get po -l app=prometheus -o wide
# Window 3
$ kubectl -n monitoring logs prometheus-gitlab-monitoring-promethe-prometheus-1 --follow

Check that the gprd-us-east1-c cluster was updated and both pods are running

# Window 1
$ glsh kube use-cluster gprd-us-east1-c
# Window 2
$ watch kubectl -n monitoring get po -l app=prometheus -o wide
# Window 3
$ kubectl -n monitoring logs prometheus-gitlab-monitoring-promethe-prometheus-1 --follow

Check that the gprd-us-east1-d cluster was updated and both pods are running

# Window 1
$ glsh kube use-cluster gprd-us-east1-d
# Window 2
$ watch kubectl -n monitoring get po -l app=prometheus -o wide
# Window 3
$ kubectl -n monitoring logs prometheus-gitlab-monitoring-promethe-prometheus-1 --follow
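The manual checks above could also be scripted per cluster. The following is a minimal sketch, not part of the approved plan: the function names are hypothetical, the container name prometheus is taken from the Pod Info dashboard link, and it assumes the image reference ends in the version tag (for example :v2.34.0). It only reads pod state and changes nothing.

```shell
#!/usr/bin/env bash
# Sketch only: list_prometheus_pods and check_prometheus_pods are hypothetical helpers.

# Print "<pod name> <phase> <image of the prometheus container>" for each pod.
list_prometheus_pods() {
  kubectl -n monitoring get po -l app=prometheus \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{" "}{.spec.containers[?(@.name=="prometheus")].image}{"\n"}{end}'
}

# Read the lines produced above on stdin and verify that:
#   - every pod is in phase Running
#   - every image ends in ":<expected tag>" (assumption about the image naming)
#   - the number of pods matches the expected count (default 2, "both pods")
check_prometheus_pods() {
  local expected_tag="$1" expected_count="${2:-2}"
  local name phase image count=0 failed=0
  while read -r name phase image || [ -n "$name" ]; do
    [ -z "$name" ] && continue
    count=$((count + 1))
    if [ "$phase" != "Running" ]; then
      echo "FAIL $name: phase is $phase"
      failed=1
    fi
    case "$image" in
      *":$expected_tag") ;;
      *) echo "FAIL $name: image is $image"; failed=1 ;;
    esac
  done
  if [ "$count" -ne "$expected_count" ]; then
    echo "FAIL expected $expected_count pods, found $count"
    failed=1
  fi
  if [ "$failed" -eq 0 ]; then
    echo "OK $count pods Running on $expected_tag"
  fi
  return "$failed"
}

# Usage, after selecting a cluster with glsh kube use-cluster:
#   list_prometheus_pods | check_prometheus_pods v2.34.0
```

A non-zero exit status means at least one pod is not yet Running on the new version, so the watch in Window 2 should be left running until the rollout settles.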

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 30

Revert gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!649

Monitoring

Key metrics to observe

Metric: Apdex
- Location: https://dashboards.gitlab.net/d/monitoring-main/monitoring-overview?orgId=1&viewPanel=4146537329&from=1647569940000&to=1647591599999
- What changes to this metric should prompt a rollback: A drop in Apdex
Metric: Memory usage
- Location: https://dashboards.gitlab.net/d/monitoring-main/monitoring-overview?orgId=1&viewPanel=4146537329&from=1647569940000&to=1647591599999
- What changes to this metric should prompt a rollback: Memory usage exceeding 200 GB
Metric: Pod Info
- Location: https://dashboards.gitlab.net/d/kubernetes-pods/kubernetes-pods?orgId=1&var-datasource=default&var-cluster=gprd-gitlab-gke&var-namespace=monitoring&var-pod=prometheus-gitlab-monitoring-promethe-prometheus-0&var-container=prometheus
- What changes to this metric should prompt a rollback:
  - High CPU usage
  - Total Restarts Per Container is high
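The restart criterion can also be spot-checked from the command line alongside the dashboard. This is a rough sketch under stated assumptions: the helper names are hypothetical, and the restart threshold is a parameter chosen by the operator, since the plan does not define what counts as "high".

```shell
#!/usr/bin/env bash
# Sketch only: list_prometheus_restarts and flag_restarts are hypothetical helpers.

# Print "<pod name> <restart count of the prometheus container>" for each pod.
list_prometheus_restarts() {
  kubectl -n monitoring get po -l app=prometheus \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[?(@.name=="prometheus")].restartCount}{"\n"}{end}'
}

# Read "<name> <restarts>" lines on stdin and flag pods above the threshold.
# The threshold is an assumption supplied by the caller, not a value from the plan.
flag_restarts() {
  local max_restarts="$1" name restarts flagged=0
  while read -r name restarts || [ -n "$name" ]; do
    [ -z "$name" ] && continue
    if [ "${restarts:-0}" -gt "$max_restarts" ]; then
      echo "CHECK $name: restarted $restarts times"
      flagged=1
    fi
  done
  return "$flagged"
}

# Usage, with an operator-chosen threshold of 3 restarts:
#   list_prometheus_restarts | flag_restarts 3
```

A flagged pod is a prompt to look at the Pod Info dashboard and the container logs before deciding on a rollback; the script itself makes no decision.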

Summary of infrastructure changes

~~Does this change introduce new compute instances?~~
~~Does this change re-size any existing compute instances?~~
~~Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?~~

Change Reviewer checklist

C4 C3 C2 C1:

The scheduled day and time of execution of the change is appropriate.
The change plan is technically accurate.
The change plan includes estimated timing values based on previous testing.
The change plan includes a viable rollback plan.
The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
The change plan includes success measures for all steps/milestones during the execution.
The change adequately minimizes risk within the environment/service.
The performance implications of executing the change are well-understood and documented.
The specified metrics/monitoring dashboards provide sufficient visibility for the change.
  - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
The change has a primary and secondary SRE with knowledge of the details available during the change window.

2022-03-18: Upgrade Prometheus servers in gprd GKE Clusters