Complete Destructive Chart bump rollout in Production

Production Change

Change Summary

We need to finish rolling out the chart bump that resolves delivery#1992 (closed).

This CR completes the change documented in incident #5539 (comment 677613287).

Change Details

Services Impacted - API, Gitlab Shell, Web, Websockets, Container Registry, KAS
Change Technician - @skarbek
Change Reviewer - @cmiskell @pguinoiseau @ggillies
Time tracking - 1 hour
Downtime Component - better be 0

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1 minute

Set label change::in-progress on this issue

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 1 hour

Using the get-server-state and set-server-state scripts in our chef repo, we'll modify the state of our GKE backends, one cluster at a time.
Put cluster B into maintenance:

./bin/set-server-state gprd maint api-gke-us-east1-b
./bin/set-server-state gprd maint git-https-gke-us-east1-b
./bin/set-server-state gprd maint registry-us-east1-b
./bin/set-server-state gprd maint shell-gke-us-east1-b
./bin/set-server-state gprd maint ws-gke-us-east1-b
./bin/set-server-state gprd maint web-gke-us-east1-b

Play the job gprd-us-east1-b:upgrade in https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/pipelines/792929
Confirm that the job is complete
Return cluster B to ready:

./bin/set-server-state gprd ready api-gke-us-east1-b
./bin/set-server-state gprd ready git-https-gke-us-east1-b
./bin/set-server-state gprd ready registry-us-east1-b
./bin/set-server-state gprd ready shell-gke-us-east1-b
./bin/set-server-state gprd ready ws-gke-us-east1-b
./bin/set-server-state gprd ready web-gke-us-east1-b

Put cluster C into maintenance:

./bin/set-server-state gprd maint api-gke-us-east1-c
./bin/set-server-state gprd maint git-https-gke-us-east1-c
./bin/set-server-state gprd maint registry-us-east1-c
./bin/set-server-state gprd maint shell-gke-us-east1-c
./bin/set-server-state gprd maint ws-gke-us-east1-c
./bin/set-server-state gprd maint web-gke-us-east1-c

Play the job gprd-us-east1-c:upgrade in https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/pipelines/792929
Confirm that the job is complete
Return cluster C to ready:

./bin/set-server-state gprd ready api-gke-us-east1-c
./bin/set-server-state gprd ready git-https-gke-us-east1-c
./bin/set-server-state gprd ready registry-us-east1-c
./bin/set-server-state gprd ready shell-gke-us-east1-c
./bin/set-server-state gprd ready ws-gke-us-east1-c
./bin/set-server-state gprd ready web-gke-us-east1-c

Put cluster D into maintenance:

./bin/set-server-state gprd maint api-gke-us-east1-d
./bin/set-server-state gprd maint git-https-gke-us-east1-d
./bin/set-server-state gprd maint registry-us-east1-d
./bin/set-server-state gprd maint shell-gke-us-east1-d
./bin/set-server-state gprd maint ws-gke-us-east1-d
./bin/set-server-state gprd maint web-gke-us-east1-d

Play the job gprd-us-east1-d:upgrade in https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/pipelines/792929
Confirm that the job is complete
Return cluster D to ready:

./bin/set-server-state gprd ready api-gke-us-east1-d
./bin/set-server-state gprd ready git-https-gke-us-east1-d
./bin/set-server-state gprd ready registry-us-east1-d
./bin/set-server-state gprd ready shell-gke-us-east1-d
./bin/set-server-state gprd ready ws-gke-us-east1-d
./bin/set-server-state gprd ready web-gke-us-east1-d
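
The per-cluster sequence above (maintenance, play the upgrade job, ready) repeats the same six commands for each zone, so it can be sketched as a small wrapper. This is only an illustration, not part of the approved change: the function names and the DRY_RUN variable are hypothetical, and the only real command it calls is ./bin/set-server-state with the arguments already listed above. By default it prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Sketch only. Function names and DRY_RUN are hypothetical helpers;
# the set-server-state arguments are copied from the steps above.

# Backend name prefixes. Note that the registry backend has no "-gke" infix.
PREFIXES="api-gke git-https-gke registry shell-gke ws-gke web-gke"

# Print the six backend names for one zone suffix (b, c or d).
backends_for_zone() {
  local zone=$1 prefix
  for prefix in $PREFIXES; do
    printf '%s-us-east1-%s\n' "$prefix" "$zone"
  done
}

# Set every backend of one zone to the given state (maint or ready).
# With DRY_RUN=1 (the default) the commands are printed, not executed.
set_zone_state() {
  local state=$1 zone=$2 backend
  for backend in $(backends_for_zone "$zone"); do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "./bin/set-server-state gprd $state $backend"
    else
      ./bin/set-server-state gprd "$state" "$backend" || return 1
    fi
  done
}

# Manual gate: wait for the operator before moving on (skipped in dry run).
confirm() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    return 0
  fi
  printf '%s Press Enter to continue. ' "$1"
  read -r _
}

# One zone at a time: drain, upgrade (job is played by hand), restore.
rollout_zone() {
  local zone=$1
  set_zone_state maint "$zone" || return 1
  confirm "Play the upgrade job for gprd-us-east1-$zone and wait for it to finish."
  set_zone_state ready "$zone" || return 1
  confirm "Check the Apdex and error SLO dashboard before the next zone."
}
```

Running `for zone in b c d; do rollout_zone "$zone"; done` prints the full plan in order. The gates between zones are where the operator can slow the rate of change if the metrics degrade.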

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 1 minute

Confirm all jobs on Pipeline: https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/pipelines/792929 are complete
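
The confirmation above can also be scripted. The following is a rough sketch that assumes "complete" means every job reports the status "success"; manual jobs that were never played, or skipped jobs, would need their own handling. The function name and the OPS_TOKEN variable are hypothetical. The commented example uses the GitLab REST endpoint for listing a pipeline's jobs and needs curl and jq.

```shell
#!/usr/bin/env bash
# Sketch only. Reads one job status per line on stdin and succeeds only if
# there is at least one status and all of them are "success".
# Assumption: "complete" in this change means status "success".
all_jobs_succeeded() {
  local status seen=0
  while IFS= read -r status; do
    [ -n "$status" ] || continue
    seen=1
    if [ "$status" != "success" ]; then
      echo "job not complete: $status" >&2
      return 1
    fi
  done
  [ "$seen" -eq 1 ]
}

# Example feed (not executed here). OPS_TOKEN is a hypothetical variable
# holding a personal access token for ops.gitlab.net:
#
#   curl --silent --fail --header "PRIVATE-TOKEN: $OPS_TOKEN" \
#     "https://ops.gitlab.net/api/v4/projects/gitlab-com%2Fgl-infra%2Fk8s-workloads%2Fgitlab-com/pipelines/792929/jobs?per_page=100" \
#     | jq -r '.[].status' | all_jobs_succeeded
```

The check fails on empty input so that a failed or unauthorized API call is not mistaken for a finished pipeline.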

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - N/A (there is no rollback procedure)

There is no rollback procedure. This chart version is running well on our regional cluster in Production and on all lower environments. If we are impacting our metrics (described below), we must slow down the rate of change. We are targeting low-traffic times to keep customer impact as small as possible.

Monitoring

Key metrics to observe

Metric: Apdex and Error SLOs
- Location: https://dashboards.gitlab.net/d/general-public-splashscreen/general-gitlab-dashboards?orgId=1&from=now-1h&to=now
- What changes to this metric should prompt a rollback: Instead of rolling back, we should slow down the rate of change.

Summary of infrastructure changes

Does this change introduce new compute instances? No
Does this change re-size any existing compute instances? No
Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc? No

Changes checklist

Edited Sep 15, 2021 by Graeme Gillies