Complete Destructive Chart bump rollout in Production

Production Change

Change Summary

We need to finish rolling out a chart bump that resolves issue delivery#1992 (closed).

This CR completes the change documented in incident #5539 (comment 677613287).

Change Details

  1. Services Impacted - Service::API, Service::Gitlab Shell, Service::Web, Service::Websockets, Service::Container Registry, Service::KAS
  2. Change Technician - @skarbek
  3. Change Reviewer - @cmiskell @pguinoiseau @ggillies
  4. Time tracking - 1 hour
  5. Downtime Component - none expected; the target is 0

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1 minute

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 1 hour

  • Using the get-server-state and set-server-state scripts in our chef repo, we'll change the state of our GKE backends one cluster at a time

  • Put cluster B into maintenance:

./bin/set-server-state gprd maint api-gke-us-east1-b
./bin/set-server-state gprd maint git-https-gke-us-east1-b
./bin/set-server-state gprd maint registry-us-east1-b
./bin/set-server-state gprd maint shell-gke-us-east1-b
./bin/set-server-state gprd maint ws-gke-us-east1-b
./bin/set-server-state gprd maint web-gke-us-east1-b
  • Return cluster B to ready once the rollout on it completes:
./bin/set-server-state gprd ready api-gke-us-east1-b
./bin/set-server-state gprd ready git-https-gke-us-east1-b
./bin/set-server-state gprd ready registry-us-east1-b
./bin/set-server-state gprd ready shell-gke-us-east1-b
./bin/set-server-state gprd ready ws-gke-us-east1-b
./bin/set-server-state gprd ready web-gke-us-east1-b
  • Put cluster C into maintenance:
./bin/set-server-state gprd maint api-gke-us-east1-c
./bin/set-server-state gprd maint git-https-gke-us-east1-c
./bin/set-server-state gprd maint registry-us-east1-c
./bin/set-server-state gprd maint shell-gke-us-east1-c
./bin/set-server-state gprd maint ws-gke-us-east1-c
./bin/set-server-state gprd maint web-gke-us-east1-c
  • Return cluster C to ready once the rollout on it completes:
./bin/set-server-state gprd ready api-gke-us-east1-c
./bin/set-server-state gprd ready git-https-gke-us-east1-c
./bin/set-server-state gprd ready registry-us-east1-c
./bin/set-server-state gprd ready shell-gke-us-east1-c
./bin/set-server-state gprd ready ws-gke-us-east1-c
./bin/set-server-state gprd ready web-gke-us-east1-c
  • Put cluster D into maintenance:
./bin/set-server-state gprd maint api-gke-us-east1-d
./bin/set-server-state gprd maint git-https-gke-us-east1-d
./bin/set-server-state gprd maint registry-us-east1-d
./bin/set-server-state gprd maint shell-gke-us-east1-d
./bin/set-server-state gprd maint ws-gke-us-east1-d
./bin/set-server-state gprd maint web-gke-us-east1-d
  • Return cluster D to ready once the rollout on it completes:
./bin/set-server-state gprd ready api-gke-us-east1-d
./bin/set-server-state gprd ready git-https-gke-us-east1-d
./bin/set-server-state gprd ready registry-us-east1-d
./bin/set-server-state gprd ready shell-gke-us-east1-d
./bin/set-server-state gprd ready ws-gke-us-east1-d
./bin/set-server-state gprd ready web-gke-us-east1-d
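
The per-cluster command lists above follow one pattern, so they can be generated rather than copied by hand. The following is a minimal sketch, not part of the chef repo: the function names backends_for_zone and print_state_commands are hypothetical, and the helper only prints commands so they can be reviewed before anything is run. The backend names are copied from the steps above, including the registry backend, which has no "gke" segment.

```shell
#!/usr/bin/env bash
# Sketch only: these helpers are hypothetical and are not shipped with the
# chef repo. They print commands; they never execute set-server-state.

# List the six backends for one cluster zone (b, c, or d).
# Names are taken verbatim from the change steps; the registry backend
# intentionally lacks the "gke" segment.
backends_for_zone() {
  local zone="$1"
  printf '%s\n' \
    "api-gke-us-east1-${zone}" \
    "git-https-gke-us-east1-${zone}" \
    "registry-us-east1-${zone}" \
    "shell-gke-us-east1-${zone}" \
    "ws-gke-us-east1-${zone}" \
    "web-gke-us-east1-${zone}"
}

# Print one set-server-state command per backend for the given
# state (maint or ready) and zone.
print_state_commands() {
  local state="$1"
  local zone="$2"
  local backend
  for backend in $(backends_for_zone "$zone"); do
    echo "./bin/set-server-state gprd ${state} ${backend}"
  done
}
```

For example, `print_state_commands maint b` prints the six maintenance commands for cluster B, and `print_state_commands ready b` prints the matching commands that return it to service.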

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 1 minute

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - N/A (there is no rollback procedure)

There is no rollback procedure. This chart version is running well on our regional cluster in Production and on all lower environments. If we see an impact on our metrics (described below), we must slow down the rate of change. We are targeting low-traffic times to minimize the impact on customers.
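
Because the only mitigation is to slow down, it may help to put an explicit operator check between clusters. The following is a sketch under assumptions: confirm_next_cluster is a hypothetical helper, not an existing script, and the metrics check itself is left to the operator.

```shell
#!/usr/bin/env bash
# Sketch only: a hypothetical gate between clusters. It does not inspect
# any metrics; the operator checks the dashboards and answers by hand.

# Ask before moving on to the given cluster zone.
# Succeeds only if the operator answers exactly "yes"; any other answer,
# or end of input, fails so the caller can pause the rollout.
confirm_next_cluster() {
  local zone="$1"
  local answer
  printf 'Metrics healthy? Proceed with cluster %s (yes/no): ' "$zone" >&2
  read -r answer || return 1
  [ "$answer" = "yes" ]
}

# Example use (commented out so that sourcing this file runs nothing):
# for zone in b c d; do
#   confirm_next_cluster "$zone" || { echo "Pausing before cluster $zone" >&2; break; }
#   # run the maint and ready steps for this zone here
# done
```

Failing closed on anything other than "yes" keeps an accidental keypress from advancing the rollout to the next cluster.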

Monitoring

Key metrics to observe

Summary of infrastructure changes

  • Does this change introduce new compute instances? No
  • Does this change re-size any existing compute instances? No
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • This Change Issue is linked to the appropriate Issue and/or Epic
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
  • There are currently no active incidents.
Edited by Graeme Gillies