Complete Destructive Chart bump rollout in Production
Production Change
Change Summary
We need to finish the rollout of a chart bump that resolves issue delivery#1992 (closed).
This CR completes the change documented in incident #5539 (comment 677613287).
Change Details
- Services Impacted - Service::API, Service::Gitlab Shell, Service::Web, Service::Websockets, Service::Container Registry, Service::KAS
- Change Technician - @skarbek
- Change Reviewer - @cmiskell @pguinoiseau @ggillies
- Time tracking - 1 hour
- Downtime Component - better be 0
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 1 minute
- Set label change::in-progress on this issue
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 1 hour
- Using the get-server-state and set-server-state scripts in our chef repo, we'll modify the state of our GKE backends
- Put cluster B into maintenance:
./bin/set-server-state gprd maint api-gke-us-east1-b
./bin/set-server-state gprd maint git-https-gke-us-east1-b
./bin/set-server-state gprd maint registry-us-east1-b
./bin/set-server-state gprd maint shell-gke-us-east1-b
./bin/set-server-state gprd maint ws-gke-us-east1-b
./bin/set-server-state gprd maint web-gke-us-east1-b
- Play the job for gprd-us-east1-b:upgrade: https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/pipelines/792929
- That job is now complete
- Ready up cluster B:
./bin/set-server-state gprd ready api-gke-us-east1-b
./bin/set-server-state gprd ready git-https-gke-us-east1-b
./bin/set-server-state gprd ready registry-us-east1-b
./bin/set-server-state gprd ready shell-gke-us-east1-b
./bin/set-server-state gprd ready ws-gke-us-east1-b
./bin/set-server-state gprd ready web-gke-us-east1-b
- Put cluster C into maintenance:
./bin/set-server-state gprd maint api-gke-us-east1-c
./bin/set-server-state gprd maint git-https-gke-us-east1-c
./bin/set-server-state gprd maint registry-us-east1-c
./bin/set-server-state gprd maint shell-gke-us-east1-c
./bin/set-server-state gprd maint ws-gke-us-east1-c
./bin/set-server-state gprd maint web-gke-us-east1-c
- Play the job for gprd-us-east1-c:upgrade: https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/pipelines/792929
- That job is now complete
- Ready up cluster C:
./bin/set-server-state gprd ready api-gke-us-east1-c
./bin/set-server-state gprd ready git-https-gke-us-east1-c
./bin/set-server-state gprd ready registry-us-east1-c
./bin/set-server-state gprd ready shell-gke-us-east1-c
./bin/set-server-state gprd ready ws-gke-us-east1-c
./bin/set-server-state gprd ready web-gke-us-east1-c
- Put cluster D into maintenance:
./bin/set-server-state gprd maint api-gke-us-east1-d
./bin/set-server-state gprd maint git-https-gke-us-east1-d
./bin/set-server-state gprd maint registry-us-east1-d
./bin/set-server-state gprd maint shell-gke-us-east1-d
./bin/set-server-state gprd maint ws-gke-us-east1-d
./bin/set-server-state gprd maint web-gke-us-east1-d
- Play the job for gprd-us-east1-d:upgrade: https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/pipelines/792929
- That job is now complete
- Ready up cluster D:
./bin/set-server-state gprd ready api-gke-us-east1-d
./bin/set-server-state gprd ready git-https-gke-us-east1-d
./bin/set-server-state gprd ready registry-us-east1-d
./bin/set-server-state gprd ready shell-gke-us-east1-d
./bin/set-server-state gprd ready ws-gke-us-east1-d
./bin/set-server-state gprd ready web-gke-us-east1-d
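The per-zone command groups above differ only in the zone letter and the target state, so they can be generated rather than typed out. The following is a minimal sketch, not a script that exists in the chef repo: the function names and the DRY_RUN switch are hypothetical, and it assumes it is run from the directory that contains ./bin/set-server-state. Only the command form and the backend names are taken from the steps above.

```shell
# Sketch only: wraps the ./bin/set-server-state calls listed in the change steps.
# backends_for_zone, set_zone_state and DRY_RUN are hypothetical names.

# Print the six backend names for one zone letter (b, c or d).
# The registry backend has no "gke" segment in its name.
backends_for_zone() {
  local zone="$1"
  printf '%s\n' \
    "api-gke-us-east1-${zone}" \
    "git-https-gke-us-east1-${zone}" \
    "registry-us-east1-${zone}" \
    "shell-gke-us-east1-${zone}" \
    "ws-gke-us-east1-${zone}" \
    "web-gke-us-east1-${zone}"
}

# Set every backend in a zone to the given state (maint or ready).
# With DRY_RUN unset or set to 1, the commands are printed instead of executed.
set_zone_state() {
  local state="$1" zone="$2" backend
  case "$state" in
    maint|ready) ;;
    *) echo "unsupported state: $state" >&2; return 1 ;;
  esac
  for backend in $(backends_for_zone "$zone"); do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "./bin/set-server-state gprd ${state} ${backend}"
    else
      # Stop at the first failure so a zone is never left half-changed unnoticed.
      ./bin/set-server-state gprd "${state}" "${backend}" || return 1
    fi
  done
}
```

For one zone the sequence would be `DRY_RUN=0 set_zone_state maint b`, then play the upgrade job and wait for it to complete, then `DRY_RUN=0 set_zone_state ready b`. Defaulting to a dry run lets the operator review the exact commands before anything is changed.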
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 1 minute
- Confirm all jobs on Pipeline: https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/pipelines/792929 are complete
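The confirmation above can also be done from a terminal through the GitLab pipeline jobs API instead of the web UI. This is a sketch under assumptions: GITLAB_TOKEN is a hypothetical environment variable holding a token with read access to the API on ops.gitlab.net, the function names are made up for illustration, and curl and jq are assumed to be installed. Only the first 100 jobs are fetched.

```shell
# Sketch only: list jobs of the pipeline that have not succeeded.
# GITLAB_TOKEN, unfinished_jobs and fetch_pipeline_jobs are hypothetical names.

API="https://ops.gitlab.net/api/v4"
PROJECT="gitlab-com%2Fgl-infra%2Fk8s-workloads%2Fgitlab-com"  # URL-encoded project path
PIPELINE_ID="792929"

# Read a JSON array of job objects on stdin and print "name<TAB>status"
# for every job whose status is anything other than "success".
unfinished_jobs() {
  jq -r '.[] | select(.status != "success") | "\(.name)\t\(.status)"'
}

# Fetch up to 100 jobs of the pipeline as JSON.
fetch_pipeline_jobs() {
  curl --silent --show-error --fail \
    --header "PRIVATE-TOKEN: ${GITLAB_TOKEN:?set GITLAB_TOKEN first}" \
    "${API}/projects/${PROJECT}/pipelines/${PIPELINE_ID}/jobs?per_page=100"
}
```

Running `fetch_pipeline_jobs | unfinished_jobs` prints nothing when every fetched job has succeeded; any line of output names a job that still needs attention, for example one that is still manual, running, or failed.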
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - N/A (no rollback procedure)
There is no rollback procedure. This chart version is running well on our regional cluster in Production and on all lower environments. If we are impacting our metrics (described below), we must slow down the rate of change. We are targeting low-traffic times to minimize the impact on customers.
Monitoring
Key metrics to observe
- Metric: Apdex and Error SLOs
- Location: https://dashboards.gitlab.net/d/general-public-splashscreen/general-gitlab-dashboards?orgId=1&from=now-1h&to=now
- What changes to this metric should prompt a rollback: Instead of rolling back, we should slow down the rate of change.
Summary of infrastructure changes
- Does this change introduce new compute instances? No
- Does this change re-size any existing compute instances? No
- Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc? No
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgment.)
- Release managers have been informed (if needed; cases include DB changes) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents.