Upgrade gke zonal cluster `gprd-us-east1-c`
Production Change
Change Summary
All 4 GKE clusters in gprd need to be upgraded to 1.17. All other environments have already been upgraded, as detailed at https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11796. This change issue is for cluster `gprd-us-east1-c` only.
Change Details
- Services Impacted - All services in `gke-us-east1-c`: git+https, git+ssh (a small amount of traffic is sent to it), websockets, Prometheus monitoring infrastructure, and the fluentd log collector
- Change Technician - @ggillies
- Change Criticality - C2
- Change Type - changescheduled
- Change Reviewer - @msmiley
- Due Date - 2020-11-18 00:00
- Time tracking - 6 hours
- Downtime Component - No need for downtime
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 5
- Contact the EOC to confirm that it is OK to perform the upgrade (no outstanding incidents, and nothing else that would block this change request from going ahead)
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 120-360
- Merge and apply MR https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2209 to enable auto-upgrades on the node pools only (so they are not forced to upgrade at the same time as we upgrade the master)
- Merge and apply MR https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2210 to upgrade the GKE masters
- Watch the upgrade operation on console-01-sv-gprd.c.gitlab-production.internal using `gcloud --project gitlab-production container operations list` and confirm it succeeds
- Merge and apply MR https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2213 to disable auto-upgrade on the node pools (which will force them to upgrade immediately)
- Watch the node pool upgrade operations on console-01-sv-gprd.c.gitlab-production.internal using `gcloud --project gitlab-production container operations list` and confirm they all succeed
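The two "watch" steps above can be scripted instead of re-running the command by hand. A minimal sketch, assuming `gcloud` is authenticated against the project; the 60-second interval and the simple `RUNNING` grep are illustrative choices, not part of the original plan:

```shell
#!/usr/bin/env bash
# Sketch: poll GKE operations until none are still RUNNING.

# ops_running: reads `gcloud container operations list` output on stdin
# and succeeds if any listed operation is still RUNNING.
ops_running() {
  grep -q 'RUNNING'
}

# wait_for_operations: loop until no upgrade operation is RUNNING.
wait_for_operations() {
  while gcloud --project gitlab-production container operations list \
      | ops_running; do
    echo "upgrade operations still running; checking again in 60s..."
    sleep 60
  done
  echo "no operations still running"
}
```

Invoke `wait_for_operations` from the console node after merging each MR; it returns once the operations list no longer shows anything RUNNING.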
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5
- Run `gcloud --project gitlab-production container clusters list` and confirm that the master and node pool versions are `1.17.13-gke.1400`
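The version check above can be made mechanical. A sketch, assuming `gcloud`'s `--format='value(...)'` output of one version per line (the exact field names, e.g. `currentMasterVersion`, are an assumption and may need adjusting):

```shell
#!/usr/bin/env bash
# Sketch: fail if any reported cluster version differs from the target.

EXPECTED="1.17.13-gke.1400"

# versions_match: reads one version string per line on stdin and
# succeeds only if every line is exactly EXPECTED.
versions_match() {
  ! grep -vqFx "$EXPECTED"
}
```

Usage would look something like `gcloud --project gitlab-production container clusters list --format='value(currentMasterVersion)' | versions_match && echo ok` (and similarly for the node version field).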
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - N/A
Due to the nature of GKE upgrades, there is unfortunately no way for us to roll back. As this is a zonal cluster upgrade, if something goes wrong we can stop sending traffic to the entire GKE cluster with:

```
cd chef-repo
./bin/set-server-state gprd drain git-gke-us-east1-c
./bin/set-server-state gprd drain gitlab-shell-us-east1-c
```

After that we can reach out to Google support with a sev 1 issue to attempt to recover the cluster. In the case of complete catastrophic failure, we can destroy the cluster and recreate it using terraform (and bootstrap it following the instructions at https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/uncategorized/k8s-new-cluster.md).
Monitoring
Key metrics to observe
- Metric: `haproxy_ssh_request_duration_seconds_count`, `haproxy_ssh_request_duration_seconds_bucket`
  - Location: https://dashboards.gitlab.net/d/-UvftW1iz/ssh-performance?orgId=1
  - What changes to this metric should prompt a rollback: An abnormal increase in request duration that begins after the upgrade starts and does not settle once the upgrade is complete
- Metric: `gitlab_service_apdex:ratio{type="git"}`
  - Location: https://dashboards.gitlab.net/d/git-main/git-overview?orgId=1
  - What changes to this metric should prompt a rollback: Latency or error ratios high enough to violate our apdex
Summary of infrastructure changes
- Does this change introduce new compute instances? In effect, yes: new nodes are built and the old ones are discarded
- Does this change re-size any existing compute instances? It does not
- Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc.? It does not

Upgrading the nodes of a GKE cluster essentially involves creating entirely new nodes, moving workloads onto them, and then deleting the old ones. As the nodes run Google COS, there is nothing of value stored on them anyway.
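One way to watch this node replacement from inside the cluster (assuming `kubectl` access to it; the field position matches the default `kubectl get nodes` column layout) is to tally nodes per kubelet version as old nodes drain away and new ones join:

```shell
#!/usr/bin/env bash
# Sketch: count nodes per kubelet version from `kubectl get nodes` output.
# During the upgrade the old version's count shrinks as the new one grows.

count_node_versions() {
  # Default columns: NAME STATUS ROLES AGE VERSION (version is field 5);
  # skip the header row.
  awk 'NR > 1 { counts[$5]++ } END { for (v in counts) print v, counts[v] }'
}
```

Usage: `kubectl get nodes | count_node_versions`, re-run periodically until only `v1.17.13-gke.1400` remains.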
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue.)
- There are currently no active incidents.