Install Kubernetes Web deployment into production
Production Change
Change Summary
This CR safely rolls out the changes necessary to install the Web deployment (which would replace our Web VMs) into Kubernetes. This involves creating a new node pool as well as the Deployment objects. Note that the new deployments will not accept any traffic other than health checks from internal components. When the Deployment comes online, it will be limited to a single Pod. This will enable team::Delivery to perform the necessary audits of configurations and observability leading up to the readiness reviews for this service, prior to accepting customer traffic.
Change Details
- Services Impacted - Service::Web
- Change Technician - @skarbek
- Change Reviewer - @hphilipps
- Time tracking - 2 hours
- Downtime Component - 0
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 1 Minute
- Set label change::in-progress on this issue
- Receive approval on MR: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2796
- Receive approval on MR: gitlab-com/gl-infra/k8s-workloads/gitlab-com!1049 (merged)
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 1 hour
- Merge and Apply: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2796
- Merge and Monitor Rollout of: gitlab-com/gl-infra/k8s-workloads/gitlab-com!1049 (merged)
- Perform Post-Change Steps
- Rebase and get approval on MR: gitlab-com/gl-infra/k8s-workloads/gitlab-com!1064 (merged)
- Merge and Monitor Rollout of: gitlab-com/gl-infra/k8s-workloads/gitlab-com!1064 (merged)
- Perform Post-Change Steps
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 10 minutes
- Validate puma is not receiving customer traffic: https://log.gprd.gitlab.net/goto/13c958bdaede584acf79783d3c169882
  - If we are seeing customer traffic, begin the Rollback steps immediately
- Validate workhorse is not receiving customer traffic: https://log.gprd.gitlab.net/goto/289ca24e1b2130302266a0b209f49046
  - If we are seeing customer traffic, begin the Rollback steps immediately
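The two validation checks above amount to confirming that every request reaching the new Pod is an internal health check. A minimal sketch of that logic follows. It assumes log records exported as dictionaries with a hypothetical `path` field, and it assumes the health check endpoints are `/-/liveness`, `/-/readiness`, and `/-/health`. The saved log searches linked above may filter on different fields, so treat this as an illustration rather than the actual query.

```python
from typing import Iterable

# Assumption: paths treated as internal health checks. Adjust this set to
# match whatever the saved log searches actually exclude.
HEALTHCHECK_PATHS = frozenset({"/-/liveness", "/-/readiness", "/-/health"})


def customer_requests(records: Iterable[dict]) -> list[dict]:
    """Return the records whose path is not a known health check."""
    return [r for r in records if r.get("path") not in HEALTHCHECK_PATHS]


def should_roll_back(records: Iterable[dict]) -> bool:
    """True if any non-healthcheck request reached the new Pod."""
    return len(customer_requests(records)) > 0
```

A record without a `path` field is counted as customer traffic, which errs on the side of starting the rollback.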
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 1 Hour
- Drain Canary: /chatops run canary --disable --production
- Revert MR: gitlab-com/gl-infra/k8s-workloads/gitlab-com!1064 (merged)
- Revert MR: gitlab-com/gl-infra/k8s-workloads/gitlab-com!1049 (merged)
- Revert MR: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2796
- Re-enable Canary: /chatops run canary --enable --production
Monitoring
Key metrics to observe
Note that this newly created service will not receive traffic, but it will talk to Redis, Postgres, and other backing services. Therefore, we limit the deployment to a single Pod. We'll still monitor the webservice for abnormalities.
- Metric: Apdex/Error SLO on Web fleet
- Location: https://dashboards.gitlab.net/d/web-main/web-overview?orgId=1
- What changes to this metric should prompt a rollback: Violation of SLO thresholds
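The rollback trigger above can be expressed as a simple threshold check. The sketch below is illustrative only: `SloThresholds`, `violates_slo`, and the numeric values are hypothetical and are not the Web fleet's real SLO definitions, which live in the dashboard linked above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloThresholds:
    min_apdex: float        # lowest acceptable Apdex score (0.0 to 1.0)
    max_error_ratio: float  # highest acceptable error ratio (0.0 to 1.0)


def violates_slo(apdex: float, error_ratio: float, slo: SloThresholds) -> bool:
    """True if either metric is outside its threshold."""
    return apdex < slo.min_apdex or error_ratio > slo.max_error_ratio


# Illustrative values only, not the Web fleet's real SLOs.
example = SloThresholds(min_apdex=0.99, max_error_ratio=0.001)
print(violates_slo(apdex=0.995, error_ratio=0.0002, slo=example))  # False
print(violates_slo(apdex=0.97, error_ratio=0.0002, slo=example))   # True
```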
Logs
- Puma: https://log.gprd.gitlab.net/goto/13c958bdaede584acf79783d3c169882
- Workhorse: https://log.gprd.gitlab.net/goto/289ca24e1b2130302266a0b209f49046
Summary of infrastructure changes
- Does this change introduce new compute instances? Yes
- Does this change re-size any existing compute instances? No
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No
We are creating a new node pool for 4 clusters. This involves creating a minimum of 3 nodes in our regional cluster for canary, and 3 nodes in each of our zonal clusters for a total of 6 nodes. The workloads associated with these will be limited to a single Pod running what would be our web service.
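One way to audit the single-Pod limit is to inspect the Deployment object after rollout. The sketch below assumes the JSON comes from `kubectl get deployment <name> -n <namespace> -o json`, where the name and namespace are placeholders not specified in this issue. It reads only the standard `spec.replicas` and `status.replicas` fields of a Deployment.

```python
import json


def is_single_pod(deployment_json: str) -> bool:
    """True if the Deployment requests one replica and reports at most one."""
    obj = json.loads(deployment_json)
    desired = obj.get("spec", {}).get("replicas")
    # status.replicas is omitted from the API response when it is zero.
    running = obj.get("status", {}).get("replicas", 0)
    return desired == 1 and running <= 1


# Hand-written sample document, not real cluster output.
sample = json.dumps({"spec": {"replicas": 1}, "status": {"replicas": 1}})
print(is_single_pod(sample))  # True
```

If an autoscaler manages the Deployment, `spec.replicas` reflects the autoscaler's current target, so the autoscaler's own minimum and maximum would need checking as well.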
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgement.)
- There are currently no active incidents.