# Migrate web traffic in production from VMs to Kubernetes

**Production Change**

## Change Summary
As part of delivery#1894 (closed), we wish to move production web traffic from running inside virtual machines to running inside Kubernetes.

To do this, we will perform a staged rollout, leveraging HAProxy weights to slowly shift small amounts of traffic from the virtual machines to Kubernetes. This will be done over an extended period with close monitoring to ensure there are no issues.
## Change Details
- **Services Impacted** - ServiceWeb
- **Change Technician** - @ggillies
- **Change Reviewer** - @skarbek
- **Time tracking** - 9.5 hours (570 minutes)
- **Downtime Component** - none
## Detailed steps for the change
### Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 10

- [ ] Set the label ~"change::in-progress" on this issue
- [ ] Work with the release managers (@release-managers) to make sure promotion to production does not happen during a window of changing traffic (but rather during a period of observation)
- [ ] Get approval for MR gitlab-com/gl-infra/k8s-workloads/gitlab-com!1124 (merged)
- [ ] Get approval for MR https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/488
### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 540 (9 hours)

- [ ] Merge and apply MR gitlab-com/gl-infra/k8s-workloads/gitlab-com!1124 (merged)
- [ ] Begin monitoring - see the monitoring section below
- [ ] Validate that we have not added any additional VMs into our infrastructure: `./bin/get-weights gprd web`. If we have, modify the steps below to account for them
- [ ] Shift traffic between the VMs and Kubernetes slowly:
  - [ ] Change 2.3% of traffic to GKE

    ```shell
    # Sets HAProxy to send GKE 2.3% of the traffic
    ./bin/set-weights gprd 'web-gke' 100
    date -u # Note this down to put on the issue
    ```

  - [ ] Add a comment to this issue noting when this weighting change was made, and monitor for 60 minutes
  - [ ] Change 5% of traffic to GKE

    ```shell
    # Sets HAProxy to send GKE 5% of the traffic (via reweighting the web VMs)
    ./bin/set-weights gprd 'web-[0,1,2,3,4]' 40
    date -u # Note this down to put on the issue
    ```

  - [ ] Add a comment to this issue noting when this weighting change was made, and monitor for 60 minutes
  - [ ] Change 10% of traffic to GKE

    ```shell
    # Sets HAProxy to send GKE 10% of the traffic (via reweighting the web VMs)
    ./bin/set-weights gprd 'web-[0,1,2,3,4]' 20
    date -u # Note this down to put on the issue
    ```

  - [ ] Add a comment to this issue noting when this weighting change was made, and monitor for 60 minutes
  - [ ] Change 20% of traffic to GKE

    ```shell
    # Sets HAProxy to send GKE 20% of the traffic (via reweighting the web VMs)
    ./bin/set-weights gprd 'web-[0,1,2,3,4]' 10
    date -u # Note this down to put on the issue
    ```

  - [ ] Add a comment to this issue noting when this weighting change was made, and monitor for 60 minutes
  - [ ] Change 32% of traffic to GKE

    ```shell
    # Sets HAProxy to send GKE 32% of the traffic (via reweighting the web VMs)
    ./bin/set-weights gprd 'web-[0,1,2,3,4]' 5
    date -u # Note this down to put on the issue
    ```

  - [ ] Add a comment to this issue noting when this weighting change was made, and monitor for 60 minutes
  - [ ] Change 44% of traffic to GKE

    ```shell
    # Sets HAProxy to send GKE 44% of the traffic (via reweighting the web VMs)
    ./bin/set-weights gprd 'web-[0,1,2,3,4]' 3
    date -u # Note this down to put on the issue
    ```

  - [ ] Add a comment to this issue noting when this weighting change was made, and monitor for 60 minutes
  - [ ] Change 54% of traffic to GKE

    ```shell
    # Sets HAProxy to send GKE 54% of the traffic (via reweighting the web VMs)
    ./bin/set-weights gprd 'web-[0,1,2,3,4]' 2
    date -u # Note this down to put on the issue
    ```

  - [ ] Add a comment to this issue noting when this weighting change was made, and monitor for 60 minutes
  - [ ] Change 68% of traffic to GKE

    ```shell
    # Sets HAProxy to send GKE 68% of the traffic (via reweighting the web VMs)
    ./bin/set-weights gprd 'web-[0,1,2,3,4]' 1
    date -u # Note this down to put on the issue
    ```

  - [ ] Add a comment to this issue noting when this weighting change was made, and monitor for 60 minutes
  - [ ] Change 97% of traffic to GKE

    ```shell
    # Sets HAProxy to send GKE 97% of the traffic (via reweighting the web VMs)
    # Note: canary will take the last 3% of traffic
    ./bin/set-weights gprd 'web-[0,1,2,3,4]' 0
    date -u # Note this down to put on the issue
    ```

  - [ ] Add a comment to this issue noting when this weighting change was made, and monitor for 60 minutes
- [ ] Validate and document the current state: `./bin/get-weights gprd web`. It is expected that all VMs have a weight of 0, that our Kubernetes endpoint has a weight of 100, and that the canary backend has a weight of 3
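The target percentages above follow from HAProxy's proportional weighting: a backend receives its weight divided by the sum of all backend weights in the pool. A minimal sketch of that arithmetic, assuming roughly 42 web VMs and a canary weight of 3 (both values inferred from the percentages in this plan, not taken from the runbook):

```python
def gke_share(gke_weight, vm_weight, n_vms=42, canary_weight=3):
    """Approximate fraction of web traffic HAProxy sends to the GKE backend.

    HAProxy distributes traffic proportionally to backend weights, so the
    GKE share is its weight divided by the sum of all backend weights.
    n_vms=42 and canary_weight=3 are assumptions for illustration only.
    """
    total = gke_weight + n_vms * vm_weight + canary_weight
    return gke_weight / total

# First step: GKE at weight 100, VMs still at their default weight of 100
print(round(gke_share(100, 100) * 100, 1))  # prints 2.3

# Final step: VMs weighted 0, canary keeps its weight of 3
print(round(gke_share(100, 0) * 100, 1))  # prints 97.1
```

This also shows why the final stage lands at 97% rather than 100%: with the VM weights at 0, the canary backend's residual weight of 3 still claims about 3% of the traffic.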
- [ ] Once everything looks good, merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/488 to make the fe nodes slowly remove the web VMs from circulation and keep only the k8s backends
- [ ] Roll out one node at a time with:

  ```shell
  bundle exec knife ssh 'roles:gprd-base-lb-fe' 'sudo chef-client' -C1
  ```
### Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 5

- [ ] Validate and document the current state: `./bin/get-weights gprd web`. It is expected that all VMs have a weight of 0, that our Kubernetes endpoint has a weight of 100, and that the canary backend has a weight of 3
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 10

- [ ] If https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/488 has been merged, create a rollback MR for it and apply it
- [ ] Set the weight of our VMs back to their default values: `./bin/set-weights gprd 'web-[0,1,2,3,4]' 100`
- [ ] Set the weight of our Kubernetes endpoint to 0: `./bin/set-weights gprd gke-web 0`
- [ ] Validate the current state: `./bin/get-weights gprd web`. It is expected that all VMs have a weight of 100, that our Kubernetes endpoint has a weight of 0, and that the Kubernetes endpoint for canary has a weight of 3 in the `web` backend
## Monitoring

### Key metrics to observe

- Metric: Apdex and error ratios
  - Location: https://dashboards.gitlab.net/d/web-main/web-overview?orgId=1
  - What changes to this metric should prompt a rollback: violation of the service apdex and error-ratio SLIs for puma, workhorse, and lb
- Metric: Backends up
  - Location: https://thanos.gitlab.net/graph?g0.expr=haproxy_backend_up%7Benvironment%3D%22gprd%22%2C%20backend%3D%22web%22%7D&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  - What changes to this metric should prompt a rollback: if any of the gke backends goes to 0 (that is, the backend is not up)
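For readability, the URL-encoded expression behind that Thanos link decodes to the following query:

```
haproxy_backend_up{environment="gprd", backend="web"}
```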
- Metric: 99th and 95th percentile of request duration for both VMs and k8s
  - Location: https://log.gprd.gitlab.net/goto/7ced17f5026ec690577a6dcfc87a193d
  - What changes to this metric should prompt a rollback: if the percentiles for Kubernetes start to grow significantly with no recovery after a few minutes. (It is possible this spikes up a little as the autoscaler brings on new pods, which may be slower than demand, but the percentiles should come back down after about a minute.)
- Metric: 99th and 95th percentile of queuing time for both VMs and k8s
  - Location: https://log.gprd.gitlab.net/goto/9011f1616ced3815513bef19389f6541
  - What changes to this metric should prompt a rollback: if Kubernetes sustains a high rate of queuing compared to the VMs, we are showing signs of saturation that must be investigated further. There are constant spikes in this metric, so do not make abrupt judgement calls on this one.
- Metric: 99th and 95th percentile of request duration for workhorse on both VMs and Kubernetes
  - Location: https://log.gprd.gitlab.net/goto/569fe07aae877b776d40d9a67614ec92
  - What changes to this metric should prompt a rollback: if Kubernetes sustains an elevated duration compared to the VMs, we may be showing signs of saturation. Consider investigating prior to rollback if able.
### Other dashboards to keep an eye on

- Zonal cluster pod info dashboards:
  - cluster b pod info
  - cluster c pod info
  - cluster d pod info
- Rails errors: https://log.gprd.gitlab.net/goto/b6ccffdb1d14fd3ceb9ed4111564767e
- Workhorse errors: https://log.gprd.gitlab.net/goto/d05071a5d3f299b393ee91e5be9e4899
## Summary of infrastructure changes

- [ ] Does this change introduce new compute instances?
- [ ] Does this change re-size any existing compute instances?
- [ ] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
## Changes checklist

- [ ] This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. ~"change::unscheduled", ~"change::scheduled") based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] This Change Issue is linked to the appropriate Issue and/or Epic.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- [ ] Release managers have been informed (if needed! cases include DB changes) prior to the change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
- [ ] There are currently no active incidents.