# Migrate web traffic in production from VMs to Kubernetes

**Production Change**

## Change Summary
As part of delivery#1894 (closed), we wish to move production web traffic from running inside virtual machines to running inside Kubernetes.

To do this, we will perform a staged rollout, leveraging HAProxy weights to slowly shift small amounts of traffic from the virtual machines to Kubernetes. This will be done over an extended period with close monitoring to ensure there are no issues.
## Change Details
- **Services Impacted** - ServiceWeb
- **Change Technician** - @ggillies
- **Change Reviewer** - @skarbek
- **Time tracking** - 9.5 hours (570 minutes)
- **Downtime Component** - none
## Detailed steps for the change
### Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 10

- [ ] Set the label ~"change::in-progress" on this issue
- [ ] Work with the release managers (@release-managers) to make sure promotion to production does not happen during a window of changing traffic (but rather during a period of observation)
- [ ] Get approval for MR gitlab-com/gl-infra/k8s-workloads/gitlab-com!1124 (merged)
- [ ] Get approval for MR https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/488
### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 540 (9 hours)

- [ ] Merge and apply MR gitlab-com/gl-infra/k8s-workloads/gitlab-com!1124 (merged)
- [ ] Begin monitoring - see the monitoring section below
- [ ] Validate that we have not added any additional VMs into our infrastructure: `./bin/get-weights gprd web`. If we have, modify the steps below to account for them
- [ ] Shift traffic between the VMs and Kubernetes slowly:
  - [ ] Change 2.3% of traffic to GKE

    ```shell
    # Sets HAProxy to send GKE 2.3% of the traffic
    ./bin/set-weights gprd 'web-gke' 100
    date -u # Note this down to put on the issue
    ```

  - [ ] Add a comment to this issue noting when this weighting change was made, and monitor for 60 minutes
  - [ ] Change 5% of traffic to GKE

    ```shell
    # Sets HAProxy to send GKE 5% of the traffic (via reweighting the web VMs)
    ./bin/set-weights gprd 'web-[0,1,2,3,4]' 40
    date -u # Note this down to put on the issue
    ```

  - [ ] Add a comment to this issue noting when this weighting change was made, and monitor for 60 minutes
  - [ ] Change 10% of traffic to GKE

    ```shell
    # Sets HAProxy to send GKE 10% of the traffic (via reweighting the web VMs)
    ./bin/set-weights gprd 'web-[0,1,2,3,4]' 20
    date -u # Note this down to put on the issue
    ```

  - [ ] Add a comment to this issue noting when this weighting change was made, and monitor for 60 minutes
  - [ ] Change 20% of traffic to GKE

    ```shell
    # Sets HAProxy to send GKE 20% of the traffic (via reweighting the web VMs)
    ./bin/set-weights gprd 'web-[0,1,2,3,4]' 10
    date -u # Note this down to put on the issue
    ```

  - [ ] Add a comment to this issue noting when this weighting change was made, and monitor for 60 minutes
  - [ ] Change 32% of traffic to GKE

    ```shell
    # Sets HAProxy to send GKE 32% of the traffic (via reweighting the web VMs)
    ./bin/set-weights gprd 'web-[0,1,2,3,4]' 5
    date -u # Note this down to put on the issue
    ```

  - [ ] Add a comment to this issue noting when this weighting change was made, and monitor for 60 minutes
  - [ ] Change 44% of traffic to GKE

    ```shell
    # Sets HAProxy to send GKE 44% of the traffic (via reweighting the web VMs)
    ./bin/set-weights gprd 'web-[0,1,2,3,4]' 3
    date -u # Note this down to put on the issue
    ```

  - [ ] Add a comment to this issue noting when this weighting change was made, and monitor for 60 minutes
  - [ ] Change 54% of traffic to GKE

    ```shell
    # Sets HAProxy to send GKE 54% of the traffic (via reweighting the web VMs)
    ./bin/set-weights gprd 'web-[0,1,2,3,4]' 2
    date -u # Note this down to put on the issue
    ```

  - [ ] Add a comment to this issue noting when this weighting change was made, and monitor for 60 minutes
  - [ ] Change 68% of traffic to GKE

    ```shell
    # Sets HAProxy to send GKE 68% of the traffic (via reweighting the web VMs)
    ./bin/set-weights gprd 'web-[0,1,2,3,4]' 1
    date -u # Note this down to put on the issue
    ```

  - [ ] Add a comment to this issue noting when this weighting change was made, and monitor for 60 minutes
  - [ ] Change 97% of traffic to GKE

    ```shell
    # Sets HAProxy to send GKE 97% of the traffic (via reweighting the web VMs)
    # Note: canary will take the last 3% of traffic
    ./bin/set-weights gprd 'web-[0,1,2,3,4]' 0
    date -u # Note this down to put on the issue
    ```

  - [ ] Add a comment to this issue noting when this weighting change was made, and monitor for 60 minutes
- [ ] Validate and document the current state: `./bin/get-weights gprd web`. It is expected that all VMs have a weight of 0, that our Kubernetes endpoint has a weight of 100, and that the canary backend has a weight of 3
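The target percentages above follow from HAProxy's proportional weighting: a backend receives its weight divided by the sum of all backend weights in the pool. A minimal sketch of that arithmetic, assuming roughly 42 web VMs and a canary weight of 3 (both values inferred from the percentages in this plan, not taken from the runbook):

```python
def gke_share(gke_weight, vm_weight, n_vms=42, canary_weight=3):
    """Approximate fraction of web traffic HAProxy sends to the GKE backend.

    HAProxy distributes traffic proportionally to backend weights, so the
    GKE share is its weight divided by the sum of all backend weights.
    n_vms=42 and canary_weight=3 are assumptions for illustration only.
    """
    total = gke_weight + n_vms * vm_weight + canary_weight
    return gke_weight / total

# First step: GKE at weight 100, VMs still at their default weight of 100
print(round(gke_share(100, 100) * 100, 1))  # prints 2.3

# Final step: VMs weighted 0, canary keeps its weight of 3
print(round(gke_share(100, 0) * 100, 1))  # prints 97.1
```

This also shows why the final stage lands at 97% rather than 100%: with the VM weights at 0, the canary backend's residual weight of 3 still claims about 3% of the traffic.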
- [ ] Once everything looks good, merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/488 to make the fe nodes slowly remove the web VMs from circulation and keep only the k8s backends
- [ ] Roll out one node at a time with:

  ```shell
  bundle exec knife ssh 'roles:gprd-base-lb-fe' 'sudo chef-client' -C1
  ```
### Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 5

- [ ] Validate and document the current state: `./bin/get-weights gprd web`. It is expected that all VMs have a weight of 0, that our Kubernetes endpoint has a weight of 100, and that the canary backend has a weight of 3
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 10

- [ ] If https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/488 has been merged, create a rollback MR for it and apply it
- [ ] Set the weight of our VMs back to their default values: `./bin/set-weights gprd 'web-[0,1,2,3,4]' 100`
- [ ] Set the weight of our Kubernetes endpoint to 0: `./bin/set-weights gprd gke-web 0`
- [ ] Validate the current state: `./bin/get-weights gprd web`. It is expected that all VMs have a weight of 100, that our Kubernetes endpoint has a weight of 0, and that the Kubernetes endpoint for canary has a weight of 3 in the `web` backend
## Monitoring

### Key metrics to observe

- Metric: Apdex and error ratios
  - Location: https://dashboards.gitlab.net/d/web-main/web-overview?orgId=1
  - What changes to this metric should prompt a rollback: violation of the service apdex and error-ratio SLIs for puma, workhorse, and lb
- Metric: Backends up
  - Location: https://thanos.gitlab.net/graph?g0.expr=haproxy_backend_up%7Benvironment%3D%22gprd%22%2C%20backend%3D%22web%22%7D&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  - What changes to this metric should prompt a rollback: if any of the gke backends goes to 0 (that is, the backend is not up)
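For readability, the URL-encoded expression behind that Thanos link decodes to the following query:

```
haproxy_backend_up{environment="gprd", backend="web"}
```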
- Metric: 99th and 95th percentile of request duration for both VMs and k8s
  - Location: https://log.gprd.gitlab.net/goto/7ced17f5026ec690577a6dcfc87a193d
  - What changes to this metric should prompt a rollback: if the percentiles for Kubernetes start to grow significantly with no recovery after a few minutes. (It is possible this spikes up a little as the autoscaler brings on new pods, which may be slower than demand, but the percentiles should come back down after about a minute.)
- Metric: 99th and 95th percentile of queuing time for both VMs and k8s
  - Location: https://log.gprd.gitlab.net/goto/9011f1616ced3815513bef19389f6541
  - What changes to this metric should prompt a rollback: if Kubernetes sustains a high rate of queuing compared to the VMs, we are showing signs of saturation that must be investigated further. There are constant spikes in this metric, so do not make abrupt judgement calls on this one.
- Metric: 99th and 95th percentile of request duration for workhorse on both VMs and Kubernetes
  - Location: https://log.gprd.gitlab.net/goto/569fe07aae877b776d40d9a67614ec92
  - What changes to this metric should prompt a rollback: if Kubernetes sustains an elevated duration compared to the VMs, we may be showing signs of saturation. Consider investigating prior to rollback if able.
### Other dashboards to keep an eye on

- Zonal cluster pod info dashboards:
  - cluster b pod info
  - cluster c pod info
  - cluster d pod info
- Rails errors: https://log.gprd.gitlab.net/goto/b6ccffdb1d14fd3ceb9ed4111564767e
- Workhorse errors: https://log.gprd.gitlab.net/goto/d05071a5d3f299b393ee91e5be9e4899
## Summary of infrastructure changes

- [ ] Does this change introduce new compute instances?
- [ ] Does this change re-size any existing compute instances?
- [ ] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
## Changes checklist

- [ ] This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. ~"change::unscheduled", ~"change::scheduled") based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] This Change Issue is linked to the appropriate Issue and/or Epic.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- [ ] Release managers have been informed (if needed! cases include DB changes) prior to the change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
- [ ] There are currently no active incidents.