Migrate web traffic in production from VMs to Kubernetes

Production Change

Change Summary

As part of delivery#1894 (closed), we want to move production web traffic from virtual machines to Kubernetes.

To do this, we will perform a staged rollout, using HAProxy weights to gradually shift small amounts of traffic from the virtual machines to Kubernetes. The rollout runs over roughly nine hours, with close monitoring after each step to catch any issues.
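
The percentages quoted in the change steps follow from the weights, assuming HAProxy splits traffic across the servers in the web backend in proportion to their weights. A minimal sketch of that arithmetic follows. The VM count of 42 is an assumption inferred from the 2.3% first step, not a figure stated in this issue, and gke_share is a hypothetical helper name; the canary weight of 3 is taken from the expected state described in this issue.

```shell
#!/usr/bin/env bash
# Estimate the percentage of traffic sent to the GKE backend.
# Arguments: GKE weight, per-VM weight, number of VMs (assumed), canary weight.
gke_share() {
  awk -v g="$1" -v v="$2" -v n="$3" -v c="$4" \
    'BEGIN { printf "%.1f\n", 100 * g / (g + v * n + c) }'
}

# Assumed fleet of 42 VMs; canary weight 3.
gke_share 100 100 42 3   # first step, VMs still at weight 100 -> 2.3
gke_share 100 40 42 3    # VMs reweighted to 40                -> 5.6
gke_share 100 0 42 3     # final step, VMs drained             -> 97.1
```

If ./bin/get-weights reports a different number of VMs, rerun the calculation with that count before relying on the step percentages.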

Change Details

  1. Services Impacted - ServiceWeb
  2. Change Technician - @ggillies
  3. Change Reviewer - @skarbek
  4. Time tracking - 9.5 hours (570 minutes)
  5. Downtime Component - none

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 10

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 540 (9 hours)

  • Merge and Apply MR gitlab-com/gl-infra/k8s-workloads/gitlab-com!1124 (merged)
  • Begin monitoring - See monitoring section below
  • Validate that no additional VMs have been added to our infrastructure: ./bin/get-weights gprd web
    • If any have been added, adjust the steps below to account for them
  • Shift traffic from VMs to Kubernetes gradually
    • Change 2.3% of traffic to GKE
    # Sets HAProxy to send GKE 2.3% of the traffic
    ./bin/set-weights gprd 'web-gke' 100
    date -u # Note this down to put on issue
    • Add comment to this issue noting when this weighting change was made, and Monitor for 60 minutes
    • Change 5% of traffic to GKE
    # Sets HAProxy to send GKE 5% of the traffic (via reweighting web VMs)
    ./bin/set-weights gprd 'web-[0,1,2,3,4]' 40
    date -u # Note this down to put on issue
    • Add comment to this issue noting when this weighting change was made, and Monitor for 60 minutes
    • Change 10% of traffic to GKE
    # Sets HAProxy to send GKE 10% of the traffic (via reweighting web VMs)
    ./bin/set-weights gprd 'web-[0,1,2,3,4]' 20
    date -u # Note this down to put on issue
    • Add comment to this issue noting when this weighting change was made, and Monitor for 60 minutes
    • Change 20% of traffic to GKE
    # Sets HAProxy to send GKE 20% of the traffic (via reweighting web VMs)
    ./bin/set-weights gprd 'web-[0,1,2,3,4]' 10
    date -u # Note this down to put on issue
    • Add comment to this issue noting when this weighting change was made, and Monitor for 60 minutes
    • Change 32% of traffic to GKE
    # Sets HAProxy to send GKE 32% of the traffic (via reweighting web VMs)
    ./bin/set-weights gprd 'web-[0,1,2,3,4]' 5
    date -u # Note this down to put on issue
    • Add comment to this issue noting when this weighting change was made, and Monitor for 60 minutes
    • Change 44% of traffic to GKE
    # Sets HAProxy to send GKE 44% of the traffic (via reweighting web VMs)
    ./bin/set-weights gprd 'web-[0,1,2,3,4]' 3
    date -u # Note this down to put on issue
    • Add comment to this issue noting when this weighting change was made, and Monitor for 60 minutes
    • Change 54% of traffic to GKE
    # Sets HAProxy to send GKE 54% of the traffic (via reweighting web VMs)
    ./bin/set-weights gprd 'web-[0,1,2,3,4]' 2
    date -u # Note this down to put on issue
    • Add comment to this issue noting when this weighting change was made, and Monitor for 60 minutes
    • Change 68% of traffic to GKE
    # Sets HAProxy to send GKE 68% of the traffic (via reweighting web VMs)
    ./bin/set-weights gprd 'web-[0,1,2,3,4]' 1
    date -u # Note this down to put on issue
    • Add comment to this issue noting when this weighting change was made, and Monitor for 60 minutes
    • Change 97% of traffic to GKE
    # Sets HAProxy to send GKE 97% of the traffic (via reweighting web VMs) **Note: canary will take the last 3% of traffic**
    ./bin/set-weights gprd 'web-[0,1,2,3,4]' 0
    date -u # Note this down to put on issue
    • Add comment to this issue noting when this weighting change was made, and Monitor for 60 minutes
    • Validate and Document the current state: ./bin/get-weights gprd web
      • Expected: all VMs have a weight of 0, the Kubernetes endpoint has a weight of 100, and the canary backend has a weight of 3
    • Once everything looks good, merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/488 so that the frontend (fe) nodes gradually remove the web VM nodes from rotation and keep only the Kubernetes backends
      • Roll out one node at a time with:
      bundle exec knife ssh 'roles:gprd-base-lb-fe' 'sudo chef-client' -C1
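
The weight schedule above can be sketched as a small dry-run helper. This is an illustration rather than part of the approved procedure: run, pause, step, and rollout are hypothetical names, and with DRY_RUN=1 (the default) the script only prints the ./bin/set-weights commands in order so the plan can be reviewed before it is run by hand.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the staged schedule. With DRY_RUN=1 (the default) it only
# prints the commands; nothing is sent to HAProxy.
DRY_RUN="${DRY_RUN:-1}"
VM_PATTERN='web-[0,1,2,3,4]'

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "DRY-RUN: $*"
  else
    "$@"
  fi
}

pause() {
  # Outside dry-run mode, block until the operator confirms the soak is done.
  if [ "$DRY_RUN" != "1" ]; then
    read -r -p "Comment on the issue, monitor for 60 minutes, then press Enter: " _
  fi
}

step() {
  run ./bin/set-weights gprd "$1" "$2"
  run date -u   # timestamp to record on the issue
  pause
}

rollout() {
  step 'web-gke' 100                 # first step raises the GKE weight
  for w in 40 20 10 5 3 2 1 0; do    # later steps lower the VM weights
    step "$VM_PATTERN" "$w"
  done
}

rollout
```

The pause between steps is deliberately manual: each weight change should only proceed after the monitoring window has passed without issues.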

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 5

  • Validate and Document the current state: ./bin/get-weights gprd web
    • Expected: all VMs have a weight of 0, the Kubernetes endpoint has a weight of 100, and the canary backend has a weight of 3

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 10

  • If https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/488 has been merged, create a rollback MR for it and apply
  • Set the weight of our VMs back to their default values: ./bin/set-weights gprd 'web-[0,1,2,3,4]' 100
  • Set the weight of our Kubernetes endpoint to 0: ./bin/set-weights gprd gke-web 0
  • Validate the current state: ./bin/get-weights gprd web
    • Expected: all VMs have a weight of 100, and the Kubernetes endpoint has a weight of 0
    • The Kubernetes endpoint for Canary will have a weight of 3 in the web backend

Monitoring

Key metrics to observe

Other dashboards to keep an eye on:

Zonal cluster pod info dashboards

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc?

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • This Change Issue is linked to the appropriate Issue and/or Epic
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • Release managers have been informed (if needed, for example for DB changes) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgement.)
  • There are currently no active incidents.
Edited by Graeme Gillies