
Deploy GKE enabled API service into production

Production Change

Status

Change Summary

This CR focuses on two things: enabling the API service deployed in Kubernetes to start seeing some traffic, and then, after all zones have converged, moving more traffic into Kubernetes. We'll perform these two actions as two distinct steps in this CR. Upon completion of this CR, the VMs will not be taking any traffic at all.

During the first phase, we'll modify one zone at a time. We'll add the GKE endpoint which, as Chef converges HAProxy, will begin to take 3% of the API traffic entering the zone being operated on. In between each zone operation, we'll observe our metrics, looking for anything suspicious. At the end of this phase, Kubernetes will be taking roughly 3% of all API traffic across all GKE clusters.

The second phase will focus on phasing traffic off of our VMs. We'll slowly pull the VMs out of rotation by modifying the traffic weights in HAProxy. All server weights will be modified because of the way HAProxy calculates weight: each server receives a share of traffic proportional to its weight relative to the backend total. As the share shifting onto Kubernetes grows, we'll slow the pace, ensuring that our clusters are scaling and handling the traffic as desired. At the end of this phase, the VMs should not be accepting any traffic from customers.
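
For reference, HAProxy sends each backend server a share of requests proportional to its weight, so the share hitting GKE is its weight divided by the sum of all weights in the backend. A minimal sketch of that arithmetic with made-up numbers (the real server counts and weights are different, and the canary backend is ignored for simplicity):

    # Illustrative only: assume 32 VM servers and a single GKE endpoint, all at weight 100.
    # GKE share = gke_weight / (sum of VM weights + gke_weight)
    echo 'scale=3; 100 / (32*100 + 100)' | bc   # ≈ .030 -> roughly 3% (first phase)
    echo 'scale=3; 100 / (32*5 + 100)' | bc     # ≈ .384 -> lowering VM weights shifts share to GKE (second phase)
    echo 'scale=3; 100 / (32*0 + 100)' | bc     # = 1.000 -> VMs at weight 0 take no traffic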

Reference:

Change Details

  1. Services Impacted - ServiceAPI
  2. Change Technician - @skarbek
  3. Change Criticality - C3
  4. Change Type - changescheduled
  5. Change Reviewer - @ggillies
  6. Due Date - 2021-05-13
  7. Time tracking - 2 working days
  8. Downtime Component - None Estimated

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1 minute

Change Steps - steps to take to execute the change

Initial Enablement

Estimated Time to Complete (mins) - 8 hours

  • Merge and Apply: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5526
  • Execute Chef in a controlled fashion on all LB nodes: knife ssh 'roles:gprd-base-lb-fe-config' 'sudo chef-client' -C 2
    • This step enables the cluster in zone b to start seeing roughly 3% of all API traffic in zone b
    • The GKE clusters in zones c and d will not yet see any traffic
  • Validate this new backend is in state UP: ./bin/get-server-state gprd api
  • Wait 1 hour - Utilize this time to observe our metrics - See monitoring section below
  • Merge and Apply: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5527
  • Execute Chef in a controlled fashion on all LB nodes: knife ssh 'roles:gprd-base-lb-fe-config' 'sudo chef-client' -C 2
    • This step enables the cluster in zone c to start seeing roughly 3% of all API traffic in zone c
    • The GKE cluster in zone d will not yet see any traffic
  • Validate this new backend is in state UP: ./bin/get-server-state gprd api
  • Wait 1 hour - Utilize this time to observe our metrics - See monitoring section below
  • Merge and Apply: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5528
  • Execute Chef in a controlled fashion on all LB nodes: knife ssh 'roles:gprd-base-lb-fe-config' 'sudo chef-client' -C 2
    • This step enables the cluster in zone d to start seeing roughly 3% of all API traffic in zone d
  • Validate this new backend is in state UP: ./bin/get-server-state gprd api
  • Wait 1 day - Utilize this time to observe our metrics - See monitoring section below
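
For the "Validate this new backend is in state UP" steps above, a minimal sketch of what that check can look like, assuming get-server-state prints one line per backend server with its name and state (the exact output format is an assumption):

    # Show only the GKE backend lines; proceed to the next zone only once
    # every LB reports the gke backend as UP.
    ./bin/get-server-state gprd api | grep -i gke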

Full Transition

Estimated Time to Complete (mins) - 3 hours

This step will transition all traffic off of the VMs in a slow fashion. During the sleep times, observe our metrics as defined below.
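
One low-effort way to keep an eye on the weights while the steps below run (a sketch reusing the existing helper; adjust the interval as needed):

    # Re-print the current backend weights every 60 seconds during the sleeps
    watch -n 60 ./bin/get-weights gprd api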

  • Begin monitoring - See monitoring section below
  • Validate we've not added any additional VMs into our infrastructure: ./bin/get-weights gprd api
    • If yes, modify the below steps to account for such
  • Merge and Apply: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5525
    • We'll let Chef converge slowly on its own for this change
  • Disable the VMs using weight modifications in a slow manner:
    # Sets HAProxy to send GKE 12% of the traffic
    ./bin/set-weights gprd 'api-[0,1,2,3]' 20
    sleep 600
    # Sets HAProxy to send GKE 22% of the traffic
    ./bin/set-weights gprd 'api-[0,1,2,3]' 10
    sleep 600
    # Sets HAProxy to send GKE 35% of the traffic
    ./bin/set-weights gprd 'api-[0,1,2,3]' 5
    sleep 600
    # Sets HAProxy to send GKE 57% of the traffic
    ./bin/set-weights gprd 'api-[0,1,2,3]' 2
    sleep 600
    # Sets HAProxy to send GKE 71% of the traffic
    ./bin/set-weights gprd 'api-[0,1,2,3]' 1
    sleep 600
    # Sets HAProxy to send GKE 95% of the traffic
    ./bin/set-weights gprd 'api-[0,1,2,3]' 0
    # The last 5% of the traffic is canary
  • Validate and Document the current state: ./bin/get-weights gprd api
    • It is expected that all VMs have a weight of 0, and that our Kubernetes endpoint will have a weight of 100

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 10 minutes

  • Validate the current state: ./bin/get-weights gprd api
    • It is expected that all VMs have a weight of 0, and that our Kubernetes endpoint will have a weight of 100
    • The Canary VMs will have a weight of 0, the Kubernetes endpoint for Canary will have a weight of 20 in the api backend, and 1 in the canary_api backend
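
If we want a scripted sanity check in addition to eyeballing the output, a sketch along these lines works, assuming (hypothetically) that get-weights prints one "<server> <weight>" pair per line:

    # Flag any api VM still carrying a non-zero weight (hypothetical output format)
    ./bin/get-weights gprd api | awk '/^api-/ && $2 != 0 {bad=1; print "unexpected non-zero weight:", $0} END {exit bad}'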

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 5 minutes

  • Set the weight of our VMs back to their default values: ./bin/set-weights gprd api- 100
  • Set the weight of our Canary VMs back to their pre-change values: ./bin/set-weights gprd api-cny-0 20 api
  • Set the weight of our Kubernetes endpoint to 0: ./bin/set-weights gprd gke-api 0
  • Validate the current state: ./bin/get-weights gprd api
    • It is expected that all VMs have a weight of 100, and that our Kubernetes endpoint will have a weight of 0
    • The Canary VMs will have a weight of 0, the Kubernetes endpoint for Canary will have a weight of 20 in the api backend, and 1 in the canary_api backend
  • Determine whether the cause of the rollback is catastrophic enough that we must also revert the Chef changes
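
If rollback is needed in a hurry, the weight changes can be pasted as a single block (these are the same commands as the steps above, followed by the validation):

    ./bin/set-weights gprd api- 100
    ./bin/set-weights gprd api-cny-0 20 api
    ./bin/set-weights gprd gke-api 0
    ./bin/get-weights gprd api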

Revert MRs

If we've reached a catastrophic stopping point where this CR is unable to continue, revert the following Merge Requests if they were applied.

Monitoring

Key metrics to observe

Other Fun Dashboards

Summary of infrastructure changes

  • Does this change introduce new compute instances? Yes
  • Does this change re-size any existing compute instances? No
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No

This will introduce Kubernetes-run infrastructure for our /api endpoint.

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • There are currently no active incidents.