
Deploy GKE enabled API service into production

Production Change

Status

Change Summary

This CR focuses on two things: enabling the API service deployed in Kubernetes to start seeing some traffic, and then, after all zones have converged, moving more traffic into Kubernetes. We'll perform these two actions as two distinct steps in this CR. Upon completion of this CR, the VMs will not be taking any traffic at all.

During the first phase, we'll modify one zone at a time. We'll add the GKE endpoint which, as Chef converges HAProxy, will begin to take 3% of the API traffic entering the zone being operated on. In between each zone operation, we'll observe our metrics, looking for anything suspicious. At the end of this phase, Kubernetes will be taking roughly 3% of all API traffic across all GKE clusters.

The second phase will focus on phasing traffic off of our VMs. We'll slowly pull the VMs out of rotation by modifying the traffic weights in HAProxy. All server weights will be modified because of the way HAProxy calculates weight: each server receives a share of traffic proportional to its weight relative to the backend total. As the share shifting onto Kubernetes grows, we'll slow the pace, ensuring that our clusters are scaling and handling the traffic as desired. At the end of this phase, the VMs should not be accepting any traffic from customers.
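
For reference, HAProxy sends each backend server a share of requests proportional to its weight, so the share hitting GKE is its weight divided by the sum of all weights in the backend. A minimal sketch of that arithmetic with made-up numbers (the real server counts and weights are different, and the canary backend is ignored for simplicity):

    # Illustrative only: assume 32 VM servers and a single GKE endpoint, all at weight 100.
    # GKE share = gke_weight / (sum of VM weights + gke_weight)
    echo 'scale=3; 100 / (32*100 + 100)' | bc   # ≈ .030 -> roughly 3% (first phase)
    echo 'scale=3; 100 / (32*5 + 100)' | bc     # ≈ .384 -> lowering VM weights shifts share to GKE (second phase)
    echo 'scale=3; 100 / (32*0 + 100)' | bc     # = 1.000 -> VMs at weight 0 take no traffic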

Reference:

Change Details

  1. Services Impacted - ServiceAPI
  2. Change Technician - @skarbek
  3. Change Criticality - C3
  4. Change Type - changescheduled
  5. Change Reviewer - @ggillies
  6. Due Date - 2021-05-13
  7. Time tracking - 2 working days
  8. Downtime Component - None Estimated

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1 minute

Change Steps - steps to take to execute the change

Initial Enablement

Estimated Time to Complete (mins) - 8 hours

  • Merge and Apply: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5526
  • Execute Chef in a controlled fashion on all LB nodes: knife ssh 'roles:gprd-base-lb-fe-config' 'sudo chef-client' -C 2
    • This step enables the cluster in zone b to start seeing roughly 3% of all API traffic in zone b
    • The GKE clusters in zones c and d will not yet see any traffic
  • Validate this new backend is in state UP: ./bin/get-server-state gprd api
  • Wait 1 hour - Utilize this time to observe our metrics - See monitoring section below
  • Merge and Apply: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5527
  • Execute Chef in a controlled fashion on all LB nodes: knife ssh 'roles:gprd-base-lb-fe-config' 'sudo chef-client' -C 2
    • This step enables the cluster in zone c to start seeing roughly 3% of all API traffic in zone c
    • The GKE cluster in zone d will not yet see any traffic
  • Validate this new backend is in state UP: ./bin/get-server-state gprd api
  • Wait 1 hour - Utilize this time to observe our metrics - See monitoring section below
  • Merge and Apply: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5528
  • Execute Chef in a controlled fashion on all LB nodes: knife ssh 'roles:gprd-base-lb-fe-config' 'sudo chef-client' -C 2
    • This step enables the cluster in zone d to start seeing roughly 3% of all API traffic in zone d
  • Validate this new backend is in state UP: ./bin/get-server-state gprd api
  • Wait 1 day - Utilize this time to observe our metrics - See monitoring section below
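
For the "Validate this new backend is in state UP" steps above, a minimal sketch of what that check can look like, assuming get-server-state prints one line per backend server with its name and state (the exact output format is an assumption):

    # Show only the GKE backend lines; proceed to the next zone only once
    # every LB reports the gke backend as UP.
    ./bin/get-server-state gprd api | grep -i gke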

Full Transition

Estimated Time to Complete (mins) - 3 hours

This step will transition all traffic off of the VMs in a slow fashion. During the sleep times, observe our metrics as defined below.
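
One low-effort way to keep an eye on the weights while the steps below run (a sketch reusing the existing helper; adjust the interval as needed):

    # Re-print the current backend weights every 60 seconds during the sleeps
    watch -n 60 ./bin/get-weights gprd api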

  • Begin monitoring - See monitoring section below
  • Validate we've not added any additional VMs into our infrastructure: ./bin/get-weights gprd api
    • If yes, modify the below steps to account for such
  • Merge and Apply: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5525
    • We'll let Chef converge slowly on its own for this change
  • Disable the VMs using weight modifications in a slow manner:
    # Sets HAProxy to send GKE 12% of the traffic
    ./bin/set-weights gprd 'api-[0,1,2,3]' 20
    sleep 600
    # Sets HAProxy to send GKE 22% of the traffic
    ./bin/set-weights gprd 'api-[0,1,2,3]' 10
    sleep 600
    # Sets HAProxy to send GKE 35% of the traffic
    ./bin/set-weights gprd 'api-[0,1,2,3]' 5
    sleep 600
    # Sets HAProxy to send GKE 57% of the traffic
    ./bin/set-weights gprd 'api-[0,1,2,3]' 2
    sleep 600
    # Sets HAProxy to send GKE 71% of the traffic
    ./bin/set-weights gprd 'api-[0,1,2,3]' 1
    sleep 600
    # Sets HAProxy to send GKE 95% of the traffic
    ./bin/set-weights gprd 'api-[0,1,2,3]' 0
    # The last 5% of the traffic is canary
  • Validate and Document the current state: ./bin/get-weights gprd api
    • It is expected that all VMs have a weight of 0, and that our Kubernetes endpoint will have a weight of 100

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 10 minutes

  • Validate the current state: ./bin/get-weights gprd api
    • It is expected that all VMs have a weight of 0, and that our Kubernetes endpoint will have a weight of 100
    • The Canary VMs will have a weight of 0, the Kubernetes endpoint for Canary will have a weight of 20 in the api backend, and 1 in the canary_api backend
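
If we want a scripted sanity check in addition to eyeballing the output, a sketch along these lines works, assuming (hypothetically) that get-weights prints one "<server> <weight>" pair per line:

    # Flag any api VM still carrying a non-zero weight (hypothetical output format)
    ./bin/get-weights gprd api | awk '/^api-/ && $2 != 0 {bad=1; print "unexpected non-zero weight:", $0} END {exit bad}'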

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 5 minutes

  • Set the weight of our VMs back to their default values: ./bin/set-weights gprd api- 100
  • Set the weight of our Canary VMs back to their pre-change values: ./bin/set-weights gprd api-cny-0 20 api
  • Set the weight of our Kubernetes endpoint to 0: ./bin/set-weights gprd gke-api 0
  • Validate the current state: ./bin/get-weights gprd api
    • It is expected that all VMs have a weight of 100, and that our Kubernetes endpoint will have a weight of 0
    • The Canary VMs will have a weight of 0, the Kubernetes endpoint for Canary will have a weight of 20 in the api backend, and 1 in the canary_api backend
  • Determine whether the cause of the rollback is catastrophic enough that we must also revert the Chef changes
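
If rollback is needed in a hurry, the weight changes can be pasted as a single block (these are the same commands as the steps above, followed by the validation):

    ./bin/set-weights gprd api- 100
    ./bin/set-weights gprd api-cny-0 20 api
    ./bin/set-weights gprd gke-api 0
    ./bin/get-weights gprd api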

Revert MRs

If we've reached a catastrophic stopping point where this CR is unable to continue, revert the following Merge Requests if they were applied.

Monitoring

Key metrics to observe

Other Fun Dashboards

Summary of infrastructure changes

  • Does this change introduce new compute instances? Yes
  • Does this change re-size any existing compute instances? No
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No

This will introduce Kubernetes-run infrastructure for our /api endpoint.

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • There are currently no active incidents.