Transition GKE API Endpoint to HAProxy Canary Stage

Production Change

Change Summary

I'd like to prep the canary stage with the GKE endpoint of our API service. Currently we cannot add a backend with a default weight of 0. This procedure adds the endpoint to our canary backends while ensuring that it takes in as little traffic as possible until we are ready to start sending traffic. This is a more involved configuration change that requires a bit of management of our existing fleet of servers.

This will be a phased rollout as well. The API endpoint in Kubernetes will only take traffic for a short period of time, both for evaluation and to bolster evidence for our Service Discovery issue: gitlab-org/gitlab#271575 (closed)

We'll remove traffic to this new endpoint temporarily to prevent potential issues with the upcoming PG12 upgrade scheduled for May 8th. Then we will re-enable the traffic and ramp down our VMs such that all traffic is on Kubernetes.

Change Details

Services Impacted - Service::HAProxy Service::API
Change Technician - @skarbek
Change Criticality - C3
Change Type - change::unscheduled
Change Reviewer - @hphilipps
Due Date - 2020-05-12
Time tracking - 4 days
Downtime Component - 0

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1 minute

Set label change::in-progress on this issue
Get the server weights and document them for our rollback procedure
```
./bin/get-weights gprd api
```
Validate no active deployments are occurring on the canary stage - ask a @release-manager
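
One way to document the weights is to save the command output to a file that can be compared against later. The following is a minimal sketch, not part of the existing tooling: the `snapshot_weights` function and the `weights-before.txt` file name are hypothetical, and it assumes that `./bin/get-weights` prints the weights to standard output.

```shell
# Hypothetical helper, not part of the existing tooling.
# Usage: snapshot_weights <output-file> <command> [args...]
snapshot_weights() {
  out_file="$1"
  shift
  # Run the given command and save its output; stop if the command fails.
  "$@" > "$out_file" || return 1
  # Print the saved copy so it can also be pasted into this issue.
  cat "$out_file"
}
```

For example: `snapshot_weights weights-before.txt ./bin/get-weights gprd api`.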

Change Steps - steps to take to execute the change

Adding the GKE endpoint

Estimated Time to Complete (mins) - 30 minutes

Drain canary to make this rollout simple

/chatops run canary --disable --production

Merge and Apply: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5412

Converge

knife ssh 'roles:gprd-base-lb-fe-config' 'sudo chef-client' -C 2

Set weight on canary_api backend GKE to 0
```
./bin/set-weights gprd gke-cny-api 0
```
Validate the state of weights to ensure that the GKE endpoint will not take traffic (its weight should be zero, while the VMs should continue to show the values documented earlier in our Pre-Change Steps)
```
 ./bin/get-weights gprd api
```

Re-enable canary

/chatops run canary --ready --production

Enable API to take traffic

Estimated Time to Complete (mins) - 20 minutes

Set the weight of the GKE canary endpoint on the api backend to match our VMs (this is less than 1% of all API traffic)
```
 ./bin/set-weights gprd gke-cny-api 20 api
```
Wait 10 minutes - use this time to check the metrics in our Monitoring section for anything anomalous
Set the weight of the GKE canary endpoint on the canary_api backend to match our VMs (this is 20% of all traffic that reaches the canary_api backend)
```
./bin/set-weights gprd gke-cny-api 1 canary_api
```
Validate the state of weights; the GKE endpoint should match our VMs
```
 ./bin/get-weights gprd api
```

PRIOR TO 2021-05-08

Estimated Time to Complete (mins) - 5 minutes

Set weight on both backends to 0
```
 ./bin/set-weights gprd gke-cny-api 0
```
Validate the state of weights; the GKE endpoint's weight should be 0
```
 ./bin/get-weights gprd api
```

AFTER 2021-05-08

Estimated Time to Complete (mins) - 2 days

Set the weight of the GKE canary endpoint on the api backend to match our VMs
```
 ./bin/set-weights gprd gke-cny-api 20 api
```
Wait 10 minutes - use this time to check the metrics in our Monitoring section for anything anomalous
Set the weight of the GKE canary endpoint on the canary_api backend to match our VMs
```
./bin/set-weights gprd gke-cny-api 1 canary_api
```
Validate the state of weights; the GKE endpoint should match our VMs
```
 ./bin/get-weights gprd api
```
Wait 1 day - use this time to check the metrics in our Monitoring section for anything anomalous

Set the weight of the canary stage VMs on the api backend to 0

 ./bin/set-weights gprd api-cny-01 0 api
 sleep 60
 ./bin/set-weights gprd api-cny-02 0 api
 sleep 60
 ./bin/set-weights gprd api-cny-03 0 api
 sleep 60
 ./bin/set-weights gprd api-cny-04 0 api

Wait 10 minutes - use this time to check the metrics in our Monitoring section for anything anomalous

Set the weight of the canary stage VMs on the canary_api backend to 0

 ./bin/set-weights gprd api-cny-01 0 canary_api
 sleep 60
 ./bin/set-weights gprd api-cny-02 0 canary_api
 sleep 60
 ./bin/set-weights gprd api-cny-03 0 canary_api
 sleep 60
 ./bin/set-weights gprd api-cny-04 0 canary_api
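
The two staggered drains above follow the same pattern, so they could be expressed as a loop. This is a sketch only: the `drain_canary_vms` function and the `SET_WEIGHTS` and `SLEEP_SECS` variables are hypothetical names, while the server names, the `gprd` environment, and the 60-second pause are taken from the steps above.

```shell
# Hypothetical wrapper around the set-weights script used in this issue.
SET_WEIGHTS="${SET_WEIGHTS:-./bin/set-weights}"
SLEEP_SECS="${SLEEP_SECS:-60}"

# Usage: drain_canary_vms <backend>   (backend is api or canary_api)
drain_canary_vms() {
  backend="$1"
  first=1
  for server in api-cny-01 api-cny-02 api-cny-03 api-cny-04; do
    # Pause between servers (not before the first) so traffic shifts gradually.
    if [ "$first" -eq 0 ]; then
      sleep "$SLEEP_SECS"
    fi
    first=0
    # Stop at the first failure so the remaining VMs keep their weights.
    "$SET_WEIGHTS" gprd "$server" 0 "$backend" || return 1
  done
}
```

Run `drain_canary_vms api` first and, after the 10-minute observation window, `drain_canary_vms canary_api`.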

Execute QA by clicking Run Pipeline: https://ops.gitlab.net/gitlab-org/quality/canary/-/pipelines
Wait for QA to finish; if there are any failures, begin the rollback procedure

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 1 Minute

Validate the state of weights; the GKE endpoint's weights should match those of our VMs
```
 ./bin/get-weights gprd api
```

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

Drain Canary

/chatops run canary --disable --production

Configure the weights such that the GKE endpoint has a weight of 0 and our VMs match the documented state prior to this rollout
```
 ./bin/set-weights gprd api-cny-0 20 canary_api
 ./bin/set-weights gprd api-cny-0 1 api
```

Re-enable canary

/chatops run canary --ready --production

Revert: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5412

Converge

knife ssh 'roles:gprd-base-lb-fe-config' 'sudo chef-client' -C 2

Validate the state of weights; the values listed should match what was documented during our Pre-Change Steps
```
 ./bin/get-weights gprd api
```
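
If the Pre-Change weights were saved to a file, this validation could be done with `diff` rather than by eye. The following is a minimal sketch: the `weights_match_snapshot` function and the snapshot file name are hypothetical, and it assumes that the `get-weights` output contains only values that are stable between runs.

```shell
# Hypothetical helper, not part of the existing tooling.
# Usage: weights_match_snapshot <snapshot-file> <command> [args...]
weights_match_snapshot() {
  snapshot="$1"
  shift
  # diff reads the live output on stdin ("-") and exits 0 only when it is
  # identical to the snapshot; any differing lines are printed for review.
  "$@" | diff "$snapshot" -
}
```

For example: `weights_match_snapshot weights-before.txt ./bin/get-weights gprd api && echo "weights restored"`.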

Monitoring

Key metrics to observe

Metric: Aggregated Apdex/Aggregated Error Ratio/Puma SLI Apdex/Workhorse SLI Apdex
- Location: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=cny
- What changes to this metric should prompt a rollback: Violation of the SLO/SLI

Other areas to monitor for anomalous behavior:

Metric: Excessive Error Logging
- Location: https://log.gprd.gitlab.net/goto/652d82648e9aaa51e20f8a99b8596a3b
- What changes to this metric should prompt a rollback: Engineer's best judgement

Summary of infrastructure changes

Does this change introduce new compute instances? Yes
Does this change re-size any existing compute instances? No
Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc? No

We are adding the GKE Endpoint for the API Service Canary Stage into HAProxy and sending traffic to it

Changes checklist

This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
This issue has the change technician as the assignee.
Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
Necessary approvals have been completed based on the Change Management Workflow.
Change has been tested in staging and results noted in a comment on this issue.
A dry-run has been conducted and results noted in a comment on this issue.
SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
There are currently no active incidents.

Edited May 10, 2021 by John Skarbek