Transition GKE API Endpoint to HAProxy Canary Stage
Production Change
Change Summary
I'd like to prep the canary stage with the GKE endpoint of our API service. Currently we cannot add a backend with a default weight of 0. This procedure adds the endpoint to our canary backends while ensuring that it takes in as little traffic as possible until we are ready to start sending traffic. This is a more involved configuration change that requires a bit of management of our existing fleet of servers.
This will be a phased rollout as well. The API endpoint in Kubernetes will only take traffic for a short period of time both for evaluation, as well as bolstering evidence for our Service Discovery issue: gitlab-org/gitlab#271575 (closed)
We'll remove traffic to this new endpoint temporarily to prevent potential issues with the upcoming PG12 upgrade scheduled for May 8th. Then we will re-enable the traffic and ramp down our VMs such that all traffic is on Kubernetes.
Change Details
- Services Impacted - ServiceHAProxy ServiceAPI
- Change Technician - @skarbek
- Change Criticality - C3
- Change Type - changeunscheduled
- Change Reviewer - @hphilipps
- Due Date - 2020-05-12
- Time tracking - 4 days
- Downtime Component - 0
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 1 minute
- Set label changein-progress on this issue
- Get server weights and document them for the purposes of our rollback procedure: ./bin/get-weights gprd api
- Validate no active deployments are occurring on the canary stage - ask a @release-manager
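Documenting the weights is easier to roll back from when the output is kept in a file. A minimal sketch, assuming a POSIX shell; the `snapshot_weights` helper and the `weights-before.txt` file name are hypothetical and not part of the existing tooling:

```shell
# snapshot_weights FILE COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND, stores its output in FILE, then prints it.
# Returns non-zero if COMMAND fails, so a failed run is not mistaken
# for a valid snapshot.
snapshot_weights() {
  snapshot_file=$1
  shift
  "$@" > "$snapshot_file" || return 1
  cat "$snapshot_file"
}

# Example, run from the directory that contains ./bin/get-weights:
#   snapshot_weights weights-before.txt ./bin/get-weights gprd api
```

Attaching the saved file to this issue keeps the pre-change state next to the rollback steps.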
Change Steps - steps to take to execute the change
Adding the GKE endpoint
Estimated Time to Complete (mins) - 30 minutes
- Drain canary to make this rollout simple: /chatops run canary --disable --production
- Merge and Apply: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5412
- Converge: knife ssh 'roles:gprd-base-lb-fe-config' 'sudo chef-client' -C 2
- Set weight on canary_api backend GKE to 0: ./bin/set-weights gprd gke-cny-api 0
- Validate the state of weights to ensure that the GKE endpoint will not take traffic (its weight should be zero, while the VMs should continue to represent the values documented earlier in our Pre-Change Steps): ./bin/get-weights gprd api
- Re-enable canary: /chatops run canary --ready --production
Enable API to take traffic
Estimated Time to Complete (mins) - 20 minutes
- Set weight on the api backend GKE canary stage to match our VMs (this is less than 1% of all API traffic): ./bin/set-weights gprd gke-cny-api 20 api
- Wait 10 minutes - use this time to view our Monitoring section, looking for anything erroneous
- Set weight on the canary_api backend GKE canary stage to match our VMs (this is 20% of all traffic that reaches the canary_api backend): ./bin/set-weights gprd gke-cny-api 1 canary_api
- Validate the state of weights, the GKE endpoint should match our VMs: ./bin/get-weights gprd api
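The 20% figure follows from how weights translate into traffic: a server's share is roughly its weight divided by the sum of the weights of all active servers in the backend. A small sketch of that arithmetic, assuming traffic is split in proportion to weight and that the canary_api backend holds the four canary VMs plus the GKE endpoint at equal weight; the `traffic_share` function is hypothetical and only illustrates the calculation:

```shell
# traffic_share TARGET_WEIGHT [OTHER_WEIGHT...]
# Prints the percentage of requests the first server would receive,
# assuming requests are distributed in proportion to weight.
traffic_share() {
  target=$1
  shift
  total=$target
  for w in "$@"; do
    total=$((total + w))
  done
  # A backend whose weights are all 0 sends nothing to the target.
  if [ "$total" -eq 0 ]; then
    echo "0.00"
    return 0
  fi
  awk -v t="$target" -v s="$total" 'BEGIN { printf "%.2f\n", 100 * t / s }'
}

# GKE endpoint plus four canary VMs, all at the same weight.
traffic_share 1 1 1 1 1   # prints 20.00
```

The same calculation applies to the api backend, where the share also depends on the weights of the main fleet as reported by ./bin/get-weights.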
PRIOR TO 2021-05-08
Estimated Time to Complete (mins) - 5 minutes
- Set weight on both backends to 0: ./bin/set-weights gprd gke-cny-api 0
- Validate the state of weights, the GKE endpoint should be 0: ./bin/get-weights gprd api
AFTER 2021-05-08
Estimated Time to Complete (mins) - 2 days
- Set weight on the api backend GKE canary stage to match our VMs: ./bin/set-weights gprd gke-cny-api 20 api
- Wait 10 minutes - use this time to view our Monitoring section, looking for anything erroneous
- Set weight on the canary_api backend GKE canary stage to match our VMs: ./bin/set-weights gprd gke-cny-api 1 canary_api
- Validate the state of weights, the GKE endpoint should match our VMs: ./bin/get-weights gprd api
- Wait 1 day - use this time to view our Monitoring section, looking for anything erroneous
- Set weight on the api backend VMs canary stage to 0:
  ./bin/set-weights gprd api-cny-01 0 api
  sleep 60
  ./bin/set-weights gprd api-cny-02 0 api
  sleep 60
  ./bin/set-weights gprd api-cny-03 0 api
  sleep 60
  ./bin/set-weights gprd api-cny-04 0 api
- Wait 10 minutes - use this time to view our Monitoring section, looking for anything erroneous
- Set weight on the canary_api backend VMs canary stage to 0:
  ./bin/set-weights gprd api-cny-01 0 canary_api
  sleep 60
  ./bin/set-weights gprd api-cny-02 0 canary_api
  sleep 60
  ./bin/set-weights gprd api-cny-03 0 canary_api
  sleep 60
  ./bin/set-weights gprd api-cny-04 0 canary_api
- Execute QA: Click Run Pipeline: https://ops.gitlab.net/gitlab-org/quality/canary/-/pipelines
- Wait for QA to finish; if there are any failures, begin the rollback procedure
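The staggered drain of the four canary VMs can also be expressed as a loop, which avoids copy-paste slips in the hostnames. A sketch under stated assumptions: the `drain_vms` function and the `SET_WEIGHTS` and `PAUSE` override variables are hypothetical, and the defaults mirror the commands and the 60 second pause listed in the steps above:

```shell
# drain_vms BACKEND
# Sets each canary VM to weight 0 in BACKEND, pausing between hosts.
# SET_WEIGHTS and PAUSE are hypothetical overrides for dry runs.
drain_vms() {
  backend=$1
  first=1
  for host in api-cny-01 api-cny-02 api-cny-03 api-cny-04; do
    # Pause between hosts, but not before the first one.
    if [ "$first" -eq 0 ]; then
      sleep "${PAUSE:-60}"
    fi
    first=0
    # Stop at the first failure so a partial drain is noticed.
    "${SET_WEIGHTS:-./bin/set-weights}" gprd "$host" 0 "$backend" || return 1
  done
}

# Example:
#   drain_vms api
#   drain_vms canary_api
```

Setting SET_WEIGHTS=echo and PAUSE=0 prints the arguments that would be passed without touching HAProxy, which is a cheap way to review the sequence before running it.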
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 1 minute
- Validate the state of weights, the GKE endpoint should match those of our VMs: ./bin/get-weights gprd api
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Drain Canary: /chatops run canary --disable --production
- Configure the weights such that the GKE endpoint has a weight of 0 and our VMs match the documented state prior to this rollout:
  ./bin/set-weights gprd api-cny-0 20 canary_api
  ./bin/set-weights gprd api-cny-0 1 api
- Re-enable canary: /chatops run canary --ready --production
- Revert: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5412
- Converge: knife ssh 'roles:gprd-base-lb-fe-config' 'sudo chef-client' -C 2
- Validate the state of weights, the values listed should match those documented during our Pre-Change Steps: ./bin/get-weights gprd api
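The final validation can be made mechanical by diffing the current weights against the copy documented before the change. A minimal sketch, assuming the pre-change output was saved to a file; the `weights_match` helper and the `weights-before.txt` name are hypothetical:

```shell
# weights_match SNAPSHOT COMMAND [ARGS...]
# Hypothetical helper: compares the output of COMMAND with the contents
# of SNAPSHOT. Prints a unified diff and returns non-zero when they differ.
weights_match() {
  snapshot_file=$1
  shift
  "$@" | diff -u "$snapshot_file" -
}

# Example, run from the directory that contains ./bin/get-weights:
#   weights_match weights-before.txt ./bin/get-weights gprd api
```

An empty diff means the fleet is back to the documented state. Any remaining lines show which servers still differ, though differences in output that are unrelated to weights would also show up here.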
Monitoring
Key metrics to observe
- Metric: Aggregated Apdex/Aggregated Error Ratio/Puma SLI Apdex/Workhorse SLI Apdex
- Location: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=cny
- What changes to this metric should prompt a rollback: Violation of the SLO/SLI
Other Areas to monitor for awkward behavior:
- https://dashboards.gitlab.net/d/api-pod/api-pod-info?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-cluster=gprd-gitlab-gke&var-stage=cny&var-namespace=gitlab-cny&var-Node=All&var-Deployment=gitlab-(cny-)%3Fwebservice-api
- https://dashboards.gitlab.net/d/api-kube-containers/api-kube-containers-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=cny
- https://dashboards.gitlab.net/d/api-kube-deployments/api-kube-deployment-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=cny
- Metric: Excessive Error Logging
- Location: https://log.gprd.gitlab.net/goto/652d82648e9aaa51e20f8a99b8596a3b
- What changes to this metric should prompt a rollback: Engineer's best judgement
Summary of infrastructure changes
- Does this change introduce new compute instances? Yes
- Does this change re-size any existing compute instances? No
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No
We are adding the GKE Endpoint for the API Service Canary Stage into HAProxy and sending traffic to it
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.