Transition GKE API Endpoint to HAProxy Canary Stage

Production Change

Change Summary

I'd like to prep the canary stage with the GKE endpoint of our API service. Currently we cannot add a backend with a default weight of 0. This procedure adds the endpoint to our canary backends while ensuring that it takes in as little traffic as possible until we are ready to start sending traffic to it. This is a more involved configuration change that requires some management of our existing fleet of servers.

This will be a phased rollout as well. The API endpoint in Kubernetes will only take traffic for a short period of time, both for evaluation and to bolster evidence for our Service Discovery issue: gitlab-org/gitlab#271575 (closed)

We'll remove traffic to this new endpoint temporarily to prevent potential issues with the upcoming PG12 upgrade scheduled for May 8th. Then we will re-enable the traffic and ramp down our VMs such that all traffic is on Kubernetes.

Change Details

  1. Services Impacted - Service::HAProxy, Service::API
  2. Change Technician - @skarbek
  3. Change Criticality - C3
  4. Change Type - change::unscheduled
  5. Change Reviewer - @hphilipps
  6. Due Date - 2020-05-12
  7. Time tracking - 4 days
  8. Downtime Component - 0

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1 minute

  • Set label change::in-progress on this issue
  • Get server weights and document them for the purposes of our roll back procedure
    ./bin/get-weights gprd api
  • Validate no active deployments are occurring on the canary stage - ask a @release-manager
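
The documented weights are easier to compare during a rollback if they are saved to a file. A minimal sketch, assuming ./bin/get-weights prints its report to stdout; the function name, the GET_WEIGHTS override, and the file naming are hypothetical and not part of our tooling:

```shell
# Hypothetical helper: save the current weights to a timestamped file.
# Assumes ./bin/get-weights writes its report to stdout.
GET_WEIGHTS="${GET_WEIGHTS:-./bin/get-weights}"

snapshot_weights() {
  env_name="$1"      # environment, e.g. gprd
  service="$2"       # service, e.g. api
  out_dir="${3:-.}"  # directory that receives the snapshot
  out_file="${out_dir}/weights-${env_name}-${service}-$(date -u +%Y%m%dT%H%M%SZ).txt"
  "$GET_WEIGHTS" "$env_name" "$service" > "$out_file" || return 1
  # Print the path so it can be pasted into this issue.
  echo "$out_file"
}

# Usage: snapshot_weights gprd api
```

Attaching the resulting file to this issue keeps the pre-change state next to the rollback procedure.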

Change Steps - steps to take to execute the change

Adding the GKE endpoint

Estimated Time to Complete (mins) - 30 minutes

  • Drain canary to make this rollout simple
    /chatops run canary --disable --production
  • Merge and Apply: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5412
  • Converge
    knife ssh 'roles:gprd-base-lb-fe-config' 'sudo chef-client' -C 2
  • Set weight on canary_api backend GKE to 0
    ./bin/set-weights gprd gke-cny-api 0
  • Validate the state of weights to ensure that the GKE endpoint will not take traffic (its weight should be zero, while the VMs should continue to represent the values documented earlier in our Pre-Change Steps)
    ./bin/get-weights gprd api
  • Re-enable canary
    /chatops run canary --ready --production

Enable API to take traffic

Estimated Time to Complete (mins) - 20 minutes

  • Set weight on api backend GKE canary stage to match our VMs (this is less than 1% of all API traffic)
    ./bin/set-weights gprd gke-cny-api 20 api
  • Wait 10 minutes - use this time to view our Monitoring section looking for anything erroneous
  • Set weight on the canary_api backend GKE canary stage to match our VMs (this is 20% of all traffic that reaches the canary_api backend)
    ./bin/set-weights gprd gke-cny-api 1 canary_api
  • Validate the state of weights; the GKE endpoint should match our VMs
    ./bin/get-weights gprd api
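
The 20% figure for the canary_api backend follows from the weights: with four canary VMs (api-cny-01 through api-cny-04) and the GKE endpoint all at the same weight, each server receives one fifth of the traffic. A small sketch of that arithmetic, assuming weight-proportional balancing such as HAProxy's roundrobin algorithm; the traffic_share helper is hypothetical:

```shell
# Share of traffic one server receives: its weight divided by the sum of
# the weights of every server in the backend.
traffic_share() {
  target="$1"; shift
  # The remaining arguments are the weights of all servers in the
  # backend, including the target server itself.
  printf '%s\n' "$@" | awk -v w="$target" '
    { sum += $1 }
    END {
      if (sum > 0) printf "%.1f%%\n", 100 * w / sum
      else print "0.0%"
    }'
}

# Four canary VMs plus the GKE endpoint, all at weight 1:
traffic_share 1 1 1 1 1 1
```

This prints 20.0%. The same helper shows the effect of draining: traffic_share 0 0 1 1 1 1 prints 0.0% for the server at weight 0.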

PRIOR TO 2021-05-08

Estimated Time to Complete (mins) - 5 minutes

  • Set weight on both backends to 0
     ./bin/set-weights gprd gke-cny-api 0
  • Validate the state of weights; the GKE endpoint should be 0
    ./bin/get-weights gprd api

AFTER 2021-05-08

Estimated Time to Complete (mins) - 2 days

  • Set weight on api backend GKE canary stage to match our VMs
    ./bin/set-weights gprd gke-cny-api 20 api
  • Wait 10 minutes - use this time to view our Monitoring section looking for anything erroneous
  • Set weight on the canary_api backend GKE canary stage to match our VMs
    ./bin/set-weights gprd gke-cny-api 1 canary_api
  • Validate the state of weights; the GKE endpoint should match our VMs
    ./bin/get-weights gprd api
  • Wait 1 day - use this time to view our Monitoring section looking for anything erroneous
  • Set weight on api backend VMs canary stage to 0
     ./bin/set-weights gprd api-cny-01 0 api
     sleep 60
     ./bin/set-weights gprd api-cny-02 0 api
     sleep 60
     ./bin/set-weights gprd api-cny-03 0 api
     sleep 60
     ./bin/set-weights gprd api-cny-04 0 api
  • Wait 10 minutes - use this time to view our Monitoring section looking for anything erroneous
  • Set weight on the canary_api backend VMs canary stage to 0
     ./bin/set-weights gprd api-cny-01 0 canary_api
     sleep 60
     ./bin/set-weights gprd api-cny-02 0 canary_api
     sleep 60
     ./bin/set-weights gprd api-cny-03 0 canary_api
     sleep 60
     ./bin/set-weights gprd api-cny-04 0 canary_api
  • Execute QA: Click Run Pipeline: https://ops.gitlab.net/gitlab-org/quality/canary/-/pipelines
  • Wait for QA to finish; if there are any failures, begin the rollback procedure
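
The per-VM drain commands above repeat the same pattern for each backend, so they can be sketched as a loop. This is only a sketch under the assumption that ./bin/set-weights takes its arguments exactly as shown in the steps; the drain_canary_vms function and the SET_WEIGHTS and DRAIN_PAUSE variables are hypothetical:

```shell
# Hypothetical wrapper around the repeated set-weights commands.
SET_WEIGHTS="${SET_WEIGHTS:-./bin/set-weights}"
DRAIN_PAUSE="${DRAIN_PAUSE:-60}"   # seconds to wait between servers

drain_canary_vms() {
  backend="$1"   # api or canary_api
  first=1
  for vm in api-cny-01 api-cny-02 api-cny-03 api-cny-04; do
    # Pause between servers, but not before the first one.
    if [ "$first" -eq 0 ]; then sleep "$DRAIN_PAUSE"; fi
    first=0
    # Stop at the first failure so the remaining VMs keep their weight.
    "$SET_WEIGHTS" gprd "$vm" 0 "$backend" || return 1
  done
}

# Usage: drain_canary_vms api
```

Stopping on the first failed command leaves the remaining VMs serving traffic, which keeps the canary stage available while the failure is investigated.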

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 1 Minute

  • Validate the state of weights; the GKE endpoint should match those of our VMs
     ./bin/get-weights gprd api

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

  • Drain Canary
    /chatops run canary --disable --production
  • Configure the weights such that the GKE endpoint has a weight of 0 and our VMs match the documented state prior to this rollout
     ./bin/set-weights gprd api-cny-0 20 canary_api
     ./bin/set-weights gprd api-cny-0 1 api
  • Re-enable canary
    /chatops run canary --ready --production
  • Revert: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5412
  • Converge
    knife ssh 'roles:gprd-base-lb-fe-config' 'sudo chef-client' -C 2
  • Validate the state of weights; the values listed should match those documented during our Pre-Change Steps
    ./bin/get-weights gprd api
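
The final validation can be sketched as a diff against the pre-change record. This assumes the output of ./bin/get-weights from the Pre-Change Steps was saved to a file and that the output contains no timestamps or other values that change between runs; the weights_match_snapshot function and the GET_WEIGHTS override are hypothetical:

```shell
# Hypothetical check: compare the current weights with a saved record.
GET_WEIGHTS="${GET_WEIGHTS:-./bin/get-weights}"

weights_match_snapshot() {
  snapshot="$1"   # file saved during the Pre-Change Steps
  current="$(mktemp)" || return 2
  "$GET_WEIGHTS" gprd api > "$current" || { rm -f "$current"; return 2; }
  # diff exits 0 when the files are identical and 1 when they differ.
  diff -u "$snapshot" "$current"
  status=$?
  rm -f "$current"
  return "$status"
}

# Usage: weights_match_snapshot path/to/pre-change-weights.txt
```

An exit status of 0 means the rollback restored the documented state; any diff output shows which servers still differ.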

Monitoring

Key metrics to observe

Other Areas to monitor for awkward behavior:

Summary of infrastructure changes

  • Does this change introduce new compute instances? Yes
  • Does this change re-size any existing compute instances? No
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No

We are adding the GKE Endpoint for the API Service Canary Stage into HAProxy and sending traffic to it

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • There are currently no active incidents.
Edited by John Skarbek