# [gstg] HAProxy Zonal Outage Game Day
Production Change - Criticality 2

## Change Summary
This production issue is used for Gamedays as well as for recovery in the event of a zonal outage. It outlines the steps to follow when testing traffic shifts due to zonal outages. Corrective actions from this testing should help us build new steps to take during a real outage.
## Gameday execution roles and details
| Role | Assignee |
|---|---|
| Change Technician | @thisisshreya |
| Change Technician II | @cmcfarland |
- Services Impacted - TBD
- Time tracking - 90 minutes
- Downtime Component - 30 minutes
_Provide a brief summary indicating the affected zone._
## [For Gamedays only] Preparation Tasks
- [ ] One week before the gameday, make an announcement in the Slack #production_engineering channel. Example message:

  > Next week during the [Reliability discussions and firedrills](https://calendar.google.com/calendar/u/0/r/day/2023/12/20) meeting, we will be executing our quarterly gameday. The process will involve moving traffic away from a single zone in gstg. This should take approximately 90 minutes. The aim of this exercise is to test our disaster recovery capabilities and measure whether we are still within the RTO & RPO targets set by the [DR working group](https://handbook.gitlab.com/handbook/company/working-groups/disaster-recovery/) for GitLab.com. See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17274
- [ ] Notify the release managers on Slack by mentioning @release-managers and referencing this issue, and await their acknowledgment.
- [ ] Notify the EOC on Slack by mentioning @sre-oncall and referencing this issue, and wait for approval by adding the ~eoc_approved label. Example message:

  > @release-managers or @sre-oncall LINK_TO_THIS_CR is scheduled for execution. We will be disabling canary, and we will also be taking a single Gitaly node offline for approximately 30 minutes so as to avoid data loss. This will only affect projects that are on that Gitaly VM; it won't be used for new projects.
- [ ] Post a similar message to the #test-platform channel on Slack.
## Detailed steps for the change

### Change Steps - steps to take to execute the change

#### Execution
- [ ] Set label ~change::in-progress: `/label ~change::in-progress`
- [ ] Remove the HAProxy instances from the GCP load balancers:

  ```shell
  for i in $(gcloud --project=gitlab-staging-1 compute target-pools list \
      --filter="instances ~ us-east1-d.*haproxy" --format="value(name)"); do
    for j in $(gcloud --project=gitlab-staging-1 compute target-pools describe "${i}" \
        --region=us-east1 --format="value(instances)" \
        | sed -E 's/;/\n/g' | awk '/us-east1-d/{print}'); do
      gcloud --project=gitlab-staging-1 compute target-pools remove-instances "${i}" \
        --region=us-east1 \
        --instances="${j}" \
        --instances-zone=us-east1-d
    done
  done
  ```
- [ ] Drain the canary environment by running the following command in the Slack #production channel: `/chatops run canary --disable --staging`

  ❗ ❗ NOTE: ❗ ❗ If there are ongoing deployments, you need to confirm with the release manager whether you can ignore deployment checks by using the flag `--ignore-deployment-check`.

  Ref: GitLab Chatops
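The inner `sed`/`awk` pipeline in the removal loop above can be sanity-checked offline against a sample of the semicolon-separated instance list that `target-pools describe --format="value(instances)"` returns. This is a sketch: the instance URLs below are made up for illustration, not real gstg hosts.

```shell
# Illustrative sample of the semicolon-separated instance list returned by
# `gcloud compute target-pools describe --format="value(instances)"`.
# The instance names are hypothetical.
sample='https://www.googleapis.com/compute/v1/projects/gitlab-staging-1/zones/us-east1-b/instances/haproxy-main-01;https://www.googleapis.com/compute/v1/projects/gitlab-staging-1/zones/us-east1-d/instances/haproxy-main-02'

# Split on ';' (one instance URL per line) and keep only instances
# in the affected zone (us-east1-d).
echo "${sample}" | sed -E 's/;/\n/g' | awk '/us-east1-d/{print}'
```

Only the us-east1-d instance should be printed; instances in the surviving zones must not match.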
#### Validation
Once traffic is restricted to our two remaining zones, let's identify the impact and look for problems.

- [ ] Do we see a drop in CPU usage in one zone cluster? See GSTG Per Cluster CPU Usage.
- [ ] Do we see a drop in HPA targets in one zone cluster? See GSTG Per Cluster HPA Target.
- [ ] Examine GSTG Rails logs for errors.
- [ ] Examine the frontend dashboard for GSTG.
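In addition to the dashboards, the same pipeline used in the execution step can serve as a spot check that the affected zone no longer appears in a pool's instance list; an empty result means the drain took effect. This is a sketch using illustrative post-drain sample data, not a query against live pools.

```shell
# Illustrative post-drain instance list (hypothetical names): only
# us-east1-b and us-east1-c instances remain in the pool.
after='https://www.googleapis.com/compute/v1/projects/gitlab-staging-1/zones/us-east1-b/instances/haproxy-main-01;https://www.googleapis.com/compute/v1/projects/gitlab-staging-1/zones/us-east1-c/instances/haproxy-main-03'

# An empty result for the affected zone means the drain took effect.
remaining=$(echo "${after}" | sed -E 's/;/\n/g' | awk '/us-east1-d/{print}')
if [ -z "${remaining}" ]; then
  echo "drain confirmed: no us-east1-d instances in pool"
else
  echo "still in rotation: ${remaining}"
fi
```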
#### Wrapping up
- [ ] Re-enable canary: in the #production Slack channel, re-enable the canary fleet by running `/chatops run canary --enable --staging`
- [ ] Re-enable HAProxy with Terraform
- [ ] Notify @release-managers and @sre-oncall that the exercise is complete.
- [ ] Set label ~change::complete: `/label ~change::complete`
## Rollback

### Rollback steps - steps to be taken in the event of a need to roll back this change

It is estimated that this will take 5m to complete.
- [ ] Re-enable canary: in the #production Slack channel, re-enable the canary fleet by running `/chatops run canary --enable --staging`
- [ ] Re-enable HAProxy with Terraform
- [ ] Notify @release-managers and @sre-oncall that the exercise has been aborted.
- [ ] Set label ~change::aborted: `/label ~change::aborted`
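The checklist restores HAProxy via Terraform. For reference, the manual inverse of the execution-step removal would be `target-pools add-instances`. The sketch below only echoes the commands it would run (a dry run); the pool and instance names are hypothetical, and Terraform remains the supported rollback path.

```shell
# Dry-run sketch of the inverse of the removal loop. Pool and instance
# names are hypothetical; DRY_RUN=1 prints each command instead of running it.
DRY_RUN=1
run() {
  if [ "${DRY_RUN}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

for pool in fe-haproxy-pool; do   # hypothetical target pool name
  run gcloud --project=gitlab-staging-1 compute target-pools add-instances "${pool}" \
    --region=us-east1 \
    --instances=haproxy-main-02 \
    --instances-zone=us-east1-d
done
```

Flipping `DRY_RUN` to 0 would execute the echoed commands instead.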