[gstg] Traffic Routing Zonal Outage Gameday
C2
Production Change - Criticality 2
Change Summary
This production issue is used for Gamedays and for recovery during a real zonal outage. It outlines the steps to follow when testing traffic shifts away from a zone. Corrective actions identified during testing should feed back into the steps we take during a real outage.
Gameday execution roles and details
Role | Assignee |
---|---|
Change Technician | @swainaina |
Change Technician II | @thisisshreya |
- Services Impacted - TBD
- Time tracking - 90 minutes
- Downtime Component - 30 minutes
Provide a brief summary indicating the affected zone
[For Gamedays only] Preparation Tasks
- One week before the gameday, make an announcement in the production_engineering channel on Slack. Consider also posting it in the appropriate environment channel (staging or production).
  - Example message:
Next week on Monday 5PM CET, we will be executing our quarterly gameday. The process will involve moving traffic away from a single zone in gstg and should take approximately 90 minutes. The aim for this exercise is to test our disaster recovery capabilities and measure if we are still within our RTO & RPO targets set by the [DR working group](https://handbook.gitlab.com/handbook/company/working-groups/disaster-recovery/) for GitLab.com. See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18142
- Notify the release managers on Slack by mentioning @release-managers and referencing this issue, and await their acknowledgement.
- Notify the EOC on Slack by mentioning @sre-oncall and referencing this issue, and wait for approval, which is indicated by the eoc_approved label.
  - Example message:
@release-managers or @sre-oncall https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18142 is scheduled for execution. We will be disabling canary, and we will also be taking a single Gitaly node offline for approximately 30 minutes so as to avoid data loss. This will only affect projects that are on that Gitaly VM; it won't be used for new projects.
- Post a similar message to the #test-platform channel on Slack.
Detailed steps for the change
Change Steps - steps to take to execute the change
Execution
- If you are conducting a practice (Gameday) run of this, consider starting a recording of the process now.
- Set label ~change::in-progress: `/label ~change::in-progress`
- Remove the HAProxy instances from the GCP load balancers:

      # For every target pool that contains an HAProxy instance in us-east1-d,
      # remove each of that zone's instances from the pool.
      for i in $(gcloud --project=gitlab-staging-1 compute target-pools list \
          --filter="instances ~ us-east1-d.*haproxy" --format="value(name)"); do
        for j in $(gcloud --project=gitlab-staging-1 compute target-pools describe "${i}" \
            --region=us-east1 --format="value(instances)" \
            | sed -E 's/;/\n/g' | awk '/us-east1-d/{print}'); do
          gcloud --project=gitlab-staging-1 compute target-pools remove-instances "${i}" --region us-east1 \
            --instances="${j}" \
            --instances-zone=us-east1-d
        done
      done
- Disable the HAProxy servers:

      cd chef-repo
      ./bin/disable-server gstg us-east1-d
- Reconfigure the regional cluster to exclude the affected zone by setting regional_cluster_zones in Terraform to a list of zones that are not impacted: 👉 https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8663

  ❗ ❗ NOTE ❗ ❗ This takes a while to complete (approximately 30 minutes) and it locks Terraform jobs, so it should be executed last.
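The value for regional_cluster_zones is simply the region's zone list with the affected zone removed. As a minimal sketch, the hypothetical helper below (not part of the runbook tooling) prints that list in HCL form. It assumes the variable is a list of zone-name strings and that the region has the three zones us-east1-b, us-east1-c and us-east1-d; check both against the actual Terraform configuration before using the output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the zones that remain once the affected zone
# is excluded, formatted as an HCL list assignment.
# Assumption: regional_cluster_zones takes a list of zone-name strings.
# Usage: remaining_zones <affected-zone> <zone> [<zone> ...]
remaining_zones() {
  local affected="$1"
  shift
  local out="" zone
  for zone in "$@"; do
    # Skip the zone we are moving traffic away from.
    [ "$zone" = "$affected" ] && continue
    # Append the quoted zone, comma-separated after the first entry.
    out="${out:+$out, }\"$zone\""
  done
  printf 'regional_cluster_zones = [%s]\n' "$out"
}

# Example: exclude us-east1-d from an assumed three-zone region.
remaining_zones us-east1-d us-east1-b us-east1-c us-east1-d
```

Passing the zone list explicitly keeps the helper free of any assumption about how the zones are discovered.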
Validation
Once traffic is restricted to our remaining two zones, let's identify the impact and look for problems.
- Do we see a drop in CPU usage in one zone cluster? GSTG Per Cluster CPU Usage
- Do we see a drop in HPA targets in one zone cluster? GSTG Per Cluster HPA Target
- Examine GSTG Rails logs for errors
- Examine frontend dashboard for GSTG
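Alongside the dashboards, it can help to confirm from the command line that no instance in the affected zone is still attached to a target pool. The following is a minimal sketch rather than an established part of this procedure: `zone_instances` is a hypothetical helper that filters the semicolon-separated instance list printed by `gcloud compute target-pools describe --format="value(instances)"`, and the instance URLs in the example are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: from a semicolon-separated list of instance URLs,
# print only the entries that mention the given zone (one per line).
# Prints nothing if the zone has no instances left in the list.
# Usage: zone_instances <zone> <semicolon-separated-instances>
zone_instances() {
  local zone="$1" instances="$2"
  # grep exits non-zero when nothing matches; "|| true" keeps that from
  # being treated as a failure by callers.
  printf '%s\n' "$instances" | tr ';' '\n' | grep -F -- "$zone" || true
}

# Example with made-up instance URLs; only the us-east1-d entry is printed.
zone_instances us-east1-d \
  'https://example.invalid/zones/us-east1-c/instances/fe-01;https://example.invalid/zones/us-east1-d/instances/fe-02'

# Usage sketch against the real pools (requires gcloud access; same project
# and region as the execution steps). Empty output means the zone is drained.
# for pool in $(gcloud --project=gitlab-staging-1 compute target-pools list --format="value(name)"); do
#   zone_instances us-east1-d "$(gcloud --project=gitlab-staging-1 compute target-pools describe "$pool" \
#     --region=us-east1 --format="value(instances)")"
# done
```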
Wrapping up
- Re-enable HAProxy:

      cd chef-repo
      ./bin/enable-script gstg us-east1-d
- Set label ~change::complete: `/label ~change::complete`
- Notify the @release-managers and @sre-oncall that the exercise is complete.
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
It is estimated that this will take 5 minutes to complete.
- Re-enable HAProxy:

      cd chef-repo
      ./bin/enable-script gstg us-east1-d
- Set label ~change::aborted: `/label ~change::aborted`
- Notify the @release-managers and @sre-oncall that the exercise has been aborted.