2024-04-11: [GSTG] Gameday Zonal Outage
C2
Production Change - Criticality 2
Change Summary
This production issue is to be used for Gamedays as well as for recovery in case of a zonal outage. It outlines the steps to follow when restoring Gitaly VMs from a failed zone into a new zone and adding new Patroni replicas in a new zone.
Gameday execution roles and details
Role | Assignee |
---|---|
Change Technician | @cmcfarland |
Change Technician II | @jarv |
- Services Impacted - Service: Gitaly, Service: PatroniV14, Service: PatroniCiV14, Service: PatroniRegistryV14
- Time tracking - 90 minutes
- Downtime Component - 30 minutes
Provide a brief summary indicating the affected zone
- Restoring `gitaly-02` in `us-east1-b` as `gitaly-02a` in `us-east1-c` due to the outage in `us-east1-b`.
- Adding additional capacity for patroni-main, patroni-registry, and patroni-ci in `us-east1-c` due to the outage in `us-east1-b`.
[For Gamedays only] Preparation Tasks
- One week before the gameday, make an announcement in the Slack #production_engineering channel: Tomorrow April 11th 13:00 UTC, we will be executing our quarterly gameday. The process will involve moving traffic away from a single zone in gstg and restoring Gitaly nodes to a new zone, as well as increasing the capacity for Patroni. This should take approximately 90 minutes. The aim of this exercise is to test our disaster recovery capabilities and measure whether we are still within the RTO & RPO targets set by the [DR working group](https://handbook.gitlab.com/handbook/company/working-groups/disaster-recovery/) for GitLab.com. See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17813
- Ensure the MRs to provision the Patroni replicas and the Gitaly VMs are rebased and approved.
  - MR that adds Patroni replicas with zone overrides: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8125
  - MR that adds Gitaly servers within the available zone: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8126
- Notify the release managers on Slack by mentioning `@release-managers` and referencing this issue, and await their acknowledgment.
- Notify the EOC on Slack by mentioning `@sre-oncall` and referencing this issue, and wait for approval, indicated by the addition of the ~eoc_approved label.
  - Example message: @release-managers or @sre-oncall LINK_TO_THIS_CR is scheduled for execution. We will be disabling canary, and we will also be taking a single Gitaly node offline for approximately 30 minutes to avoid data loss. This will only affect projects that are on that Gitaly VM; it won't be used for new projects.
- Post a similar message to the #test-platform channel on Slack.
Detailed steps for the change
Change Steps - steps to take to execute the change
Execution
- Set label ~change::in-progress: `/label ~change::in-progress`
- Merge the MR that adds a new Patroni node and make sure it includes a zone override. NOTE: It can take up to 30 minutes before you are able to log in to the new replica. https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8125
- [Skip during Gameday] Merge the MR to add a new Gitaly server within the available zone. NOTE: It can take up to 20 minutes before you are able to log in to the new Gitaly server. https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8126
  ❗ ❗ NOTE ❗ ❗ These two steps above ⬆ take a while to complete; the other steps can be executed in parallel.
- Set the HAProxy backends that correspond to the failed zone into maintenance. This will route all front-end traffic to the other two zones (see the example below): `chef-repo$ ./bin/set-server-state -z <zone {b,c,d}> gstg maint`
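  A minimal sketch of the invocation with the zone filled in, assuming `us-east1-b` is the affected zone; double-check the script's current flags in chef-repo before running:

  ```sh
  # Run from a checkout of chef-repo.
  # Puts the gstg HAProxy backends for zone b into maintenance so traffic
  # is routed to the remaining zones. The zone value "b" is an assumption
  # for this example; use the zone that is actually affected.
  ./bin/set-server-state -z b gstg maint
  ```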
- Drain the canary environment by running the following command in the Slack #production channel: `/chatops run canary --disable --staging`
  ❗ ❗ NOTE ❗ ❗ If there are ongoing deployments, confirm with the release manager whether you can ignore deployment checks by using the `--ignore-deployment-check` flag (see the example below). Ref: GitLab Chatops
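  If the release manager confirms that deployment checks can be ignored, the drain command with the flag from the note above would look roughly like this (a sketch; confirm the exact ChatOps syntax before running):

  ```
  /chatops run canary --disable --staging --ignore-deployment-check
  ```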
Validation of the Patroni replica and Gitaly VMs
- Wait for the instances to be built and Chef to converge.
- Staging: Confirm that Chef runs have completed for the new storages and Patroni replicas; it can take up to 30 minutes before they show up.
Troubleshooting tips
- Tail the serial output to confirm that the startup script executed successfully:
  `gcloud compute --project=$project instances tail-serial-port-output $instance_name --zone=$zone --port=1`
  The variable `$project` represents the project: `gitlab-staging-1` for the Patroni replicas, or the Gitaly project (e.g. `gitaly-gstg-380a`) for the Gitaly storages. `$instance_name` represents the instance, e.g. `patroni-main-v14-105-db-gstg` or `gitaly-02b-stor-gstg`, and `$zone` represents the recovery zone, e.g. `us-east1-c`. A filled-in example is shown after this list.
- We could also tail the bootstrap logs, for example:
  `tail -f /var/tmp/bootstrap-20231108-133642.log`
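For example, filling in the variables for a Patroni replica recovered into `us-east1-c` (values taken from the examples above):

```sh
# Tail the serial console of the new Patroni replica in the recovery zone
# to confirm the startup script ran to completion.
gcloud compute --project=gitlab-staging-1 instances tail-serial-port-output \
  patroni-main-v14-105-db-gstg --zone=us-east1-c --port=1
```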
- Once the Patroni replicas are ready, SSH into each replica and start Patroni: `sudo systemctl enable patroni && sudo systemctl start patroni` (see the verification sketch below).
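  A minimal sketch of starting Patroni on a new replica and confirming the unit came up; the verification commands beyond the enable/start step above are standard systemd calls added here for convenience:

  ```sh
  # On each new replica: enable and start Patroni (from the step above),
  # then confirm the service is active and inspect recent logs if it is not.
  sudo systemctl enable patroni && sudo systemctl start patroni
  sudo systemctl is-active patroni                                   # expect "active"
  sudo journalctl -u patroni --since "10 min ago" --no-pager | tail -n 50
  ```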
- SSH into the Gitaly VMs:
  - Execute `sudo gitlab-ctl status` to validate that the servers are up.
  - Validate that the disk is properly mounted, using Thanos.
    ❗ ❗ NOTE ❗ ❗ The Thanos graph is an example to offer guidance; you may be required to change some parameters, e.g. `fqdn`.
- Reconfigure the regional cluster to exclude the affected zone by setting `regional_cluster_zones` in Terraform to a list of zones that are not impacted (a sketch is shown below).
  ❗ ❗ NOTE ❗ ❗ This takes a while to complete (approximately 30 minutes) and it locks Terraform jobs, so it should be executed last.
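  The variable name comes from the step above; the exact file location and the surviving-zone values are assumptions for illustration. With `us-east1-b` impacted, the override might look like:

  ```hcl
  # Example override in the gstg Terraform configuration (location assumed):
  # exclude the affected zone so the regional cluster only uses healthy zones.
  regional_cluster_zones = ["us-east1-c", "us-east1-d"]
  ```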
POINT OF NO RETURN
- Set the weights to zero on the affected storages.
  - Staging: With an admin account, navigate to Repository Storage Settings and set the weight to 0 for the affected storages (see the sketch below).
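As an alternative to the admin UI, the same weight change can in principle be made via the Application Settings API. This is a hedged sketch only: the token, host, and storage name are placeholders, and the parameter form should be verified against the current API docs before use.

```sh
# Set the weight of the affected storage to 0 so new projects are not placed
# on it. <affected-storage> and $ADMIN_TOKEN are placeholders; -g stops curl
# from globbing the brackets in the query string.
curl -g --request PUT \
  --header "PRIVATE-TOKEN: $ADMIN_TOKEN" \
  "https://staging.gitlab.com/api/v4/application/settings?repository_storages_weighted[<affected-storage>]=0"
```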
Add the new Storages to the Rails and Gitaly configuration
- [For Gamedays only] Stop the Gitaly node that we intend to take a snapshot of, then create the snapshots:
  `gcloud compute instances stop --project=gitlab-gitaly-gstg-380a --zone="us-east1-b" "gitaly-02-stor-gstg"`
  `gcloud compute snapshots create "file-gitaly-02-gameday-snapshot" --source-disk="gitaly-02-stor-gstg-data" --description="Part of https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17189" --project=gitlab-gitaly-gstg-380a --source-disk-zone="us-east1-b"`
  `gcloud compute instances stop --project=gitlab-gitaly-gstg-164c --zone="us-east1-b" "gitaly-02-stor-gstg"`
  `gcloud compute snapshots create "file-gitaly-02-gameday-snapshot" --source-disk="gitaly-02-stor-gstg-data" --description="Part of https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17189" --project=gitlab-gitaly-gstg-164c --source-disk-zone="us-east1-b"`
  ❗ ❗ NOTE ❗ ❗ Remember to replace the project names; the values used above are examples to offer guidance. A snapshot verification sketch is shown below.
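  To confirm the snapshots completed before proceeding, something like the following can be used (project and snapshot names taken from the example commands above):

  ```sh
  # Check snapshot status in each Gitaly project; the status should be READY.
  gcloud compute snapshots describe file-gitaly-02-gameday-snapshot \
    --project=gitlab-gitaly-gstg-380a --format="value(name,status)"
  gcloud compute snapshots describe file-gitaly-02-gameday-snapshot \
    --project=gitlab-gitaly-gstg-164c --format="value(name,status)"
  ```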
- [For Gamedays only] We can now merge the MR that adds a new Gitaly server within the available zone. NOTE: It can take up to 20 minutes before you are able to log in to the new Gitaly server.
- Create an MR against chef-repo that adds the new storages to the environment's Chef-managed configuration. Note that it will take around 30 minutes for Chef to converge.
- Create an MR against k8s-workloads/gitlab-com that will add the new storage to the K8s configuration.
  - Example MR: gitlab-com/gl-infra/k8s-workloads/gitlab-com!3260 (merged)
  ❗ ❗ NOTE ❗ ❗ Ensure all pipelines have passed; `qa-reliable` and `qa-smoke` test failures can lead to auto-deploys being blocked.
Validate that the new nodes are receiving traffic
- ❗ ❗ NOTE ❗ ❗ The dashboards above are examples to offer guidance; you may be required to change some parameters, e.g. `fqdn`.
- Validate that a project can be created on the new storages:
  - `glsh gitaly storages validate -e gstg gitaly-02a-stor-gstg.c.gitlab-gitaly-gstg-164c.internal`
  - `glsh gitaly storages validate -e gstg gitaly-02a-stor-gstg.c.gitlab-gitaly-gstg-380a.internal`
  ❗ ❗ NOTE ❗ ❗ Remember to replace the hostnames; the values used above are examples to offer guidance.
[For Gamedays only] Clean up
- Collect logs from VMs to get startup-script and Chef converge times.
- Revert the MR that adds Patroni replicas with zone overrides; this is the MR that was merged at execution step 2.
- Revert the MR that adds Gitaly servers within the available zone (only revert this MR if we didn't execute to the point of no return). If we executed past the point of no return, remove the instances that were stopped (in this example `gitaly-02`).
  ❗ ❗ NOTE ❗ ❗ When reverting the above ⬆ MR, as of this gameday you need to comment `atlantis approve_policies` in the MR to bypass the policies before applying with `atlantis apply`. You also need to disable the "all pipelines must pass" condition in the MR settings.
- Revert the MR that reconfigures the regional cluster.
- Restore the weights of the affected storages.
Wrapping up
- Re-enable canary
  - In the #production Slack channel, run the following command to re-enable the canary fleet: `/chatops run canary --enable --staging`
- Set the server state for `gitaly-02` back to ready:
  - `chef-repo$ ./bin/set-server-state -z <zone {b,c,d}> gstg ready` (e.g. `-z b`)
- Notify the `@release-managers` and `@sre-oncall` that the exercise is complete.
- Set label ~change::complete: `/label ~change::complete`
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
It is estimated that this will take 5m to complete
- Re-enable canary
  - In the #production Slack channel, run the following command to re-enable the canary fleet: `/chatops run canary --enable --staging`
- Set the server state for `gitaly-02` back to ready:
  - `chef-repo$ ./bin/set-server-state -z <zone {b,c,d}> gstg ready` (e.g. `-z b`)
- Set label ~change::aborted: `/label ~change::aborted`