Staging: FY26-Q3 Gitaly DR Gameday (Friday 2025-10-17)
Change Summary
This is a gameday change request to test restoring Gitaly VMs in a single zone.
(The production outage section has been removed)
Change Details
- Services Impacted - Service::Gitaly
- Change Technician - xx
- Change Reviewer - xx
- Scheduled Date and Time (UTC in format YYYY-MM-DD HH:MM) - 2025-xx-xx xx:00
- Time tracking - 90 minutes
- Downtime Component - 30 minutes
Provide a brief summary indicating the affected zone
Gitaly hosts are currently in us-east1-c
Important
If your change involves scheduled maintenance, add a step to set and unset maintenance mode per our runbooks. This will make sure SLA calculations adjust for the maintenance period.
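A minimal sketch of toggling maintenance mode through the application settings API, assuming an admin PAT exported as GITLAB_STAGING_TOKEN and the staging.gitlab.com host (both assumptions; follow the runbook for the authoritative procedure):

```shell
# Enable maintenance mode for the window, then disable it again afterwards.
curl --request PUT --header "PRIVATE-TOKEN: ${GITLAB_STAGING_TOKEN}" \
  "https://staging.gitlab.com/api/v4/application/settings?maintenance_mode=true"

curl --request PUT --header "PRIVATE-TOKEN: ${GITLAB_STAGING_TOKEN}" \
  "https://staging.gitlab.com/api/v4/application/settings?maintenance_mode=false"
```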
Preparation
Note
The following checklists must be done in advance, before setting the label ~"change::scheduled"
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
Change Technician checklist
- The Change Criticality has been set appropriately and requirements have been reviewed.
- The change plan is technically accurate. - Unknown; the purpose of this exercise is to validate the plan.
- The rollback plan is technically accurate and detailed enough to be executed by anyone with access. - Unknown; the purpose of this exercise is to validate the plan.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Change has been tested in staging and results noted in a comment on this issue. - This is a test in staging.
- A dry-run has been conducted and results noted in a comment on this issue. - Not applicable.
- The change execution window respects the Production Change Lock periods.
- Once all boxes above are checked, mark the change request as scheduled: /label ~"change::scheduled"
- For C1 and C2 change issues, the change event is added to the GitLab Production calendar by the change-scheduler bot. It is scheduled to run every 2 hours.
- For C1 change issues, a Senior Infrastructure Manager has provided approval with the ~manager_approved label on the issue.
- For C2 change issues, an Infrastructure Manager has provided approval with the ~manager_approved label on the issue.
- Mention @gitlab-org/saas-platforms/inframanagers in this issue to request approval and provide visibility to all infrastructure managers.
- For C1, C2, or ~"blocks deployments" change issues, confirm with Release Managers that the change does not overlap with or hinder any release process (in the #production channel, mention @release-managers and this issue, and await their acknowledgment).
Preparation Tasks
The Week Prior
- One week before the gameday, make an announcement on Slack in #f_gamedays, copy the link to the production_engineering channel, and request approval from @release-managers. Consider also sharing this post in the appropriate environment channels (staging or production).
  > Next week on [DATE & TIME] we will be executing a Gitaly gameday. The process will involve moving traffic away from a single zone in gstg, and moving Gitaly nodes to a new zone. This should take approximately 90 minutes. The aim of this exercise is to test our disaster recovery capabilities and measure whether we are still within the RTO & RPO targets set by the [DR working group](https://handbook.gitlab.com/handbook/company/working-groups/disaster-recovery/) for GitLab.com. See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17274
- Request approval from the Infrastructure Manager, wait for approval, and confirm it via the ~manager_approved label.
- Notify the release managers on Slack by mentioning @release-managers and referencing this issue, and await their acknowledgment.
- Add an event to the GitLab Production calendar.
- Ensure you have a PAT; you will require admin API access. Select the api, read_api, and admin_mode scopes, and validate that you have an SSH key set up for staging.
- Confirm that you can run knife from your chef-repo. You can verify the setup with `knife status`. A minimal pre-check sketch follows this list.
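A minimal pre-check sketch, assuming a staging admin token exported as GITLAB_STAGING_TOKEN and a local chef-repo checkout at ~/chef-repo (both placeholders; adjust to your setup):

```shell
# Confirm the token is active and carries the expected scopes (api, read_api, admin_mode).
curl --silent --header "PRIVATE-TOKEN: ${GITLAB_STAGING_TOKEN}" \
  "https://staging.gitlab.com/api/v4/personal_access_tokens/self" | jq '{name, scopes, active}'

# Confirm knife connectivity from the chef-repo checkout.
cd ~/chef-repo && knife status
```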
Day Of
- Notify the release managers on Slack by mentioning @release-managers and referencing this issue, and await their acknowledgment.
- Notify the EOC on Slack by mentioning @sre-oncall and referencing this issue, and wait for approval, confirmed by the ~eoc_approved label. Example message:
  > @release-managers or @sre-oncall LINK_TO_THIS_CR is scheduled for execution. We will be taking a single Gitaly node offline in staging `gstg` for approximately 30 minutes so as to avoid data loss. This will only affect projects that are on that Gitaly VM; it won't be used for new projects. Kindly review and approve the CR.
- Share this message in the #test-platform and #staging channels on Slack.
Detailed steps for the change
Change Steps
Execution
- Consider starting a recording of the process now.
- Note the start time in UTC in a comment to record this process duration.
- Set label change::in-progress: /label ~change::in-progress
- Identify the Gitaly server to run the checksum command (a minimal instance-listing sketch follows this item).
  - `<name>-stor-gstg.c.gitlab-gitaly-gstg-164c.internal` is likely to be the storage name for a host. You can cross-reference the Rails configuration for validation; additionally, assuming we know which zone is affected, we can navigate to the Repository Storage Settings admin page to confirm.
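  - A minimal sketch for cross-checking which Gitaly VMs actually live in the affected zone, assuming the gitlab-gitaly-gstg-164c project and us-east1-c as examples from this issue (repeat for the other Gitaly project):
    ```shell
    # List Gitaly VMs in the affected zone; project and zone are example values.
    gcloud compute instances list \
      --project=gitlab-gitaly-gstg-164c \
      --zones=us-east1-c \
      --format="table(name,zone,status)"
    ```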
- Get a list of Gitaly projects by running the following:
  ```shell
  gcloud projects list --filter="name:gitlab-gitaly-<environment>*"
  ```
- Note down the md5sum of a repository on the affected storage in a comment on this issue:
  ```shell
  glsh gitaly repositories checksum -s gitaly-02-stor-gstg.c.gitlab-gitaly-gstg-164c.internal -e gstg -g gitlab-com
  ```
  - Replace `gitaly-02-stor-gstg.c.gitlab-gitaly-gstg-164c.internal` with the storage name (not necessarily the FQDN) of one of the impacted storages.
  - Note that there are multiple Gitaly projects in gstg and gprd; in the above example we are calculating the md5sum on one of the impacted storages, i.e. from the gitlab-gitaly-gstg-164c project.
- Use this script to create and push commits to a new branch in config-mgmt, k8s-workloads/gitlab-com, and chef-repo. The script creates new commits with the necessary changes in the tmp directory.
  ```shell
  cd runbooks/scripts/disaster-recovery
  bundle
  bundle exec ruby gitaly-replace-nodes.rb -e gstg -z <impacted zone> --app-config --commit --push
  # Where all of the Gitaly nodes in us-east1-d will be redistributed among the remaining zones.
  ```
  - You can use the -w parameter to pass a path where the repositories will be cloned:
    ```shell
    cd runbooks/scripts/disaster-recovery
    bundle
    bundle exec ruby gitaly-replace-nodes.rb -e gstg -z <impacted zone> --app-config --commit --push -w temp_dir
    ```
- Create merge requests in config-mgmt, k8s-workloads/gitlab-com, and chef-repo against the new branches created by the script. Since the commits have already been pushed, navigating to the repositories will give you an option to create the MR from the UI.
  - MR that adds Gitaly servers within the available zone in config-mgmt. Confirm that the os_disk_snapshot_search_string in the generated MR is correct. Use this gcloud command to confirm that the data snapshot exists and, if not, update the MR with the correct one:
    ```shell
    gcloud compute snapshots list --project <gitlab-gitaly-gstg-380a> --sort-by=~creationTimestamp --format="table[box,margin=3,title='Most recent snapshot'](name,creationTimestamp,sourceDisk,status)"
    ```
    - Example gstg: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8788/diffs
    - Example gprd: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8631/diffs
    - ❗ ❗ IMPORTANT NOTE: ❗ ❗ When testing the ability to restore production instances, this should be the only MR that gets created and merged. We cannot perform an application switchover to the new nodes during a game day in GPRD; therefore, stop at this step.
    - In the above ⬆️ example MR, only one set of nodes was provisioned; in an actual recovery scenario, there will be 3 total node blocks that need to be created.
  - MR to update application configuration in k8s-workloads/gitlab-com
  - MR to update application configuration in chef-repo
- Set the weights to zero on the affected storages. The list of storages can be found in the MR from the step above. With an admin account, navigate to Repository Storage Settings and set the weight to 0 for the affected storages that correspond to the instances being replaced (an API sketch follows this item).
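  - The same change can also be made through the application settings API; a minimal sketch, assuming an admin PAT in GITLAB_STAGING_TOKEN, the staging.gitlab.com host, and placeholder storage names (use the storages identified in the MR):
    ```shell
    # Set the weight of the affected storages to 0 via the API; the storage names here are placeholders.
    curl --request PUT --header "PRIVATE-TOKEN: ${GITLAB_STAGING_TOKEN}" \
      --header "Content-Type: application/json" \
      --data '{"repository_storages_weighted": {"gitaly-01-stor-gstg": 0, "gitaly-02-stor-gstg": 0}}' \
      "https://staging.gitlab.com/api/v4/application/settings"
    ```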
- Stop the Gitaly nodes and create new snapshots (since we currently have two Gitaly projects in gstg). A snapshot status-check sketch follows this block.
  - ❗ ❗ NOTE: ❗ ❗ The zone, instance, and disk names in these commands are examples. They need to be modified to match the values corresponding to the instances you are replacing. Ensure you remove the placeholders `<>` when you fill in the correct names. We only take snapshots of the data disk, as that is where the customer data lives; the other disks are ephemeral, managed by Chef, and can easily be recreated.
  ```shell
  gcloud compute instances stop --project=gitlab-gitaly-gstg-380a --zone="<us-east1-b>" "<gitaly-01-stor-gstg>"
  gcloud compute snapshots create "file-gitaly-01-gameday-snapshot" --source-disk="<gitaly-01-stor-gstg-data>" --description="Part of https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17189" --project=gitlab-gitaly-gstg-380a --source-disk-zone="us-east1-b"

  gcloud compute instances stop --project=gitlab-gitaly-gstg-164c --zone="us-east1-b" "<gitaly-01-stor-gstg>"
  gcloud compute snapshots create "file-gitaly-01-gameday-snapshot" --source-disk="<gitaly-01-stor-gstg-data>" --description="Part of https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17189" --project=gitlab-gitaly-gstg-164c --source-disk-zone="us-east1-b"
  ```
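  - Before moving on, it can help to confirm the new snapshots are ready; a minimal sketch using the example snapshot name and project from above:
    ```shell
    # Check snapshot status; it should report READY before the Terraform plan is re-run.
    gcloud compute snapshots describe "file-gitaly-01-gameday-snapshot" \
      --project=gitlab-gitaly-gstg-164c --format="value(status)"
    ```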
- Verify that the repositories on the storage are no longer available and paste the output as a comment. The same glsh gitaly repositories checksum command from above should exit with an error similar to this one:
  ```
  glsh gitaly repositories checksum -s gitaly-02-stor-gstg.c.gitlab-gitaly-gstg-164c.internal -e gstg -g gitlab-com
  ✘ curl exited non-zero!
  curl: (22) The requested URL returned error: 500
  {"message":"500 Internal Server Error"}
  ```
- Check and validate that the Terraform plan in the config-mgmt MR's Atlantis run is creating disks using the snapshots that were just taken in the previous step.
  - Rerun the Terraform plan and verify the snapshots are correct.
  - Review the plan, identify the source snapshot for the data disks, and confirm that the snapshots are the ones created earlier in this process.
  - ❗ ❗ NOTE: ❗ ❗ There might be scheduled snapshots taken after the manually created snapshots; these will probably be 0 B in size and might appear in the Terraform plan instead of our manually created snapshots.
- Merge the [config-mgmt MR][config-mgmt] to provision the new Gitaly instances if things appear correct.
  - ❗ ❗ NOTE: ❗ ❗ The step above ⬆️ will take 10-20 minutes to complete.
Validate the Gitaly VMs
- Wait for the instances to be built and Chef to converge.
- Confirm that Chef runs have completed for the new storages; it can take up to 20 minutes before they show up (a knife sketch follows this list).
- Validate that bootstrapping has completed.
  - We can also tail the latest bootstrap log, for example: `tail -f /var/tmp/bootstrap-20231108-133642.log`, where `20231108` represents the date of creation of the log file in `yyyymmdd` format and `133642` represents the time in UTC in `HHMMSS` format.
  - Bootstrapping is completed once the machine has rebooted and the most recent bootstrap script log from the game day has reported: `Bootstrap finished`
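- A minimal sketch for confirming recent Chef check-ins on the new nodes, assuming the node names follow the gitaly-01a-stor-gstg pattern used in the troubleshooting tips below:
  ```shell
  # Show when the new Gitaly nodes last completed a Chef run.
  knife status "name:gitaly-*a-stor-gstg*"
  ```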
Troubleshooting tips
- Tail the serial output to confirm that the start-up script executed successfully:
  ```shell
  gcloud compute --project=$project instances tail-serial-port-output $instance_name --zone=$zone --port=1
  ```
  The variables: `$project` represents the Gitaly project, e.g. gitlab-gitaly-gstg-380a for the Gitaly storages; `$instance_name` represents the instance, e.g. gitaly-01a-stor-gstg; and `$zone` represents the recovery zone, e.g. us-east1-c.
- SSH into the Gitaly VMs:
  - Ensure a separate disk has been mounted to `/var/opt/gitlab`.
    - shell (fleet-wide): `knife ssh --no-host-key-verify -C 10 "role:gprd-base-stor-gitaly" "mount | grep /opt/gitlab"`
    - shell (single host): `mount | grep /opt/gitlab`
  - Mimir
    - ❗ ❗ NOTE ❗ ❗ The graph above is an example to offer guidance; you may be required to change some parameters, e.g. fqdn.
- Execute `sudo gitlab-ctl status` to validate that the servers are up (a fleet-wide sketch follows this item).
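  - A minimal sketch for running the same check across the fleet with knife; the gstg-base-stor-gitaly role name is an assumption modelled on the gprd role used earlier:
    ```shell
    # Check service status on all staging Gitaly nodes in one pass.
    knife ssh --no-host-key-verify -C 10 "role:gstg-base-stor-gitaly" "sudo gitlab-ctl status"
    ```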
POINT OF NO RETURN
Add the new Storage to the Rails and Gitaly configuration
- Merge the k8s-workloads/gitlab-com [MR][gitlab-com].
- Merge the chef-repo [MR][chef-repo].
- Once the chef-repo pipeline succeeds, force a chef-client run across the entire Gitaly fleet:
  ```shell
  knife ssh -t 3 -C 10 'chef_environment:gstg AND (name:gitaly* OR name:file-hdd*)' 'sudo chef-client'
  ```
Validate that the new nodes are receiving traffic
- ❗ ❗ NOTE ❗ ❗ The dashboards above are examples to offer guidance; you may be required to change some parameters, e.g. fqdn.
- Validate that a project can be created on the new storage:
  ```shell
  glsh gitaly storages validate -e gstg gitaly-01-stor-gstg.c.gitlab-gitaly-gstg-164c.internal
  glsh gitaly storages validate -e gstg gitaly-01-stor-gstg.c.gitlab-gitaly-gstg-380a.internal
  ```
  - ❗ ❗ NOTE ❗ ❗ Please note it may take several minutes for the validation commands to give reliable output.
- Validate that the repository we previously collected an md5sum from is available and that the checksums match. Copy the md5sum line as a comment in this issue.
  ```shell
  glsh gitaly repositories checksum -s gitaly-02-stor-gstg.c.gitlab-gitaly-gstg-164c.internal -e gstg -g gitlab-com
  ```
  - ❗ ❗ NOTE ❗ ❗ Remember to replace the storage name with the one whose hosts are being migrated. The values used above are examples to offer guidance.
- Validate that the number of errors returned by the application is at a nominal level.
- Restore the weights of the affected storages.
- Note the conclusion time in UTC in a comment to record this process duration.
- Collect timing information (a small elapsed-time sketch follows this list):
  - Collect bootstrap timing information.
    - shell: `for node in $(knife node list | grep -E 'gitaly-0\da-stor-gstg'); do ./bin/find-bootstrap-duration.sh $node ; done`
    - Note the results in a comment.
  - Collect apply and completion times for the Terraform apply.
    - Note in a comment the times when the apply was triggered and when it was posted that the apply was completed.
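- A minimal sketch for turning the start/conclusion comments into an elapsed time, assuming GNU date and placeholder timestamps:
  ```shell
  # Compute the duration in minutes between the recorded start and conclusion times.
  start="2025-10-17T09:00:00Z"; end="2025-10-17T10:25:00Z"
  echo "$(( ($(date -d "$end" +%s) - $(date -d "$start" +%s)) / 60 )) minutes"
  ```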
Clean up
- Create and merge the MR to the config-mgmt repo, replacing the now-unused Gitaly instances with our newly created Gitaly nodes in gitaly-multi-project.tf. If the Gitaly node replacement Ruby script was used, ensure gitaly-recovery-nodes.tf has the original contents. This needs to be done in two steps:
  - Remove deletion protection from the existing nodes.
    - Example MR 👉 https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/9251
    - Ensure that the data_disk_snapshot_search_string attribute is also removed from the newly created instances.
  - Second MR to remove the VMs.
    - Example MR: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/9252
    - ❗ ❗ NOTE ❗ ❗ When deleting Gitaly instances, you need to add the ~approve-policies label and comment `atlantis approve_policies` in the MR to bypass the policies before applying with `atlantis apply`.
- Remove any data disk snapshots that were taken prior to restoring the nodes. These will not be removed automatically (a listing sketch follows this list).
  - Verify details about the snapshots:
    ```shell
    gcloud compute snapshots describe [SNAPSHOT_NAME]
    ```
  - Delete the snapshots:
    ```shell
    gcloud compute snapshots delete [SNAPSHOT_NAME]
    ```
    - ❗ ❗ Warning: ❗ ❗ Deleting a snapshot is irreversible. You can't recover a deleted snapshot.
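- A minimal sketch for finding the snapshot names to clean up, assuming the "gameday" naming used earlier in this issue and the two gstg Gitaly projects:
  ```shell
  # List candidate gameday snapshots in both staging Gitaly projects before deleting them.
  for project in gitlab-gitaly-gstg-164c gitlab-gitaly-gstg-380a; do
    gcloud compute snapshots list --project "$project" --filter="name~gameday" \
      --format="table(name,creationTimestamp,sourceDisk,status)"
  done
  ```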
Wrapping up
- Notify @release-managers and @sre-oncall that the exercise is complete.
  - ❗ ❗ NOTE ❗ ❗ Ensure all unused Gitaly nodes have been deleted before signalling completion.
- Set label change::complete: /label ~change::complete
- Compile the real-time measurement of this process and update the Recovery Measurements Runbook.
Rollback
Rollback steps - steps to be taken in the event of a need to roll back this change
If the point of no return has not been reached, and no impactful changes have been applied:
- Notify @release-managers and @sre-oncall that the exercise is being aborted.
- Set label change::aborted using /label ~change::aborted.
- Close the MRs without merging them.
- Start the VMs again (see the sketch after this list).
- Restore the storage weights back to their previous values.
- Document the reason for aborting in the issue, including any observations or anomalies.
- No further action is required; the change is safely aborted.
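A minimal sketch for restarting the stopped nodes, reusing the example project, zone, and instance name from the stop commands above (adjust to the instances that were actually stopped):

```shell
# Start a previously stopped Gitaly VM; values are the examples used earlier in this issue.
gcloud compute instances start --project=gitlab-gitaly-gstg-164c \
  --zone="us-east1-b" "gitaly-01-stor-gstg"
```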
If the point of no return has been reached, and changes have already been applied:
- Notify @release-managers and @sre-oncall that the rollback process is starting.
- Set label change::aborted using /label ~change::aborted.
- Follow the cleanup and recovery steps outlined above to revert or stabilize the environment.
- Document the reason for rollback in the issue, and provide relevant logs or output.
Monitoring
Key metrics to observe
- Completed Chef runs: Staging | Production
- Gitaly Dashboard: Staging | Production
- Gitaly RPS by FQDN: Staging | Production
- Gitaly Errors by FQDN: Staging | Production