[GSTG] Zonal Gitaly Restore/Test Gameday
C2
Production Change - Criticality 2

Change Summary
This production issue is to be used for Gamedays as well as recovery in case of a zonal outage. It outlines the steps to be followed when restoring Gitaly VMs in a single zone.
Gameday execution roles and details
| Role | Assignee |
|---|---|
| Change Technician | @thisisshreya |
| Change Reviewer | @mattmi / @cmcfarland |
- Services Impacted - Service::Gitaly
- Time tracking - 90 minutes
- Downtime Component - 30 minutes
During this game day, we'll be restoring a Gitaly node from us-east1-c to us-east1-d (replacing gitaly-01a).
Production Outage
Perform these steps in the event of an outage in production
Preparation Tasks
- Notify the release managers on Slack by mentioning @release-managers and referencing this issue, and await their acknowledgment.
- Notify the EOC on Slack by mentioning @sre-oncall and referencing this issue, and wait for approval by adding the eoc_approved label.
- Post a notification in the production Slack channel.
- Ensure all merge requests have been rebased if necessary and approved.
Detailed steps for the change
Change Steps
Execution
- Set label change::in-progress: /label ~change::in-progress
- Note the start time in UTC in a comment to record this process duration (see the timestamp snippet after this list).
- Prepare merge requests. ❗❗ NOTE ❗❗ Merge requests should target a single zonal outage recovery branch, to be determined and created beforehand, NOT master.
  - MR that adds Gitaly servers within the available zone in config-mgmt
  - MR to update application configuration in k8s-workloads/gitlab-com
  - MR to update application configuration in chef-repo
- Merge the config-mgmt MR to provision the new Gitaly instances into the zonal outage recovery branch.
- Wait for the main zonal outage recovery branch to be merged into master.
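For the time-tracking steps above, a quick way to capture the UTC timestamp to paste into the comment (plain shell; nothing assumed beyond a standard `date`):

```shell
# Print the current time in UTC for the start/conclusion comments
date -u +"%Y-%m-%d %H:%M:%S UTC"
```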
Validate Gitaly VMs

- Wait for the instances to be built and Chef to converge.
- Confirm that Chef runs have completed for the new storages; it can take up to 20 minutes before they show up. One way to check is sketched below.
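To confirm Chef has converged on the new nodes, one option is to check their last check-in time with knife. This is a sketch, assuming you have knife access configured against the relevant Chef server; narrow the name pattern to the instances being restored:

```shell
# Show the last Chef check-in time for the Gitaly nodes
# (replace the name pattern with the instances being restored)
knife status "name:gitaly*"
```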
Troubleshooting tips
- Tail the serial output to confirm that the startup script executed successfully:
  gcloud compute --project=$project instances tail-serial-port-output $instance_name --zone=$zone --port=1
  The variable $project represents the Gitaly project, e.g. gitlab-gitaly-gprd-380a for the Gitaly storages; $instance_name represents the instance, e.g. gitaly-01a-stor-gstg; and $zone represents the recovery zone, e.g. us-east1-c.
- You can also tail the bootstrap logs, for example:
  tail -f /var/tmp/bootstrap*.log
- Bootstrapping has completed once the machine has rebooted and the most recent bootstrap script log from the game day has reported: Bootstrap finished
- SSH into the Gitaly VMs (see the check sketch after this list):
  - Ensure a separate disk has been mounted to /var/opt/gitlab.
  - Execute sudo gitlab-ctl status to validate that the servers are up.
  - Validate that the data disk is properly mounted:
    mount | grep /opt/gitlab
- ❗❗ NOTE ❗❗ The graph above is an example to offer guidance; you may need to change some parameters, e.g. the fqdn.
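A minimal set of checks to run over SSH on each new VM (a sketch; these are standard commands, adjust nothing beyond the host you connect to):

```shell
# Confirm the data disk is a separate device mounted at /var/opt/gitlab
lsblk
df -h /var/opt/gitlab
mount | grep /var/opt/gitlab

# Confirm all Gitaly services are up
sudo gitlab-ctl status
```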
POINT OF NO RETURN

Add the new Storages to the Rails and Gitaly configuration

- Merge the k8s-workloads/gitlab-com MR.
- Merge the chef-repo MR.
- Once the chef-repo pipeline succeeds, force a chef-client run across the entire Gitaly fleet:
  knife ssh -t 3 -C 50 'chef_environment:gprd AND (name:gitaly* OR name:file-hdd*)' 'sudo chef-client'
Validate that the new nodes are receiving traffic
- ❗❗ NOTE ❗❗ The dashboards above are examples to offer guidance; you may need to change some parameters, e.g. the fqdn.
- Validate a project can be created on the new Storages:
  - glsh gitaly storages validate -e gstg gitaly-01-stor-gstg.c.gitlab-gitaly-gstg-164c.internal
  - glsh gitaly storages validate -e gstg gitaly-01-stor-gstg.c.gitlab-gitaly-gstg-380a.internal
  ❗❗ NOTE ❗❗ Remember to replace the storage name with the one whose hosts are being migrated. The values used above are examples to offer guidance.
- Note the conclusion time in UTC in a comment to record this process duration.
Wrapping Up
- Compile the real-time measurement of this process and update the Recovery Measurements Runbook.
Planned Game Day (GSTG)
Preparation Tasks
The Week Prior
- One week before the gameday, make an announcement in the #production_engineering channel on Slack and request approval from @release-managers. Additionally, share the message in the #staging and #test-platform channels.
  Next week on [DATE & TIME] we will be executing a Gitaly gameday. The process will involve moving traffic away from a single zone in gstg, and moving Gitaly nodes to a new zone. This should take approximately 90 minutes. The aim of this exercise is to test our disaster recovery capabilities and measure whether we are still within the RTO & RPO targets set by the [DR working group](https://handbook.gitlab.com/handbook/company/working-groups/disaster-recovery/) for GitLab.com. See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17274
- Notify the release managers on Slack by mentioning @release-managers and referencing this issue, and await their acknowledgment.
- Add an event to the GitLab Production calendar.
Day Of
- Notify the release managers on Slack by mentioning @release-managers and referencing this issue, and await their acknowledgment.
- Notify the EOC on Slack by mentioning @sre-oncall and referencing this issue, and wait for approval by adding the eoc_approved label.
  - Example message:
    @release-managers or @sre-oncall LINK_TO_THIS_CR is scheduled for execution. We will be taking a single Gitaly node offline for approximately 30 minutes to avoid data loss. This will only affect projects that are on that Gitaly VM; it won't be used for new projects. Kindly review and approve the CR.
- Share this message in the #test-platform and #staging channels on Slack.
Detailed steps for the change
Change Steps
Execution
- Consider starting a recording of the process now.
- Note the start time in UTC in a comment to record this process duration.
- Set label change::in-progress: /label ~change::in-progress
- Note down the md5sum of a repository on the affected storage in a comment on this issue:
  glsh gitaly repositories checksum -s gitaly-02-stor-gstg.c.gitlab-gitaly-gstg-164c.internal -e gstg -g gitlab-com
  - Replace gitaly-02-stor-gstg.c.gitlab-gitaly-gstg-164c.internal with the storage name (not necessarily the fqdn) of one of the impacted storages.
- Prepare merge requests
  - MR that adds Gitaly servers within the available zone in config-mgmt 👉 https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/9096
    - Example gstg: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8788/diffs
    - Example gprd: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8631/diffs
    ❗❗ IMPORTANT NOTE ❗❗ When testing the ability to restore production instances, this should be the only MR that gets created and merged. We cannot perform an application switchover to the new nodes during a game day in GPRD; therefore, stop at this step.
    In the example MR above ⬆, only one set of nodes was provisioned; in an actual recovery scenario, there will be 3 total node blocks that need to be created.
  - MR to update application configuration in k8s-workloads/gitlab-com 👉 gitlab-com/gl-infra/k8s-workloads/gitlab-com!3821 (merged)
  - MR to update application configuration in chef-repo 👉 https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/4986
- Set the weights to zero on the affected storages.
  - With an admin account, navigate to Repository Storage Settings and set the weight to 0 for the affected storages that correspond to the instances being replaced.
- Stop the Gitaly nodes and create new snapshots:
  - gcloud compute instances stop --project=gitlab-gitaly-gstg-380a --zone="us-east1-b" "gitaly-01-stor-gstg"
  - gcloud compute snapshots create "file-gitaly-01-gameday-snapshot" --source-disk="gitaly-01-stor-gstg-data" --description="Part of https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17189" --project=gitlab-gitaly-gstg-380a --source-disk-zone="us-east1-b"
  - gcloud compute instances stop --project=gitlab-gitaly-gstg-164c --zone="us-east1-b" "gitaly-01-stor-gstg"
  - gcloud compute snapshots create "file-gitaly-01-gameday-snapshot" --source-disk="gitaly-01-stor-gstg-data" --description="Part of https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17189" --project=gitlab-gitaly-gstg-164c --source-disk-zone="us-east1-b"
  ❗❗ NOTE ❗❗ The zone, instance, and disk names in these commands are examples. They need to be modified to match the values corresponding to the instances you are replacing.
- Verify that the repositories on the storage are no longer available and paste the output as a comment. The same glsh gitaly repositories checksum command from above should exit with an error:
  glsh gitaly repositories checksum -s gitaly-02-stor-gstg.c.gitlab-gitaly-gstg-164c.internal -e gstg -g gitlab-com
- Merge the config-mgmt MR to provision the new Gitaly instances.
  - Validate that the terraform plan is creating disks using the snapshots that were just taken in the previous step (see the verification sketch after this list).
  - ❗❗ NOTE ❗❗ This step above ⬆ will take 10-20 minutes to complete.
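Before merging the config-mgmt MR, it can help to confirm the snapshots created above actually exist so the terraform plan can reference them. A minimal sketch, using the example projects and snapshot naming from the commands above:

```shell
# List the gameday snapshots in both Gitaly projects
for project in gitlab-gitaly-gstg-380a gitlab-gitaly-gstg-164c; do
  gcloud compute snapshots list --project="$project" --filter="name~gameday"
done
```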
Validate Gitaly VMs

- Wait for the instances to be built and Chef to converge.
- Confirm that Chef runs have completed for the new storages; it can take up to 20 minutes before they show up.
Troubleshooting tips
- Tail the serial output to confirm that the startup script executed successfully:
  gcloud compute --project=$project instances tail-serial-port-output $instance_name --zone=$zone --port=1
  The variable $project represents the Gitaly project, e.g. gitlab-gitaly-gstg-380a for the Gitaly storages; $instance_name represents the instance, e.g. gitaly-01a-stor-gstg; and $zone represents the recovery zone, e.g. us-east1-c.
- You can also tail the bootstrap logs, for example:
  tail -f /var/tmp/bootstrap-20231108-133642.log
- Bootstrapping has completed once the machine has rebooted and the most recent bootstrap script log from the game day has reported: Bootstrap finished
- SSH into the Gitaly VMs:
  - Ensure a separate disk has been mounted to /var/opt/gitlab.
  - Execute sudo gitlab-ctl status to validate that the servers are up.
  - Validate that the data disk is properly mounted:
    mount | grep /opt/gitlab
- ❗❗ NOTE ❗❗ The graph above is an example to offer guidance; you may need to change some parameters, e.g. the fqdn.
POINT OF NO RETURN
Add the new Storages to the Rails and Gitaly configuration
- Merge the k8s-workloads/gitlab-com MR.
- Merge the chef-repo MR.
- Once the chef-repo pipeline succeeds, force a chef-client run across the entire Gitaly fleet:
  knife ssh -t 3 -C 10 'chef_environment:gstg AND (name:gitaly* OR name:file-hdd*)' 'sudo chef-client'
Validate that the new nodes are receiving traffic
- ❗❗ NOTE ❗❗ The dashboards above are examples to offer guidance; you may need to change some parameters, e.g. the fqdn.
- Validate a project can be created on the new Storages:
  - glsh gitaly storages validate -e gstg gitaly-01-stor-gstg.c.gitlab-gitaly-gstg-164c.internal
  - glsh gitaly storages validate -e gstg gitaly-01-stor-gstg.c.gitlab-gitaly-gstg-380a.internal
- Validate the repository we previously collected an md5sum from is available, and that the checksums match. Copy the md5sum line as a comment in this issue:
  glsh gitaly repositories checksum -s gitaly-02-stor-gstg.c.gitlab-gitaly-gstg-164c.internal -e gstg -g gitlab-com
  ❗❗ NOTE ❗❗ Remember to replace the storage name with the one whose hosts are being migrated. The values used above are examples to offer guidance.
- Validate that the number of errors returned by the application is at a nominal level.
- Restore the weights of the affected storages (see the weights sketch after this list).
- Note the conclusion time in UTC in a comment to record this process duration.
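For both the earlier "set the weights to zero" step and the restore step above, the weights can also be changed through the application settings API instead of the admin UI. This is a sketch, not the documented procedure for this runbook: it assumes an admin personal access token, and the instance URL, token variable, and storage name below are placeholders; the storage names must match those shown under Repository Storage Settings.

```shell
# Set the weight for an affected storage (0 to drain, e.g. 100 to restore)
curl --request PUT \
  --header "PRIVATE-TOKEN: $GITLAB_ADMIN_TOKEN" \
  --data "repository_storages_weighted[gitaly-01-stor-gstg]=0" \
  "https://staging.gitlab.com/api/v4/application/settings"
```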
Clean up
- Create and merge an MR to the config-mgmt repo, removing the now-unused Gitaly instances. ❗❗ NOTE ❗❗ When deleting Gitaly instances, you need to add the ~approve-policies label and comment atlantis approve_policies in the MR to bypass the policies before applying with atlantis apply.
- Remove any data disk snapshots that were taken prior to restoring the nodes; these will not be removed automatically (see the sketch below).
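A sketch for the snapshot clean-up, using the example snapshot name and projects from the gcloud commands earlier in this issue (adjust to whatever was actually created):

```shell
# Delete the gameday snapshots once the restored nodes are confirmed healthy
for project in gitlab-gitaly-gstg-380a gitlab-gitaly-gstg-164c; do
  gcloud compute snapshots delete "file-gitaly-01-gameday-snapshot" --project="$project" --quiet
done
```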
Wrapping up
- Notify the @release-managers and @sre-oncall that the exercise is complete. ❗❗ NOTE ❗❗ Ensure all unused Gitaly nodes have been deleted prior to signaling completion.
- Set label change::complete: /label ~change::complete
- Compile the real-time measurement of this process and update the Recovery Measurements Runbook.
Rollback
Rollback steps - steps to be taken in the event of a need to roll back this change
If the point of no return has not been passed: close any unmerged MRs, restart the stopped Gitaly instances, and restore the original storage weights.
Monitoring
Key metrics to observe
- Completed Chef runs: Staging | Production
- Gitaly Dashboard: Staging | Production
- Gitaly RPS by FQDN: Staging | Production
- Gitaly Errors by FQDN: Staging | Production
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed upon with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels blocks deployments and/or blocks feature-flags are applied as necessary.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results are noted in a comment on this issue.
  - A dry-run has been conducted and results are noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed before the change is rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
  - Release managers have been informed prior to any C1, C2, or blocks deployments change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgement.)
  - There are currently no active incidents that are severity::1 or severity::2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.