[GPRD] Gitaly DR Gameday Zonal Restore

Production Change - Criticality 2 C2

Change Summary

This production issue is to be used for Gamedays as well as recovery in case of a zonal outage. It outlines the steps to be followed when restoring Gitaly VMs in a single zone.

Overview

Create new Gitaly VMs from snapshots in a designated zone.
Collect timing information and verify that the restored VMs work.
Remove any Gitaly VMs built as part of this exercise.

Execution roles and details

Role	Assignee
Change Technician	@cmcfarland
Change Reviewer	@mattmi, @ahanselka

Services Impacted - Service::Gitaly
Time tracking - 90 minutes
Downtime Component - N/A

Restoring the Gitaly VMs from us-east1-d into us-east1-b/c due to an imagined outage in us-east1-d. This is only a test of the restore process; no Gitaly VMs will be replaced.

Planned Game Day (GPRD)

Preparation Tasks

The Week Prior

One week before the gameday, make an announcement in the Slack #production_engineering channel and request approval from @release-managers. Additionally, share the message in the #staging and #test-platform channels.

  Next week on [DATE & TIME] we will be executing a Gitaly gameday. The process will involve moving traffic away from a single zone in gstg and moving
  Gitaly nodes to a new zone. This should take approximately 90 minutes. The aim of this exercise is to test our disaster recovery capabilities and measure whether we are still within the RTO & RPO targets set by the [DR working group](https://handbook.gitlab.com/handbook/company/working-groups/disaster-recovery/) for GitLab.com.
  See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17274

Notify the release managers on Slack by mentioning @release-managers and referencing this issue and await their acknowledgment.
Add an event to the GitLab Production calendar.

Day Of

Notify the release managers on Slack by mentioning @release-managers and referencing this issue and await their acknowledgment.

Notify the EOC on Slack by mentioning @sre-oncall and referencing this issue, then wait for them to approve by adding the eoc_approved label.

Example message:

@release-managers or @sre-oncall LINK_TO_THIS_CR is scheduled for execution.
We will be taking a single Gitaly node offline for approximately 30 minutes to avoid data loss.
This will only affect projects that are on that Gitaly VM; it won't be used for new projects. Kindly review and approve the CR.

Share this message in the #test-platform and #staging channels on Slack.

Detailed steps for the change

Change Steps

Execution

Consider starting a recording of the process now.
Note the start time in UTC in a comment to record this process duration.
Set label change::in-progress: /label ~change::in-progress

Use this script to create and push commits to a new branch in config-mgmt:

cd runbooks/scripts/disaster-recovery
bundle
bundle exec ruby gitaly-replace-nodes.rb -e gprd -z us-east1-d --commit --push
# All of the Gitaly nodes in us-east1-d will be redistributed among the remaining zones.

Create merge requests in config-mgmt against the new branch created by the script, for example:
- MR that adds Gitaly servers within the available zone in config-mgmt
  - Example gstg: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8788/diffs
  - Example gprd: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8631/diffs
    
    ❗❗IMPORTANT NOTE: ❗❗
    
    When testing the ability to restore production instances, this should be the only MR that gets created and merged. We cannot perform an application switchover to the new nodes during a game day in GPRD, so stop at this step.
    
    In the above ⬆ example MR, only one set of nodes was provisioned. In an actual recovery scenario, three node blocks in total need to be created.

Validate that the Terraform plan creates disks from the snapshots that were just taken in the previous step.
- Merge the config-mgmt MR to provision the new Gitaly instances if things appear correct.
- Note the time in UTC of the apply command executed to help measure provisioning time.
- ❗❗ NOTE: ❗❗ This step above ⬆ will take 10-20 minutes to complete
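
The snapshot check on the plan can be partly automated. The following is a minimal sketch, assuming the plan has been exported with `terraform show -json` and that the new data disks are `google_compute_disk` resources whose `snapshot` attribute names the source snapshot. The function name and the sample plan are hypothetical, and the actual resource layout in config-mgmt may differ.

```python
import json


def disks_missing_snapshot(plan_json):
    """Return addresses of disks the plan creates without a source snapshot."""
    plan = json.loads(plan_json)
    missing = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "google_compute_disk":
            continue
        change = rc.get("change", {})
        if "create" not in change.get("actions", []):
            continue
        # A snapshot value that is only known after apply is absent here
        # and would also be reported as missing.
        after = change.get("after") or {}
        if not after.get("snapshot"):
            missing.append(rc["address"])
    return missing


# Hypothetical two-disk plan: one disk restored from a snapshot, one created empty.
sample_plan = json.dumps({
    "resource_changes": [
        {
            "address": "google_compute_disk.restored",
            "type": "google_compute_disk",
            "change": {"actions": ["create"], "after": {"snapshot": "example-snapshot"}},
        },
        {
            "address": "google_compute_disk.empty",
            "type": "google_compute_disk",
            "change": {"actions": ["create"], "after": {"snapshot": None}},
        },
    ]
})
print(disks_missing_snapshot(sample_plan))  # ['google_compute_disk.empty']
```

An empty result suggests every newly created disk has a snapshot source; it does not confirm that the snapshot is the recent one, so the snapshot names still need a manual look.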

Validate Gitaly VMs

Wait for the instances to be built and Chef to converge.
- Confirm that Chef runs have completed for the new storages. It can take up to 20 minutes before they show up.
- Alternative View
- Validate that bootstrapping has completed
  - shell: knife ssh --no-host-key-verify -C 10 "role:gprd-base-stor-gitaly" 'grep "Bootstrap finished" /var/tmp/$(ls -t /var/tmp/ | head -n 1)'
  - Alternatively, tail the latest bootstrap log.
    - shell: tail -f /var/tmp/bootstrap-20231108-133642.log
    - In the file name, 20231108 is the creation date of the log file in yyyymmdd format, and 133642 is the time in UTC in HHMMSS format.
  - Bootstrapping has completed once the machine has rebooted and the most recent bootstrap script log from the game day reports: Bootstrap finished
  Troubleshooting tips
  - Tail the serial output to confirm that the startup script executed successfully.
    - shell: gcloud compute --project=$project instances tail-serial-port-output $instance_name --zone=$zone --port=1
    - $project is the Gitaly project (e.g. gitaly-gstg-380a for the Gitaly storages), $instance_name is the instance (e.g. gitaly-01a-stor-gstg), and $zone is the recovery zone (e.g. us-east1-c).
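
To confirm that a bootstrap log belongs to the game day rather than to an earlier boot, the timestamp embedded in the file name can be decoded. This is a small sketch that assumes the bootstrap-<date>-<time>.log naming shown in the example above; the helper names are hypothetical.

```python
import re
from datetime import datetime, timezone

# Matches names such as bootstrap-20231108-133642.log
BOOTSTRAP_LOG = re.compile(r"bootstrap-(\d{8})-(\d{6})\.log$")


def bootstrap_log_time(path):
    """Return the UTC creation time encoded in a bootstrap log file name."""
    match = BOOTSTRAP_LOG.search(path)
    if match is None:
        raise ValueError(f"not a bootstrap log name: {path}")
    stamp = match.group(1) + match.group(2)
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def latest_bootstrap_log(paths):
    """Pick the most recent log by the time encoded in its name."""
    return max(paths, key=bootstrap_log_time)


print(bootstrap_log_time("/var/tmp/bootstrap-20231108-133642.log").isoformat())
# 2023-11-08T13:36:42+00:00
```

Comparing the decoded time with the start time noted at the beginning of the exercise shows whether the log was written during this game day.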
SSH into the Gitaly VMs:
- Ensure a separate disk has been mounted to /var/opt/gitlab
  - shell: knife ssh --no-host-key-verify -C 10 "role:gprd-base-stor-gitaly" "mount | grep /opt/gitlab"
  - shell: mount | grep /opt/gitlab
  - Mimir ❗❗NOTE❗❗ The graph above is an example offered for guidance; you may need to change some parameters, e.g. fqdn.
- Execute sudo gitlab-ctl status to validate that the servers are up
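
The mount check above can be scripted against captured `mount` output. This is a minimal sketch, assuming the usual Linux `mount` line format (`<device> on <mount point> type <fstype> (<options>)`); the function names and the device names in the sample are hypothetical.

```python
def device_mounted_at(mount_output, mount_point="/var/opt/gitlab"):
    """Return the device mounted at mount_point, or None if nothing is mounted there."""
    for line in mount_output.splitlines():
        parts = line.split()
        # Expected shape: <device> on <mount point> type <fstype> (<options>)
        if len(parts) >= 3 and parts[1] == "on" and parts[2] == mount_point:
            return parts[0]
    return None


def has_separate_data_disk(mount_output, mount_point="/var/opt/gitlab"):
    """True when mount_point is backed by a device other than the root device."""
    data_device = device_mounted_at(mount_output, mount_point)
    root_device = device_mounted_at(mount_output, "/")
    return data_device is not None and data_device != root_device


# Hypothetical output captured from `mount` on a restored VM.
sample = """/dev/sda1 on / type ext4 (rw,relatime)
/dev/sdb on /var/opt/gitlab type ext4 (rw,noatime)"""
print(device_mounted_at(sample))       # /dev/sdb
print(has_separate_data_disk(sample))  # True
```

Matching the mount point exactly avoids a false positive from a nested path that merely contains /opt/gitlab, which a plain grep would accept.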
Collect timing information
- Collect bootstrap timing information
  - shell: for node in $(knife node list | grep -E 'gitaly-0[0-9]a-stor-gprd'); do ./bin/find-bootstrap-duration.sh $node ; done
  - Note in a comment the results
- Collect apply and completion times for the Terraform apply
  - Note in a comment the times when the apply was triggered and when it was posted that the apply was completed
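
Once the UTC times are noted in comments, the durations can be derived consistently rather than by hand. This is a small sketch, assuming the times were recorded as 'YYYY-MM-DD HH:MM:SS' in UTC; the function names and the timestamps below are hypothetical, illustrative values, not measurements.

```python
from datetime import datetime, timezone


def parse_utc(stamp):
    """Parse a 'YYYY-MM-DD HH:MM:SS' timestamp recorded in UTC."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)


def elapsed_minutes(start, end):
    """Return the minutes between two recorded UTC timestamps."""
    delta = parse_utc(end) - parse_utc(start)
    if delta.total_seconds() < 0:
        raise ValueError("end precedes start")
    return delta.total_seconds() / 60


# Hypothetical times copied from the issue comments.
apply_triggered = "2024-01-01 14:02:10"
apply_completed = "2024-01-01 14:17:40"
print(f"Terraform apply took {elapsed_minutes(apply_triggered, apply_completed):.1f} minutes")
# Terraform apply took 15.5 minutes
```

The same helper works for the overall exercise duration, using the start time noted at the beginning of the execution steps.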

Clean up

Create and merge an MR to the config-mgmt repo, reverting the newly made Gitaly instances.
- If the Gitaly node replacement Ruby script was used, ensure gitaly-recovery-nodes.tf has the original contents.
  
  ❗❗NOTE❗❗ When deleting Gitaly instances, you need to add the ~approve-policies label and comment atlantis approve_policies in the MR to bypass the policies before applying with atlantis apply.
Remove any data disk snapshots that were taken prior to restoring the nodes. These will not be removed automatically.

Wrapping up

Notify the @release-managers and @sre-oncall that the exercise is complete.

❗❗NOTE❗❗ Ensure all unused Gitaly nodes have been deleted prior to signaling completion.
Set label change::complete: /label ~change::complete
Compile the real-time measurements of this process and update the Recovery Measurements Runbook.

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

If, for any reason, a rollback is required, skip to the end and revert the MR to remove the VMs.

Monitoring

Key metrics to observe

Completed Chef runs: Staging | Production
Gitaly Dashboard: Staging | Production
Gitaly RPS by FQDN: Staging | Production
Gitaly Errors by FQDN: Staging | Production

Change Reviewer checklist

C4 C3 C2 C1:

Check if the following applies:
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

Check if the following applies:
- The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
  - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
- The change window has been agreed upon with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
- The labels blocks deployments and/or blocks feature-flags are applied as necessary.

Change Technician checklist

Check if all items below are complete:
- The change plan is technically accurate.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Change has been tested in staging and results are noted in a comment on this issue.
- A dry-run has been conducted and results are noted in a comment on this issue.
- The change execution window respects the Production Change Lock periods.
- For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
- For C1 and C2 change issues, the SRE on-call has been informed before the change is rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
- For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
- Release managers have been informed prior to any C1, C2, or blocks deployments change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgement.)
- There are currently no active incidents that are severity::1 or severity::2.
- If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.

Edited Oct 21, 2024 by Cameron McFarland