[gstg] Gitaly Zonal Outage Game Day

Production Change - Criticality 2 C2

Change Summary

This production issue is to be used for Gamedays as well as recovery in case of a zonal outage. It outlines the steps to be followed when restoring Gitaly VMs in a single zone.

Gameday execution roles and details

Role	Assignee
Change Technician	@ahanselka
Change Technician II	@cmcfarland

Services Impacted - ServiceGitaly
Time tracking - 90 minutes
Downtime Component - 30 minutes

Provide a brief summary indicating the affected zone

Production Outage

Perform these steps in the event of an outage in production

❗NOTE❗❗ These steps are intended to be run in the event that there is a zonal outage, and the set of Gitaly VMs in that zone are completely unavailable.

Preparation Tasks

Prepare merge requests
- MR that adds Gitaly servers within the available zone in config-mgmt
  - Example: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/7657
- MR to update application configuration in k8s-workloads/gitlab-com
  - Example: gitlab-com/gl-infra/k8s-workloads/gitlab-com!3260 (merged)
- MR to update application configuration in chef-repo
  - Example: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/4242
Notify the release managers on Slack by mentioning @release-managers and referencing this issue and await their acknowledgment.
Notify the EOC on Slack by mentioning @sre-oncall and referencing this issue and wait for approval by adding the eoc_approved label.
Post a notification in the production Slack channel
Ensure all merge requests have been rebased if necessary and approved.

Detailed steps for the change

Change Steps

Execution

Set label changein-progress /label ~change::in-progress
Merge the config-mgmt MR to provision the new Gitaly instances.

Validation Gitaly VMs

Wait for the instances to be built and Chef to converge.
- Confirm that chef runs have completed for new storages, it can take up to 30 minutes before they show up.
  Trouble shooting tips
  - Tail the serial output to confirm that the start up script executed successfully. gcloud compute --project=$project instances tail-serial-port-output $instance_name --zone=$zone --port=1 the variables $project represents the gitaly project e.g gitlab-gitaly-gprd-380a for the gitaly storages, $instance_name represent the instance e.g gitaly-01a-stor-gstg, and $zone represents the recovery zone e.g us-east1-c.
  - We could also tail bootstrap logs example: tail -f /var/tmp/bootstrap*.log
ssh into the Gitaly VMs:
- Execute sudo gitlab-ctl status to validate that the servers are up
- Validate that the data disk is properly mounted:
  - shell: mount | grep /opt/gitlab
  - Mimir
    
    ❗❗NOTE❗❗ The graph above is an example to offer guidance you may be required to change some parameters e.g. fqdn.

POINT OF NO RETURN

Add the new Storages to the Rails and Gitaly configuration

Merge the k8s-workload/gitlab-com MR
Merge the chef-repo MR

Validate that the new nodes are receiving traffic

https://dashboards.gitlab.net/d/gitaly-host-detail/gitaly3a-host-detail?orgId=1&var-PROMETHEUS_DS=PA258B30F88C30650&var-environment=gstg&var-fqdn=gitaly-01a-stor-gstg.c.gitlab-gitaly-gstg-164c.internal&viewPanel=1687078843
https://dashboards.gitlab.net/d/gitaly-host-detail/gitaly3a-host-detail?orgId=1&var-PROMETHEUS_DS=PA258B30F88C30650&var-environment=gstg&var-fqdn=gitaly-01a-stor-gstg.c.gitlab-gitaly-gstg-380a.internal&viewPanel=1687078843

❗❗NOTE❗❗ The dashboards above are examples to offer guidance you may be required to change some parameters e.g. fqdn.
Validate a project can be created on the new Storages.
- glsh gitaly storages validate -e gstg gitaly-01-stor-gstg.c.gitlab-gitaly-gstg-164c.internal
- glsh gitaly storages validate -e gstg gitaly-01-stor-gstg.c.gitlab-gitaly-gstg-380a.internal
❗❗NOTE❗❗ Remember to replace the hostnames the values used above are examples to offer guidance.

Planned Game Day (GSTG)

Preparation Tasks

The Week Prior

One week before the gameday make an announcement on slack production_engineering channel.

  Next week during the [Reliability discussions and firedrills](https://calendar.google.com/calendar/u/0/r/day/2023/12/20) meeting,
  we will be executing our quarterly gameday. The process will involve moving traffic away from a single zone in gstg, and moving 
  Gitaly nodes to a new zone. This should take approximately 90 minutes, the aim
  for this exercise is to test our disaster recovery capabilities and measure if we are still within our RTO & RPO targets set by the 
  [DR working group](https://handbook.gitlab.com/handbook/company/working-groups/disaster-recovery/) for GitLab.com.
  See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18091

Prepare merge requests
- MR that adds Gitaly servers within the available zone in config-mgmt
  - https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8557
    ❗❗IMPORTANT NOTE: ❗❗
    
    When testing the ability to restore production instances, this should be the only MR that gets created and merged. We cannot perform an application switch over to the new nodes during a game day in GPRD, therefore, stop at this step.
    
    In the above ⬆ example MR we did not edit the labels, this caused an incident which resulted in Deployment Failures. To avoid this remove the labels or add an override.
    labels = merge(var.labels["gitaly"], { type = "", stage = "", environment = "" }
- MR to update application configuration in k8s-workloads/gitlab-com
  - gitlab-com/gl-infra/k8s-workloads/gitlab-com!3625 (merged)
- MR to update application configuration in chef-repo
  - https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/4817

Day Of

Notify the release managers on Slack by mentioning @release-managers and referencing this issue and await their acknowledgment.

Notify the eoc on Slack by mentioning @sre-oncall and referencing this issue and wait for approval by adding the eoc_approved label.

Example message:

@release-managers or @sre-oncall https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18091 is scheduled for execution. 
We will be taking a single Gitaly node offline for approximately 30 minutes so as to avoid dataloss. 
This will only affect projects that are on that Gitaly VM, it won't be used for new projects.

Post a similar message to the #test-platform and #staging channels on slack.
Ensure all merge requests have been rebased if necessary and approved.

Detailed steps for the change

Change Steps

Execution

Consider starting a recording of the process now.
Set label changein-progress /label ~change::in-progress
Set the weights to zero on the affected storages.
- With an admin account, navigate to Repository Storage Settings and set the weight to 0 for the affected storages.
Stop the Gitaly nodes and create new snapshots.
- gcloud compute instances stop --project=gitlab-gitaly-gstg-380a --zone="us-east1-b" "gitaly-02a-stor-gstg"
- gcloud compute snapshots create "file-gitaly-02-gameday-snapshot-20240610" --source-disk="gitaly-02a-stor-gstg-data" --description="Part of https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18091" --project=gitlab-gitaly-gstg-380a --source-disk-zone="us-east1-b"
- gcloud compute instances stop --project=gitlab-gitaly-gstg-164c --zone="us-east1-b" "gitaly-02a-stor-gstg"
- gcloud compute snapshots create "file-gitaly-02-gameday-snapshot-20240610" --source-disk="gitaly-02a-stor-gstg-data" --description="Part of https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18091" --project=gitlab-gitaly-gstg-164c --source-disk-zone="us-east1-b"
Merge the config-mgmt MR to provision the new Gitaly instances.
- ❗❗ NOTE: ❗❗ This step above ⬆ can take up to 20 minutes to complete

Validation Gitaly VMs

Wait for the instances to be built and Chef to converge.
- Confirm that chef runs have completed for new storages, it can take upto 30 minutes before they show up.
  Trouble shooting tips
  - Tail the serial output to confirm that the start up script executed successfully. gcloud compute --project=$project instances tail-serial-port-output $instance_name --zone=$zone --port=1 the variables $project represents the gitaly project e.g gitaly-gstg-380a for the gitaly storages, $instance_name represent the instance e.g gitaly-01a-stor-gstg, and $zone represents the recovery zone e.g us-east1-c.
  - We could also tail bootstrap logs example: tail -f /var/tmp/bootstrap-20231108-133642.log
ssh into the Gitaly VMs:
- Execute sudo gitlab-ctl status to validate that the servers are up
- Validate that the data disk is properly mounted:
  - shell: mount | grep /opt/gitlab
  - Mimir
    
    ❗❗NOTE❗❗ The graph above is an example to offer guidance you may be required to change some parameters e.g. fqdn.

POINT OF NO RETURN

Add the new Storages to the Rails and Gitaly configuration

Merge the k8s-workload/gitlab-com MR
Merge the chef-repo MR

Validate that the new nodes are receiving traffic

https://dashboards.gitlab.net/d/gitaly-host-detail/gitaly3a-host-detail?orgId=1&var-PROMETHEUS_DS=PA258B30F88C30650&var-environment=gstg&var-fqdn=gitaly-02b-stor-gstg.c.gitlab-gitaly-gstg-164c.internal&viewPanel=1687078843
https://dashboards.gitlab.net/d/gitaly-host-detail/gitaly3a-host-detail?orgId=1&var-PROMETHEUS_DS=PA258B30F88C30650&var-environment=gstg&var-fqdn=gitaly-02b-stor-gstg.c.gitlab-gitaly-gstg-380a.internal&viewPanel=1687078843

❗❗NOTE❗❗ The dashboards above are examples to offer guidance you may be required to change some parameters e.g. fqdn.
Validate a project can be created on the new Storages.
- glsh gitaly storages validate -e gstg gitaly-02-stor-gstg.c.gitlab-gitaly-gstg-164c.internal
- glsh gitaly storages validate -e gstg gitaly-02-stor-gstg.c.gitlab-gitaly-gstg-380a.internal
❗❗NOTE❗❗ Remember to replace the hostnames the values used above are examples to offer guidance.

Clean up

Create and merge a MR to the config-mgmt repo, removing the now unused Gitaly instances

❗❗NOTE❗❗ when deleting Gitaly instances, you need to comment atlantis approve_policies in the MR to bypass the policies before applying with atlantis apply. You also need to add the MR number to the CONFTEST_APPROVE_POLICIES_MR CI/CD variable, and re-run the pipeline so that the conftest job succeeds.
Restore the weights of the affected storages.

Wrapping up

Notify the @release-managers and @sre-oncall that the exercise is complete.

❗❗NOTE❗❗ Ensure all unused Gitaly nodes have been deleted prior to signaling completion.
Set label changecomplete /label ~change::complete

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

If the point of no return has not been passed:

Monitoring

Key metrics to observe

Completed Chef runs: Staging | Production
Gitaly Dashboard: Staging | Production
Gitaly RPS by FQDN: Staging | Production
Gitaly Errors by FQDN: Staging | Production

Edited Jun 10, 2024 by Alex Hanselka