# Move Ops instance to Kubernetes in us-central region

Production Change

## Change Summary
We are moving the ops instance from us-east (where production is) to us-central in order to provide some geographical redundancy in case of a problem in us-east. At the same time, we are migrating the ops instance from a single VM without redundancy to a Kubernetes deployment.
Since we can't have two instances running at the same time with the same data, due to all of the automation that is constantly running, we cannot properly test an export/import migration. There is also the problem of having two independent instances managing data in the object storage buckets. The only way to test whether all functions of the Kubernetes instance work the same as the VM instance is to run both at the same time and compare. To accomplish this, we can run the Kubernetes instance pointed at the same Postgres, Gitaly, and Redis as the VM, without interrupting normal operation of the VM instance. This effectively makes it behave like one instance with multiple copies of most components. Once all of the configuration and functionality is validated, we can switch over by moving those services to the already configured Kubernetes instance.
Ideally, we would move the VM instance to us-central before connecting anything in us-central to it. This would avoid sending latency-sensitive traffic between sites and incurring the ~30ms cross-region latency penalty. Since we have found nothing definitive saying that things will break if we do this, we are going to proceed without moving the VM first. There will be no real user traffic running cross-site, so any failures will be limited to the tests. If any of those tests fail due to the added latency, we will fall back to the backup plan of moving the VM.
Once both instances are running together and functioning identically, we will cut over by making the us-central copies of the database, Redis, and Gitaly the master versions, and pointing DNS at the us-central instance's ingress.
## Change Details
- Services Impacted - Ops Instance
- Change Technician - @ayeung
- Change Reviewer - @devin @gsgl
- Time tracking - Time, in minutes, needed to execute all change steps, including rollback
- Downtime Component - YES (when cutting over to k8s)
## Detailed steps for the change

### Pre Change Steps - steps to prepare for the change
- [ ] Spin down and remove the standalone CloudSQL instance in us-central
- [ ] Create a CloudSQL replica in us-central. This will be done by hand, since the Terraform module does not support multi-region replicas (to be imported into TF later)
- [ ] Verify that we have no firewall or network customizations in us-east that are not in us-central
- [ ] Execute the CR to open the Gitaly and Redis ports on the VM: #8599 (closed)
- [ ] Copy the config of the existing instance to helm values - `helm template` should not error (see the validation sketch after this list)
- [ ] Copy secrets from the config file to Vault and set up plumbing to get them into the helm values
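As a sanity check before anything is deployed, the copied values can be rendered locally. A minimal sketch, assuming the upstream `gitlab/gitlab` chart and the copied values in a local `ops-values.yaml` (both names are placeholders for this change's actual chart and values file):

```shell
# Render the chart offline; any template error surfaces here
# rather than during a live helmfile apply.
helm repo add gitlab https://charts.gitlab.io/ && helm repo update
helm template gitlab gitlab/gitlab -f ops-values.yaml > /dev/null \
  && echo "templates rendered cleanly"
```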
### Connect and Transition new instance

- [ ] Disable all services and components in the helm chart
- [ ] Ensure that no cron jobs are defined in the chart values, as we don't want them triggering until we've migrated to k8s.
- [ ] Connect monitoring and logging to the Kubernetes instance and verify that we have at least as much coverage as we had for the VM instance.
- [ ] Configure the helm chart to point at the live Cloud SQL instance, and at Redis running on the VM.
- [ ] Enable the following services (these can be done before the migration):
  - Webservice and API
  - Registry
  - KAS
  - Shell
For each service listed above:

- [ ] Enable the service in the helm chart
- [ ] Compare all options related to this service between the helm values and the VM's config file and make them match. Use the commands documented here: delivery#1065 (comment 629665704)
- [ ] Debug until, when forcing your laptop to use the Kubernetes ingress as the endpoint instead of the ops endpoint, everything is indistinguishable from using the VM as the endpoint (see the hosts-file sketch after this list).
- [ ] Iterate on everything that doesn't match
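One way to force a single workstation onto the Kubernetes ingress is a hosts-file override. A minimal sketch - the IP shown is a placeholder for the actual ingress address:

```shell
# Point ops.gitlab.net at the Kubernetes ingress for this machine only.
# 203.0.113.10 is a documentation placeholder -- substitute the real ingress IP.
echo "203.0.113.10 ops.gitlab.net registry.ops.gitlab.net" | sudo tee -a /etc/hosts

# Confirm which address is actually being hit before comparing behaviour.
curl -sv -o /dev/null https://ops.gitlab.net/users/sign_in 2>&1 | grep "Connected to"

# Remove the override once testing is done.
```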
### Services not getting moved

These services will not be enabled or moved:

- Postgres (we’ll use Cloud SQL)
- MinIO (we’ll use GCS)
- Mattermost (we don’t use it)
- Certmanager (from the default chart - we use our own chart for it)
- Prometheus and Grafana (from the default chart - we use the exporter in the chart and scrape it from our existing Prometheus instead)
- Pages (we don’t use it)
- Praefect (we don’t use it)
- [ ] Verify that KAS is working by checking the logs for any errors, as described in https://docs.gitlab.com/ee/user/clusters/agent/troubleshooting.html (see the log-check sketch below)
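A quick way to scan for error-level entries. A sketch, assuming the KAS pods carry the chart's `app=kas` label, live in the `gitlab` namespace, and log JSON with a `level` field (all assumptions to verify against the actual deployment):

```shell
# Surface recent error-level KAS log lines, if any.
kubectl logs -n gitlab -l app=kas --tail=500 \
  | grep -i '"level":"error"' || echo "no errors in the last 500 lines"
```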
### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- [ ] Ping `@release-managers` in Slack at 12:00 UTC on Friday 28th April about disabling auto deploys: `/chatops run auto_deploy pause`. See Slack thread: https://gitlab.slack.com/archives/C8PKBH3M5/p1682683200999699
- [ ] Take a snapshot of the boot & data disks (see the verification sketch below):
  - [ ] boot disk: `gcloud compute snapshots create boot-disk-ops-gitlab-net-pre-k8s-migration --project=gitlab-ops --source-disk=gitlab-01-inf-ops --source-disk-zone=us-east1-c --guest-flush --storage-location=us-east1`
  - [ ] data disk: `gcloud compute snapshots create data-disk-ops-gitlab-net-pre-k8s-migration --project=gitlab-ops --source-disk=gitlab-01-inf-ops-data --source-disk-zone=us-east1-c --guest-flush --storage-location=us-east1`
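  A sketch to confirm both snapshots completed (the filter matches the names used above):

  ```shell
  # Both snapshots should show status READY before proceeding.
  gcloud compute snapshots list --project=gitlab-ops \
    --filter="name ~ pre-k8s-migration" \
    --format="table(name,status,diskSizeGb)"
  ```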
- [ ] Take an on-demand backup of the DB: https://console.cloud.google.com/sql/instances/gitlab-ops-0f01/backups?project=gitlab-ops
- [ ] Set label: `/label ~"blocks deployments"`
- [ ] Set label: `/label ~"blocks feature-flags"`
- [ ] Set label ~change::in-progress: `/label ~change::in-progress`
- [ ] Verify one final time that both instances are still running the same version
- [ ] Enable Gitaly on k8s and configure the instance to use its own Gitaly: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!2046 (merged)
- [ ] Manually modify the `ops-gitlab-gitaly` `ConfigMap` to use both the internal pod and the VM for Gitaly. Something like the following should be added under the templating for storages (this is not supported by the helm chart, but we only need it while initially moving the data, so it's OK if it gets overwritten when we are finished):

  ```toml
  [[storage]]
  name = "ops-central"
  path = "/home/git/repositories"
  ```

  There should then be 2 storages listed when viewing the list of Gitaly servers on the cluster side: `default`, which will point to port `8075` of the Gitaly pod, and `ops-central`, which will point to the exposed IP:port of the Gitaly service in the cluster. In practice they're the same thing.

  This step will need to be done multiple times, as the `ConfigMap` will be overwritten every time a Gitaly config change is made. Restart the Gitaly `StatefulSet` after changing it.
- [ ] Reconfigure the VM to add `ops-central` as a Gitaly server:
  - [ ] Verify that Gitaly is listening on port 8075: `ss -plnt | grep gitaly`
  - [ ] Add the following to `/etc/gitlab/gitlab.rb`:

    ```ruby
    git_data_dirs({
      "default" => { 'gitaly_address' => 'tcp://10.250.4.5:8075' },
      "ops-central" => { 'gitaly_address' => 'tcp://10.253.7.32:8075' }
    })
    ```

  - [ ] Run `gitlab-ctl reconfigure` after saving.
  - [ ] Verify that the list of Gitaly servers on the VM contains 2 entries matching the config above.
- [ ] Stop all services on the VM except for Gitaly and Redis, so that the instance doesn't write any more data:

  ```shell
  gitlab-ctl stop alertmanager
  gitlab-ctl stop gitlab-exporter
  gitlab-ctl stop gitlab-kas
  gitlab-ctl stop gitlab-workhorse
  gitlab-ctl stop logrotate
  gitlab-ctl stop nginx
  gitlab-ctl stop prometheus
  gitlab-ctl stop puma
  gitlab-ctl stop redis-exporter
  gitlab-ctl stop registry
  gitlab-ctl stop sidekiq
  ```
- [ ] Dump an up-to-date list of projects on the instance (see the row-count sketch below):
  - [ ] Connect to the database: `gcloud sql connect gitlab-ops-0f01 --user=gitlab --database=gitlabhq_production --quiet` (the password can be found in Vault)
  - [ ] Save the list of projects locally by running `\copy (select id,name,updated_at from projects) to ~/projects.csv with csv delimiter ',';`
  - [ ] Check that the resulting `projects.csv` contains the same number of rows as what's in the DB
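  A sketch of the row-count comparison:

  ```shell
  # In the psql session:   SELECT COUNT(*) FROM projects;
  # Locally -- \copy writes no header row, so the line count should equal the DB count:
  wc -l ~/projects.csv
  ```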
- [ ] Move repository data from the Gitaly storage on the VM (`default`) to the one in K8s (`ops-central`) (see the API sketch below):
  - [ ] Populate the variables and uncomment the `curl` command in this script: https://gitlab.com/gitlab-com/gl-infra/reliability/-/snippets/2527331
  - [ ] Ensure that `ops.gitlab.net` is resolving to the IP of the original VM (i.e. a CloudFlare address, like `172.65.19.90`). Check your hosts file!
  - [ ] Run the script. This may take a while! Do not proceed to turning off Gitaly until all projects have been migrated to the `ops-central` shard (see the next step)
  - [ ] Verify that all projects are now located on the `ops-central` Gitaly storage: `SELECT COUNT(pr.id), sh.name FROM project_repositories pr, shards sh WHERE pr.shard_id = sh.id GROUP BY sh.name;`
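  The snippet isn't reproduced here, but the underlying mechanism is GitLab's repository storage moves API. A sketch of the kind of call involved (the token variable is a placeholder; the storage names match this change):

  ```shell
  # Schedule asynchronous moves for every project on "default" to "ops-central".
  # Requires an admin token; progress can be followed with a GET on the same path.
  curl --request POST \
    --header "PRIVATE-TOKEN: ${GITLAB_ADMIN_TOKEN}" \
    --header "Content-Type: application/json" \
    --data '{"source_storage_name": "default", "destination_storage_name": "ops-central"}' \
    "https://ops.gitlab.net/api/v4/project_repository_storage_moves"
  ```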
- [ ] Stop Gitaly on the VM: `gitlab-ctl stop gitaly`
- [ ] Migrate Redis data (see the key-count sketch below):
  - [ ] Ensure all services except Redis have been stopped on the VM: `gitlab-ctl status`
  - [ ] Ensure Redis in Kubernetes is running in cluster mode (1 master with 2 slaves) and is exposed via an internal load balancer: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!2050 (merged)
  - [ ] On the VM, run `gitlab-redis-cli`, then `info`. Note down the number of keys present
  - [ ] On the VM, run the following command: `gitlab-redis-cli --scan | xargs gitlab-redis-cli MIGRATE <k8s redis IP> 6379 "" 0 5000 auth <password> COPY REPLACE KEYS`
  - [ ] On the Redis master pod in the cluster, run `redis-cli`, then `info`. The number of keys should match what was present on the VM before
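  A sketch for reading just the key counts on both sides (the pod name and namespace are placeholders):

  ```shell
  # On the VM -- prints e.g. "db0:keys=12345,expires=...":
  gitlab-redis-cli info keyspace

  # On the Redis master pod in the cluster:
  kubectl exec -n gitlab ops-redis-node-0 -c redis -- \
    redis-cli -a "${REDIS_PASSWORD}" info keyspace
  ```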
- [ ] Configure the us-central instance to use the local Redis: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!2051 (merged)
- [ ] Enable Sidekiq on the cluster: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!2052 (merged)
- [ ] Restart the Gitaly `StatefulSet` on the cluster and re-insert the `ConfigMap` entry for `ops-central`.
- [ ] Move repository data on the cluster from `ops-central` to `default`:
  - [ ] Modify the script so that it calls the APIs on the cluster instead and the destination storage is `default`.
  - [ ] Make sure you have an entry in your hosts file resolving `ops.gitlab.net` to the new IP
  - [ ] Run the script
  - [ ] Verify in the DB that all projects are back on the `default` shard: `SELECT COUNT(pr.id), sh.name FROM project_repositories pr, shards sh WHERE pr.shard_id = sh.id GROUP BY sh.name;`
- [ ] Remove the `ops-central` storage from the Gitaly config: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!2053 (merged)
- [ ] Promote the `ops-central` read replica: https://console.cloud.google.com/sql/instances?project=gitlab-ops
- [ ] Update `global.psql.host` in the chart values to point at the `us-central1` instance (`10.16.1.5`) - MR: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!2064 (merged)
- [ ] Update the following DNS records to point to the new load balancers: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/5614
  - [ ] ops.gitlab.net
  - [ ] registry.ops.gitlab.net
- [ ] Enable cronjobs, migrations, and other miscellaneous things that we disabled for the move: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!2054 (merged)
  - [ ] Ensure it deploys correctly from ops.gitlab.net -- this tests that repository mirroring and CI pipelines are working.
- [ ] Set label ~change::complete: `/label ~change::complete`
### Post change steps
- [ ] Re-enable auto deploys by running `/chatops run auto_deploy unpause` in `#production`.
  - This should kick off deploys automatically. Monitor the next deployment in `#announcements`. Reach out to `#g_delivery` if deploys fail.
- [ ] Decommission the VM
- [ ] Move the standalone runner VMs
- [ ] Create a Terraform cleanup MR to move resources such as storage buckets to more sensible places and clean up any resources that are no longer necessary.
- [ ] TF state surgery for Cloud SQL:

  ```shell
  cd config-mgmt/environments/ops
  tf state rm 'module.ops-db.google_sql_database_instance.default[0]'
  tf import 'module.ops-db.google_sql_database_instance.default[0]' projects/gitlab-ops/instances/ops-central
  tf state rm 'module.ops-db.google_sql_database.default[0]'
  tf import 'module.ops-db.google_sql_database.default[0]' projects/gitlab-ops/instances/ops-central/databases/default
  tf state rm 'module.ops-db.google_sql_user.default[0]'
  tf import 'module.ops-db.google_sql_user.default[0]' gitlab-ops/ops-central/default
  tf state rm 'module.ops-db.google_sql_database_instance.replicas[0]'
  ```
- [ ] Perform state surgery to update `keepers` in the random_password module (update `gitlab-0f01` to `ops-central`) and `tf state push` the change (see the sketch below)
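  A sketch of one way to do the keeper surgery via state pull/push (the file name is illustrative, and the exact resource address depends on the module):

  ```shell
  # Pull the state, edit the keeper, bump the serial, and push it back.
  tf state pull > state.json
  # In state.json: replace the random_password keeper value "gitlab-0f01"
  # with "ops-central", and increment the top-level "serial" so the push is accepted.
  tf state push state.json
  ```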
- [ ] Update `ops/main.tf` with the SQL DB changes: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/5715. This should not show any harmful changes in the TF report!
- [ ] Update `ops-db` and set `ipv4_enabled` to `false` (the DB doesn't need a public IP) - MR: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/5634
- [ ] Remove the ops instance IP from the authorized master access list: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/5651
- [ ] Celebrate
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- [ ] If there have been any changes to repository data, copy the Gitaly repository data from the PVC of the Gitaly pod back to the VM (see the sketch after this list)
- [ ] Configure both instances to use the VM Gitaly
- [ ] Fail over Redis to make the VM instance the master
- [ ] Configure both instances to use the VM Redis
- [ ] Fail over CloudSQL to make the us-east1 instance the master
- [ ] Configure both instances to use the us-east1 CloudSQL (`10.16.0.33`)
- [ ] Start the VM instance so that it starts accepting traffic
- [ ] Switch the DNS to point `ops.gitlab.net` and `registry.ops.gitlab.net` to the original VM instance's IP
- [ ] Set label ~change::aborted: `/label ~change::aborted`
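One way to copy the repository data back. A sketch, where the pod name, namespace, and SSH host are placeholders, and the paths assume the chart default (`/home/git/repositories`) and the omnibus default (`/var/opt/gitlab/git-data/repositories`):

```shell
# Stream the repository data out of the Gitaly pod and unpack it on the VM.
kubectl exec -n gitlab ops-gitlab-gitaly-0 -- \
  tar -C /home/git/repositories -cf - . \
  | ssh ops-vm 'sudo tar -C /var/opt/gitlab/git-data/repositories -xf -'
```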
## Monitoring

### Key metrics to observe
- Metric: Metric Name
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: Describe Changes
## Change Reviewer checklist

- [ ] Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
## Change Technician checklist

- [ ] Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - The change has been tested in staging and the results noted in a comment on this issue.
  - A dry-run has been conducted and the results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the ~eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the ~manager_approved label on the issue.
  - Release managers have been informed (if needed! Cases include DB changes) prior to the change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
  - There are currently no active incidents that are ~severity::1 or ~severity::2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.