# Move Ops instance to Kubernetes in us-central region

Production Change

## Change Summary
We are moving the ops instance from us-east (where production is) to us-central in order to provide some geographical redundancy in case of a problem in us-east. At the same time, we are migrating the ops instance from a single VM without redundancy to a Kubernetes deployment.
Since we can't have two instances running at the same time with the same data, due to all of the automation that is constantly running, we cannot properly test an export/import migration. There is also the problem of having two independent instances managing data in the object storage buckets. The only way to test whether all functions of the Kubernetes instance work the same as the VM instance is to run both at the same time and compare. To accomplish this, we can run the Kubernetes instance pointed at the same Postgres, Gitaly, and Redis as the VM, without interrupting normal operation of the VM instance. This effectively makes it behave like one instance with multiple copies of most components. Once all of the configuration and functionality is validated, we can switch over by moving those services to the already configured Kubernetes instance.
Ideally, we would move the VM instance to us-central before connecting anything in us-central to it. This would avoid sending latency-sensitive traffic between sites and incurring the ~30ms cross-region latency penalty. Since we have found nothing definitive saying that things will break if we do this, we are going to proceed without moving the VM first. There will be no real user traffic running cross-site, so any failures will be limited to the tests. If any of those tests fail due to the added latency, we will fall back to the backup plan of moving the VM.
Once both instances are running together and functioning identically, we will cut over by making the us-central copies of the database, Redis, and Gitaly the master versions, and pointing DNS at the us-central instance's ingress.
## Change Details
- Services Impacted - Ops Instance
- Change Technician - @ayeung
- Change Reviewer - @devin @gsgl
- Time tracking - Time, in minutes, needed to execute all change steps, including rollback
- Downtime Component - YES (when cutting over to k8s)
## Detailed steps for the change

### Pre Change Steps - steps to prepare for the change
- [ ] Spin down and remove the standalone CloudSQL instance in us-central
- [ ] Create a CloudSQL replica in us-central. This will be done by hand, since the Terraform module does not support multi-region replicas (to be imported into TF later)
- [ ] Verify that we have no firewall or network customizations in us-east that are not in us-central
- [ ] Execute the CR to open the Gitaly and Redis ports on the VM: #8599 (closed)
- [ ] Copy the config of the existing instance to helm values - `helm template` should not error (see the validation sketch after this list)
- [ ] Copy secrets from the config file to Vault and set up plumbing to get them into the helm values
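As a sanity check before anything is deployed, the copied values can be rendered locally. A minimal sketch, assuming the upstream `gitlab/gitlab` chart and the copied values in a local `ops-values.yaml` (both names are placeholders for this change's actual chart and values file):

```shell
# Render the chart offline; any template error surfaces here
# rather than during a live helmfile apply.
helm repo add gitlab https://charts.gitlab.io/ && helm repo update
helm template gitlab gitlab/gitlab -f ops-values.yaml > /dev/null \
  && echo "templates rendered cleanly"
```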
### Connect and Transition new instance

- [ ] Disable all services and components in the helm chart
- [ ] Ensure that no cron jobs are defined in the chart values, as we don't want them triggering until we've migrated to k8s.
- [ ] Connect monitoring and logging to the Kubernetes instance and verify that we have at least as much coverage as we had for the VM instance.
- [ ] Configure the helm chart to point at the live Cloud SQL instance, and at Redis running on the VM.
- [ ] Enable the following services (these can be done before the migration):
  - Webservice and API
  - Registry
  - KAS
  - Shell
For each service listed above:

- [ ] Enable the service in the helm chart
- [ ] Compare all options related to this service between the helm values and the VM's config file and make them match. Use the commands documented here: delivery#1065 (comment 629665704)
- [ ] Debug until, when forcing your laptop to use the Kubernetes ingress as the endpoint instead of the ops endpoint, everything is indistinguishable from using the VM as the endpoint (see the hosts-file sketch after this list).
- [ ] Iterate on everything that doesn't match
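One way to force a single workstation onto the Kubernetes ingress is a hosts-file override. A minimal sketch - the IP shown is a placeholder for the actual ingress address:

```shell
# Point ops.gitlab.net at the Kubernetes ingress for this machine only.
# 203.0.113.10 is a documentation placeholder -- substitute the real ingress IP.
echo "203.0.113.10 ops.gitlab.net registry.ops.gitlab.net" | sudo tee -a /etc/hosts

# Confirm which address is actually being hit before comparing behaviour.
curl -sv -o /dev/null https://ops.gitlab.net/users/sign_in 2>&1 | grep "Connected to"

# Remove the override once testing is done.
```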
### Services not getting moved

These services will not be enabled or moved:

- Postgres (we’ll use Cloud SQL)
- MinIO (we’ll use GCS)
- Mattermost (we don’t use it)
- Certmanager (from the default chart - we use our own chart for it)
- Prometheus and Grafana (from the default chart - we use the exporter in the chart and scrape it from our existing Prometheus instead)
- Pages (we don’t use it)
- Praefect (we don’t use it)
- [ ] Verify that KAS is working by checking the logs for any errors, as described in https://docs.gitlab.com/ee/user/clusters/agent/troubleshooting.html (see the log-check sketch below)
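A quick way to scan for error-level entries. A sketch, assuming the KAS pods carry the chart's `app=kas` label, live in the `gitlab` namespace, and log JSON with a `level` field (all assumptions to verify against the actual deployment):

```shell
# Surface recent error-level KAS log lines, if any.
kubectl logs -n gitlab -l app=kas --tail=500 \
  | grep -i '"level":"error"' || echo "no errors in the last 500 lines"
```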
### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- [ ] Ping `@release-managers` in Slack at 12:00 UTC on Friday 28th April about disabling auto deploys: `/chatops run auto_deploy pause`. See Slack thread: https://gitlab.slack.com/archives/C8PKBH3M5/p1682683200999699
- [ ] Take a snapshot of the boot & data disks (see the verification sketch below):
  - [ ] boot disk: `gcloud compute snapshots create boot-disk-ops-gitlab-net-pre-k8s-migration --project=gitlab-ops --source-disk=gitlab-01-inf-ops --source-disk-zone=us-east1-c --guest-flush --storage-location=us-east1`
  - [ ] data disk: `gcloud compute snapshots create data-disk-ops-gitlab-net-pre-k8s-migration --project=gitlab-ops --source-disk=gitlab-01-inf-ops-data --source-disk-zone=us-east1-c --guest-flush --storage-location=us-east1`
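  A sketch to confirm both snapshots completed (the filter matches the names used above):

  ```shell
  # Both snapshots should show status READY before proceeding.
  gcloud compute snapshots list --project=gitlab-ops \
    --filter="name ~ pre-k8s-migration" \
    --format="table(name,status,diskSizeGb)"
  ```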
- [ ] Take an on-demand backup of the DB: https://console.cloud.google.com/sql/instances/gitlab-ops-0f01/backups?project=gitlab-ops
- [ ] Set label: `/label ~"blocks deployments"`
- [ ] Set label: `/label ~"blocks feature-flags"`
- [ ] Set label ~change::in-progress: `/label ~change::in-progress`
- [ ] Verify one final time that both instances are still running the same version
- [ ] Enable Gitaly on k8s and configure the instance to use its own Gitaly: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!2046 (merged)
- [ ] Manually modify the `ops-gitlab-gitaly` `ConfigMap` to use both the internal pod and the VM for Gitaly. Something like the following should be added under the templating for storages (this is not supported by the helm chart, but we only need it while initially moving the data, so it's OK if it gets overwritten when we are finished):

  ```toml
  [[storage]]
  name = "ops-central"
  path = "/home/git/repositories"
  ```

  There should then be 2 storages listed when viewing the list of Gitaly servers on the cluster side: `default`, which will point to port `8075` of the Gitaly pod, and `ops-central`, which will point to the exposed IP:port of the Gitaly service in the cluster. In practice they're the same thing.

  This step will need to be done multiple times, as the `ConfigMap` will be overwritten every time a Gitaly config change is made. Restart the Gitaly `StatefulSet` after changing it.
- [ ] Reconfigure the VM to add `ops-central` as a Gitaly server:
  - [ ] Verify that Gitaly is listening on port 8075: `ss -plnt | grep gitaly`
  - [ ] Add the following to `/etc/gitlab/gitlab.rb`:

    ```ruby
    git_data_dirs({
      "default" => { 'gitaly_address' => 'tcp://10.250.4.5:8075' },
      "ops-central" => { 'gitaly_address' => 'tcp://10.253.7.32:8075' }
    })
    ```

  - [ ] Run `gitlab-ctl reconfigure` after saving.
  - [ ] Verify that the list of Gitaly servers on the VM contains 2 entries matching the config above.
- [ ] Stop all services on the VM except for Gitaly and Redis, so that the instance doesn't write any more data:

  ```shell
  gitlab-ctl stop alertmanager
  gitlab-ctl stop gitlab-exporter
  gitlab-ctl stop gitlab-kas
  gitlab-ctl stop gitlab-workhorse
  gitlab-ctl stop logrotate
  gitlab-ctl stop nginx
  gitlab-ctl stop prometheus
  gitlab-ctl stop puma
  gitlab-ctl stop redis-exporter
  gitlab-ctl stop registry
  gitlab-ctl stop sidekiq
  ```
- [ ] Dump an up-to-date list of projects on the instance (see the row-count sketch below):
  - [ ] Connect to the database: `gcloud sql connect gitlab-ops-0f01 --user=gitlab --database=gitlabhq_production --quiet` (the password can be found in Vault)
  - [ ] Save the list of projects locally by running `\copy (select id,name,updated_at from projects) to ~/projects.csv with csv delimiter ',';`
  - [ ] Check that the resulting `projects.csv` contains the same number of rows as what's in the DB
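  A sketch of the row-count comparison:

  ```shell
  # In the psql session:   SELECT COUNT(*) FROM projects;
  # Locally -- \copy writes no header row, so the line count should equal the DB count:
  wc -l ~/projects.csv
  ```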
- [ ] Move repository data from the Gitaly storage on the VM (`default`) to the one in K8s (`ops-central`) (see the API sketch below):
  - [ ] Populate the variables and uncomment the `curl` command in this script: https://gitlab.com/gitlab-com/gl-infra/reliability/-/snippets/2527331
  - [ ] Ensure that `ops.gitlab.net` is resolving to the IP of the original VM (i.e. a CloudFlare address, like `172.65.19.90`). Check your hosts file!
  - [ ] Run the script. This may take a while! Do not proceed to turning off Gitaly until all projects have been migrated to the `ops-central` shard (see the next step)
  - [ ] Verify that all projects are now located on the `ops-central` Gitaly storage: `SELECT COUNT(pr.id), sh.name FROM project_repositories pr, shards sh WHERE pr.shard_id = sh.id GROUP BY sh.name;`
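  The snippet isn't reproduced here, but the underlying mechanism is GitLab's repository storage moves API. A sketch of the kind of call involved (the token variable is a placeholder; the storage names match this change):

  ```shell
  # Schedule asynchronous moves for every project on "default" to "ops-central".
  # Requires an admin token; progress can be followed with a GET on the same path.
  curl --request POST \
    --header "PRIVATE-TOKEN: ${GITLAB_ADMIN_TOKEN}" \
    --header "Content-Type: application/json" \
    --data '{"source_storage_name": "default", "destination_storage_name": "ops-central"}' \
    "https://ops.gitlab.net/api/v4/project_repository_storage_moves"
  ```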
- [ ] Stop Gitaly on the VM: `gitlab-ctl stop gitaly`
- [ ] Migrate Redis data (see the key-count sketch below):
  - [ ] Ensure all services except Redis have been stopped on the VM: `gitlab-ctl status`
  - [ ] Ensure Redis in Kubernetes is running in cluster mode (1 master with 2 slaves) and is exposed via an internal load balancer: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!2050 (merged)
  - [ ] On the VM, run `gitlab-redis-cli`, then `info`. Note down the number of keys present
  - [ ] On the VM, run the following command: `gitlab-redis-cli --scan | xargs gitlab-redis-cli MIGRATE <k8s redis IP> 6379 "" 0 5000 auth <password> COPY REPLACE KEYS`
  - [ ] On the Redis master pod in the cluster, run `redis-cli`, then `info`. The number of keys should match what was present on the VM before
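  A sketch for reading just the key counts on both sides (the pod name and namespace are placeholders):

  ```shell
  # On the VM -- prints e.g. "db0:keys=12345,expires=...":
  gitlab-redis-cli info keyspace

  # On the Redis master pod in the cluster:
  kubectl exec -n gitlab ops-redis-node-0 -c redis -- \
    redis-cli -a "${REDIS_PASSWORD}" info keyspace
  ```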
- [ ] Configure the us-central instance to use the local Redis: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!2051 (merged)
- [ ] Enable Sidekiq on the cluster: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!2052 (merged)
- [ ] Restart the Gitaly `StatefulSet` on the cluster and re-insert the `ConfigMap` entry for `ops-central`.
- [ ] Move repository data on the cluster from `ops-central` to `default`:
  - [ ] Modify the script so that it calls the APIs on the cluster instead and the destination storage is `default`.
  - [ ] Make sure you have an entry in your hosts file resolving `ops.gitlab.net` to the new IP
  - [ ] Run the script
  - [ ] Verify in the DB that all projects are back on the `default` shard: `SELECT COUNT(pr.id), sh.name FROM project_repositories pr, shards sh WHERE pr.shard_id = sh.id GROUP BY sh.name;`
- [ ] Remove the `ops-central` storage from the Gitaly config: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!2053 (merged)
- [ ] Promote the `ops-central` read replica: https://console.cloud.google.com/sql/instances?project=gitlab-ops
- [ ] Update `global.psql.host` in the chart values to point at the `us-central1` instance (`10.16.1.5`) - MR: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!2064 (merged)
- [ ] Update the following DNS records to point to the new load balancers: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/5614
  - [ ] ops.gitlab.net
  - [ ] registry.ops.gitlab.net
- [ ] Enable cronjobs, migrations, and other miscellaneous things that we disabled for the move: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!2054 (merged)
  - [ ] Ensure it deploys correctly from ops.gitlab.net -- this tests that repository mirroring and CI pipelines are working.
- [ ] Set label ~change::complete: `/label ~change::complete`
### Post change steps
- [ ] Re-enable auto deploys by running `/chatops run auto_deploy unpause` in `#production`.
  - This should kick off deploys automatically. Monitor the next deployment in `#announcements`. Reach out to `#g_delivery` if deploys fail.
- [ ] Decommission the VM
- [ ] Move the standalone runner VMs
- [ ] Create a Terraform cleanup MR to move resources such as storage buckets to more sensible places and clean up any resources that are no longer necessary.
- [ ] TF state surgery for Cloud SQL:

  ```shell
  cd config-mgmt/environments/ops
  tf state rm 'module.ops-db.google_sql_database_instance.default[0]'
  tf import 'module.ops-db.google_sql_database_instance.default[0]' projects/gitlab-ops/instances/ops-central
  tf state rm 'module.ops-db.google_sql_database.default[0]'
  tf import 'module.ops-db.google_sql_database.default[0]' projects/gitlab-ops/instances/ops-central/databases/default
  tf state rm 'module.ops-db.google_sql_user.default[0]'
  tf import 'module.ops-db.google_sql_user.default[0]' gitlab-ops/ops-central/default
  tf state rm 'module.ops-db.google_sql_database_instance.replicas[0]'
  ```
- [ ] Perform state surgery to update `keepers` in the random_password module (update `gitlab-0f01` to `ops-central`) and `tf state push` the change (see the sketch below)
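  A sketch of one way to do the keeper surgery via state pull/push (the file name is illustrative, and the exact resource address depends on the module):

  ```shell
  # Pull the state, edit the keeper, bump the serial, and push it back.
  tf state pull > state.json
  # In state.json: replace the random_password keeper value "gitlab-0f01"
  # with "ops-central", and increment the top-level "serial" so the push is accepted.
  tf state push state.json
  ```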
- [ ] Update `ops/main.tf` with the SQL DB changes: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/5715. This should not show any harmful changes in the TF report!
- [ ] Update `ops-db` and set `ipv4_enabled` to `false` (the DB doesn't need a public IP) - MR: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/5634
- [ ] Remove the ops instance IP from the authorized master access list: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/5651
- [ ] Celebrate
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- [ ] If there have been any changes to repository data, copy the Gitaly repository data from the PVC of the Gitaly pod back to the VM (see the sketch after this list)
- [ ] Configure both instances to use the VM Gitaly
- [ ] Fail over Redis to make the VM instance the master
- [ ] Configure both instances to use the VM Redis
- [ ] Fail over CloudSQL to make the us-east1 instance the master
- [ ] Configure both instances to use the us-east1 CloudSQL (`10.16.0.33`)
- [ ] Start the VM instance so that it starts accepting traffic
- [ ] Switch the DNS to point `ops.gitlab.net` and `registry.ops.gitlab.net` to the original VM instance's IP
- [ ] Set label ~change::aborted: `/label ~change::aborted`
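One way to copy the repository data back. A sketch, where the pod name, namespace, and SSH host are placeholders, and the paths assume the chart default (`/home/git/repositories`) and the omnibus default (`/var/opt/gitlab/git-data/repositories`):

```shell
# Stream the repository data out of the Gitaly pod and unpack it on the VM.
kubectl exec -n gitlab ops-gitlab-gitaly-0 -- \
  tar -C /home/git/repositories -cf - . \
  | ssh ops-vm 'sudo tar -C /var/opt/gitlab/git-data/repositories -xf -'
```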
## Monitoring

### Key metrics to observe
- Metric: Metric Name
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: Describe Changes
## Change Reviewer checklist

- [ ] Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
## Change Technician checklist

- [ ] Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - The change has been tested in staging and the results noted in a comment on this issue.
  - A dry-run has been conducted and the results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the ~eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the ~manager_approved label on the issue.
  - Release managers have been informed (if needed! Cases include DB changes) prior to the change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
  - There are currently no active incidents that are ~severity::1 or ~severity::2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.