Define the Ruby 3 rollout strategy

As part of &865 (closed) / gitlab-org&5149 (closed) GitLab.com will be upgraded to Ruby 3.0. The purpose of this issue is to define the rollout plan for this upgrade.

Current Status

Exit criteria have been met
Moving on to documenting the rollout strategy on the change request in #2749 (closed)

Exit criteria

Expected timeline of the rollout strategy is defined: #2748 (closed)
Expected duration for the hard PCL is defined: 18 hours

Context

Due diligence is required before attempting any production rollout. A pre-check list, owned by ~"group::application performance", must be completed before this rollout.
The upgrade will be done via auto-deployments through the deployment in isolation technique.
Additionally, the upgrade should be performed at the beginning of the release cycle, which would allow us to fix any bugs that come up.
A rollback strategy will be planned for this rollout in #2749 (closed)
Given this is a sensitive change, a PCL may be scheduled for this upgrade. Evidence is required to schedule and approve it.

The rollout plan

production#5494 (comment 1076565173) is the first attempt to define a SaaS rollout strategy and can be considered a starting point.

The rollout strategy should define the deployment steps to upgrade GitLab.com to Ruby 3 and identify the possible risks when doing so.

Deployment overview

The rollout steps for each environment: staging-canary, production-canary, staging, and production.
Validations and performance tests that should be performed in each environment before moving on to the next one.
Rely on the auto-deploy pipeline, but the change technician/release manager will need to stop the job before it initiates the next deploy
For production-canary, the strategy should include:
- The time the auto-deploy package should sit before being promoted to production
- The metrics or performance tests to determine whether the deployment can continue to staging/production.
For staging, the strategy should define:
- The time the auto-deploy package should sit on staging. Staging should be deployed first, then production.
- Metrics and performance tests should be determined for this environment.
For production, the strategy should define:
- The time the deployment needs to sit on Gitaly and Praefect, before moving to gprd-kubernetes
- Once it reaches gprd-kubernetes, the time the deployment needs to sit in each cluster
- Metrics and performance tests to be performed on the initial cluster before allowing the deployment to move to another cluster.
Indicators:
- What would be the indicator(s) to verify whether the upgrade was successful?
  - QA tests passing as part of auto-deploy pipeline
  - No incidents related to ruby 3 issues
- What would be the indicator(s) to verify whether the upgrade was not successful and a rollback is required?
  - Unsuccessful deploy
  - Incidents and/or alerts caused by ruby 3
  - Failing metrics
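
The indicators above can be read as a simple decision rule per environment. The following is only a minimal sketch under stated assumptions: the `RolloutSignals` fields, the `rollout_decision` function, and the "hold" outcome for failing QA tests (investigate before proceeding) are hypothetical names and choices made for illustration, not part of any existing release tooling.

```python
from dataclasses import dataclass


@dataclass
class RolloutSignals:
    """Hypothetical snapshot of the indicators for one environment."""
    deploy_succeeded: bool   # the deploy job finished successfully
    qa_tests_passed: bool    # QA tests passed as part of the auto-deploy pipeline
    ruby3_incidents: int     # incidents/alerts attributed to Ruby 3
    failing_metrics: int     # metrics currently outside their expected range


def rollout_decision(signals: RolloutSignals) -> str:
    """Return 'rollback', 'hold', or 'proceed' based on the indicators.

    Assumption: failing QA tests lead to 'hold' (investigate first),
    while the three explicit rollback indicators lead to 'rollback'.
    """
    if (not signals.deploy_succeeded
            or signals.ruby3_incidents > 0
            or signals.failing_metrics > 0):
        return "rollback"
    if not signals.qa_tests_passed:
        return "hold"
    return "proceed"


if __name__ == "__main__":
    healthy = RolloutSignals(True, True, 0, 0)
    print(rollout_decision(healthy))
```

In practice the decision stays with the release manager and the teams watching the dashboards; the sketch only makes the listed criteria explicit.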

Risks

When planning the rollout strategy the following questions should be considered:

What would be the risk of executing change requests to gprd during the upgrade? What would be the risks of updating feature flags during the upgrade?
- Let's not toggle feature flags or execute CRs during this upgrade unless they are specifically for the Ruby 3 rollout. Too many moving pieces will make the upgrade process even more complicated. If we need to remedy an incident or alert related to the upgrade, unrelated feature flag or CR changes will complicate debugging.
What would be the risks of mixed-deployment testing during the upgrade?
- There are no risks from mixed-deployment testing, as long as no migration changes happen in the higher environments. As long as there is no explicit execution of post-deploy migrations, this does not seem to be an issue. Preventing migration changes will also make a rollback easier.

The above will guide us and offer evidence on whether a Production Change Lock (PCL) is required for the Ruby upgrade.

Deployment Steps (UTC times are estimates and will be finalized on the actual change request)

(1 day prior, during AMER timezone) Coordinate the time for merging. Merging is to be performed by a release manager.
(1 day prior, during AMER timezone) Deactivate auto-deploy tasks: /chatops run auto_deploy pause
(1 day prior, during AMER timezone) Follow the process for deploying risky MRs in isolation: https://gitlab.com/gitlab-org/release/docs/-/blob/master/general/deploy/deploying-risky-mrs-in-isolation.md#process-to-follow-for-deploying-risky-mrs-in-isolation
(1 day prior, during AMER timezone) Prepare the auto-deploy package. Once the auto-deploy pipeline is created in deployer pipelines, watch and cancel the prepare job, so it does not go forward with deploy:gstg-cny.
(Day of the deploy)(9:00 UTC) [gstg-cny] Restart/Start the prepare job so that it will proceed with deploy:gstg-cny.
(10:15 UTC) Once auto-deploy finishes deploying to gstg-cny and the coordinated:qa:staging-canary stage jobs complete successfully, cancel the validate_ownership:gstg-cny job. Cancel any jobs related to gprd-cny deployments that automatically start.
We are aiming to bake for 1 hour. In the meantime, check the following for signs of degradation, errors, and/or failing smoke/reliable tests.
- QA Full Suite Testing: Notify QA DRI to start the full suite tests manually after it's done deploying. It should take about 50 minutes. They will review QA pipeline results and investigate if there are any failures. More in comment #2748 (comment 1244007013)
- Dashboards: deployment health, pods health, cluster networking
- Slack channels: #staging, #qa-staging, #qa-staging-ref, #f_ruby3
Confirm with other teams that tests/alerts/metrics are not seeing issues for gstg-cny.
(11:15 UTC) [gprd-cny] Manually start gprd-cny deployments on the auto-deploy pipeline by restarting the job cancelled in the 10:15 UTC step.
(13:15 UTC) Once coordinated:qa:canary completes successfully, check the following for signs of degradation, errors, and/or failing smoke/reliable tests. We need to deem both canary environments healthy before continuing to the next steps of the deploy. Bake after deploy + smoke/reliable tests for 2 hours.
- QA Full Suite Testing: Notify QA DRI to start the full suite tests manually after it's done deploying. It should take about 50 minutes. They will review QA pipeline results and investigate if there are any failures. More in comment #2748 (comment 1244007013)
- Dashboards: deployment health, pods health
- Slack channels: #production, #qa-production, #f_ruby3
(15:15 UTC) [gstg] Once there is a green light to deploy to gstg, start the deploy to gstg manually by promoting the package in auto-deploy.
(15:45 UTC) When the gstg deploy completes successfully, cancel any gprd deploy jobs that get automatically started.
Check the following for signs of degradation, errors, and/or failing tests. Bake after deploy (no smoke/reliable as part of auto-deploy) for 1 hour.
- QA Full Suite Testing: Notify QA DRI to start the full suite tests manually after it's done deploying. It should take about 50 minutes. They will review QA pipeline results and investigate if there are any failures. More in comment #2748 (comment 1244007013)
- Other manual tests: TBD from outcomes of gitlab-org/gitlab#389563 (closed)
- Dashboards: deployment health, pods health, cluster networking
- Slack channels: #staging, #qa-staging, #f_ruby3
(16:45 UTC) [gprd] Once there is a green light to deploy to gprd, manually start the deploy:gprd trigger job
[gprd-gitaly] Deploy to gprd-gitaly first. Cancel any praefect and further stage jobs that automatically get started.
(18:00 UTC) Check the following for signs of degradation, errors, and/or failing tests for the gprd-gitaly deployment. Bake after deploy for 15 minutes.
- Dashboards: deployment health, gitaly overview
- Slack channels: #production, #qa-production, #f_ruby3
(18:15 UTC) [gprd-praefect] Manually start the next stage jobs for gprd-praefect. Cancel any jobs in the later stages (gprd-kubernetes)
(18:20 UTC) Check the following for signs of degradation, errors, and/or failing tests for the gprd-praefect deployment. Bake after deploy for 15 minutes.
- Dashboards: deployment health
- Slack channels: #production, #qa-production, #f_ruby3
(18:35 UTC) [gprd regional cluster] Deploy to the Kubernetes regional cluster by manually starting the gprd-kubernetes stage jobs, which should start the gprd:auto-deploy job, and cancel any zonal deploy jobs (gprd-us-east1-x:auto-deploy). This should only affect Sidekiq.
(18:40 UTC) Check the following for signs of degradation, errors, and/or failing tests for the gprd regional deployment. Bake after deploy for 30 minutes.
- Dashboards: Kubernetes compute resource cluster, pods health, cluster networking
- Slack channels: #production, #qa-production, #f_ruby3
(19:10 UTC) [gprd first zonal cluster] Deploy to zonal cluster, manually start the first zonal auto-deploy job (gprd-us-east1-b:auto-deploy)
(19:20 UTC) Check the following for signs of degradation, errors, and/or failing tests for the gprd single zone deployment. Bake after deploy for 30 minutes.
- Dashboards: Kubernetes compute resource cluster, pods health, cluster networking
- Slack channels: #production, #qa-production, #f_ruby3
(19:50 UTC) [gprd second zonal cluster] Deploy to the second zonal cluster (gprd-us-east1-c:auto-deploy). Cancel the third one (gprd-us-east1-d:auto-deploy).
(20:00 UTC) Check the following for signs of degradation, errors, and/or failing tests for the second zonal gprd deployment. Bake after deploy for 10 minutes.
- Dashboards: Kubernetes compute resource cluster, pods health, cluster networking
- Slack channels: #production, #qa-production, #f_ruby3
(20:10 UTC) [gprd third zonal cluster] Deploy to final/third zonal cluster (gprd-us-east1-d:auto-deploy).
(20:20 UTC) Check the following for signs of degradation, errors, and/or failing tests for the third zonal gprd deployment. Bake after deploy for 10 minutes.
- Dashboards: Kubernetes compute resource cluster, pods health, cluster networking
- Slack channels: #production, #qa-production, #f_ruby3
(20:30 UTC) Ruby 3 should now be successfully deployed. Resume auto-deploy tasks.
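
Since the UTC times are estimates, it can help to sanity-check that each bake period ends exactly when the next step is planned to start. The sketch below encodes the bake windows from the steps above and reports any that do not line up. The `BAKES` table layout and the `bake_mismatches` helper are hypothetical, introduced only for this illustration.

```python
from datetime import datetime, timedelta

# (environment, bake start UTC, bake minutes, planned start of next step UTC)
# Values are copied from the deployment steps; the table itself is illustrative.
BAKES = [
    ("gstg-cny", "10:15", 60, "11:15"),
    ("gprd-cny", "13:15", 120, "15:15"),
    ("gstg", "15:45", 60, "16:45"),
    ("gprd-gitaly", "18:00", 15, "18:15"),
    ("gprd-praefect", "18:20", 15, "18:35"),
    ("gprd regional", "18:40", 30, "19:10"),
    ("gprd-us-east1-b", "19:20", 30, "19:50"),
    ("gprd-us-east1-c", "20:00", 10, "20:10"),
    ("gprd-us-east1-d", "20:20", 10, "20:30"),
]


def parse_utc(hhmm: str) -> datetime:
    """Parse an HH:MM string; the date part is irrelevant for same-day checks."""
    return datetime.strptime(hhmm, "%H:%M")


def bake_mismatches(bakes):
    """Return environments whose bake end does not match the next step's start."""
    mismatched = []
    for env, start, minutes, next_start in bakes:
        if parse_utc(start) + timedelta(minutes=minutes) != parse_utc(next_start):
            mismatched.append(env)
    return mismatched


if __name__ == "__main__":
    print(bake_mismatches(BAKES))
```

If a time is adjusted on the actual change request, updating the table and re-running the check shows which downstream steps need to shift as well.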

Total Duration of the deploy

Following the above deploy plan, it should take ~11.5 hours on the day of the deploy.
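
As a quick check of that figure, the day-of-deploy window runs from the 09:00 UTC restart of the prepare job to the 20:30 UTC completion step. A minimal sketch of the arithmetic follows; `deploy_window_hours` is a hypothetical helper name.

```python
from datetime import datetime


def deploy_window_hours(start_utc: str, end_utc: str) -> float:
    """Return the length in hours of a same-day window given HH:MM strings."""
    start = datetime.strptime(start_utc, "%H:%M")
    end = datetime.strptime(end_utc, "%H:%M")
    return (end - start).total_seconds() / 3600


if __name__ == "__main__":
    # 09:00 UTC (gstg-cny start) to 20:30 UTC (rollout complete)
    print(deploy_window_hours("09:00", "20:30"))  # 11.5
```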

Edited Feb 09, 2023 by Jenny Kim