# Switch to Ruby 3

## Production Change

### Change Summary

Upgrade `gitlab-rails` to Ruby 3.
This upgrade carries more risk than past Ruby upgrades:

- It is a major version release containing breaking changes. We do our best to detect these in CI, but some can always slip past us.
- We found two bugs in Ruby 3.0.x during this migration, which gives reason for caution. These bugs were fixed in subsequent Ruby releases, but we may find more issues once live.
- While performance testing in lab environments showed no obvious regressions (in fact, a slight improvement), we will not know how Ruby 3 performs at scale and under real user traffic until we go live with it.
- The impact of drift between the code base and what we run on production nodes could be more severe than in past updates: code written for Ruby 3 may not run on instances remaining on Ruby 2.
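To illustrate the drift risk above: Ruby 3 introduces syntax that Ruby 2 cannot parse at all, so a single merged change using it would break any node still running Ruby 2 at load time, before any code executes. A minimal sketch (the method name is illustrative):

```ruby
# "Endless" method definitions are new in Ruby 3.0.
# A file containing this line is a SyntaxError on Ruby 2 —
# it fails when the file is loaded, not when the method is called.
def double(x) = x * 2

puts double(21)
```

Keyword-argument separation (the other headline Ruby 3.0 breaking change) is subtler: that code parses on both versions but can behave differently, which is exactly the class of issue CI may not catch.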
Starting with my initial proposal in gitlab-org&5149 (comment 671192211), I think this is how we could approach a rollout that is both safe and easy to revert:
### Prerequisites

- Build images for Ruby 3 exist.
- CNG images for Ruby 3 exist.
- GitLab builds fine in CI with Ruby 3, and QA has not detected regressions.
- Initial performance test runs have not signaled any performance or resource-use regressions.

~"group::application performance", ~"group::scalability", and ~"group::delivery" own this work and make sure it is completed prior to the rollout.
### Change Details

- Services Impacted - ~"Service::Web" ~"Service::API" ~"Service::Git" ~"Service::Websockets" ~"Service::Sidekiq"
- Change Technician - (Release managers) EMEA: @nolith, AMER: @sabrams, APAC: @ggillies
- Change Reviewer - @skarbek, @jarv
- Time tracking - 12.5 hours
- Downtime Component - None

### Detailed steps for the change

Defined and discussed in delivery#2748 (closed).
#### Before the change

- [ ] Directly notify the EOC using the `@sre-oncall` alias and have the change approved by the EOC by obtaining the ~eoc_approved label on this Change Request issue.
#### Pre-Change Steps - steps to be completed on March 6, during APAC and AMER timezones

Estimated Time to Complete (mins) - 45 minutes

- [ ] (03:30 UTC) Run `/chatops run auto_deploy pause` in #inf_upcoming_release.
- [ ] Do not run post-deploy migrations. The last PDM should have been run on March 3, and the next PDM will be run on March 9, 48 hours after this change has been made to `gprd`.
- [ ] See the latest created package through to production. It should be done by around 10:30 UTC.
- [ ] Edit the rollback steps to include the last auto-deploy package that was rolled out to `gprd` successfully.
- [ ] (17:00 UTC) Notify MR authors and stakeholders that auto-deploy is now paused and the MRs are getting merged. Slack channel: #ruby3-rollout.
- [ ] Announce the start of the plan execution in the #production Slack channel and tag `@sre-oncall`.
- [ ] Set the label ~"change::in-progress" on this issue.
- [ ] Turn the `USE_NEXT_RUBY_VERSION_IN_AUTODEPLOY` CI variable to `true` to make the auto-deploy build with Ruby 3.
- [ ] Merge in the MR for gitlab and set the labels ~"Pick into auto-deploy" and ~"severity::1": gitlab-org/gitlab!113276 (merged).
- [ ] Check that this MR is the only MR with the ~"Pick into auto-deploy" label here.
- [ ] Manually trigger the `auto_deploy:pick` scheduled pipeline: https://ops.gitlab.net/gitlab-org/release/tools/-/pipelines/1772193
- [ ] Make sure that we only see the expected commits on the default branch.
- [ ] Enable the `auto_deploy_tag_latest` feature flag.
- [ ] Manually trigger the `auto_deploy:tag` scheduled pipeline: https://ops.gitlab.net/gitlab-org/release/tools/-/pipelines/1772209
- [ ] Disable the `auto_deploy_tag_latest` feature flag.
- [ ] Get ready to cancel the `prepare` job on the new auto-deploy pipeline:
  - [ ] Check that `wait:cng` and `wait:omnibus` are completed and successful.
  - [ ] Cancel the `prepare` job (and any gstg-cny/ref deploy jobs if they start).
  - [ ] Auto-deploy pipeline: https://ops.gitlab.net/gitlab-org/release/tools/-/pipelines/1772211
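The CI-variable toggles above (and their counterparts in the rollback plan) can also be performed via the GitLab REST API rather than the settings UI. A hypothetical sketch, assuming a token with API scope and the deployer project's numeric ID (both are placeholders, not values from this plan); it only builds the request without sending it:

```ruby
require "net/http"
require "uri"

# Placeholders — the real project ID and token are not in this document.
OPS_BASE   = "https://ops.gitlab.net/api/v4"
PROJECT_ID = "12345" # hypothetical numeric project ID of the deployer project
VARIABLE   = "USE_NEXT_RUBY_VERSION_IN_AUTODEPLOY"

# Build (but do not send) a PUT request for GitLab's
# `PUT /projects/:id/variables/:key` endpoint.
def build_set_variable_request(value, token)
  uri = URI("#{OPS_BASE}/projects/#{PROJECT_ID}/variables/#{VARIABLE}")
  req = Net::HTTP::Put.new(uri)
  req["PRIVATE-TOKEN"] = token
  req.set_form_data("value" => value)
  [uri, req]
end

uri, req = build_set_variable_request("true", "glpat-example")
puts "#{req.method} #{uri}"
# Actually sending it would be:
#   Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }
```

Deleting the variable in the rollback would use the same path with `Net::HTTP::Delete`; the UI remains the simpler option during the change window.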
#### Change Steps - starting at 09:00 UTC on March 7

Estimated Time to Complete (mins) - 840 minutes (14 hours)
- [ ] Add `CHANGE_LOCK_OVERRIDE` set to `true` to the CI variables: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/settings/ci_cd
- [ ] gstg-cny (09:00 UTC)
  - [ ] Restart the `prepare` job so that it will proceed with `deploy:gstg-cny` and `deploy:gstg-ref`. The deploy + smoke tests usually take just over an hour.
  - [ ] Cancel the `validate_ownership:gstg-cny` job so it will not start `deploy:gprd-cny`. Cancel the gprd-cny deploy if it automatically starts.
  - [ ] Bake for 1 hour. Check for signs of degradation, errors, and failing tests.
  - [ ] QA Full Suite Testing: notify the QA DRI (@grantyoung) to start the full test suite after the deploy and smoke/reliable auto-deploy jobs have completed successfully.
    - [ ] @grantyoung to review the QA pipeline results, investigate any failures, and provide approval if successful.
  - [ ] Dashboards: deployment health, pods health, cluster networking. More dashboards in the key metrics to observe section.
  - [ ] The Performance Application team DRI from the DRI timetable has given a green light in a comment on this change request to proceed with our deploy to `gprd-cny`.
- [ ] gprd-cny (11:15 UTC)
  - [ ] Start the deploy to `gprd-cny` by restarting the cancelled `validate_ownership:gstg-cny` job from above. ETA for the deploy and smoke tests is ~2 hours.
  - [ ] Cancel the `baking_time` job.
  - [ ] Bake for 2 hours. Check for signs of degradation, errors, and failing tests.
  - [ ] QA Full Suite Testing: notify the QA DRI (@grantyoung) to start the full test suite after the deploy and smoke/reliable auto-deploy jobs have completed successfully.
    - [ ] @grantyoung to review the QA pipeline results, investigate any failures, and provide approval if successful.
  - [ ] Dashboards: deployment health, pods health, cluster networking. More dashboards in the key metrics to observe section.
- [ ] Go/No Go decision: ensure both `gstg-cny` and `gprd-cny` are healthy before proceeding to promote to the `gstg` deploy.
  - [ ] SRE and Performance Application team DRIs from the DRI timetable have given a go/no-go in a comment on this change request to proceed with our deploy to `gstg`.
- [ ] gstg (~~15:15~~ 15:45 UTC)
  - [ ] Start the deploy to gstg by starting the `promote` job. ETA for the deploy is ~30 minutes.
  - [ ] After the `gstg` deployment is successful, cancel the `gprd-warmup` job to prevent any `gprd` deployment (production-promote/production-gitaly) jobs from getting started.
  - [ ] Bake for 1 hour. Check for signs of degradation, errors, and failing tests.
  - [ ] QA Full Suite Testing: notify the QA DRI (@grantyoung) to start the full test suite after the deploy and smoke/reliable auto-deploy jobs have completed successfully.
    - [ ] @dchevalier2 to review the QA pipeline results, investigate any failures, and provide approval if successful.
  - [ ] Dashboards: deployment health, pods health, cluster networking. More dashboards in the key metrics to observe section.
  - [ ] SRE and Performance Application team DRIs from the DRI timetable have given green lights in a comment on this change request to proceed with our deploy to `gprd`.
- [ ] gprd-gitaly (~~16:45~~ 18:00 UTC)
  - [ ] Obtain confirmation in #ruby3-rollout from the SRE and Application Performance DRIs that the `gprd` deploy can start.
  - [ ] Start the deploy to `gprd-gitaly` by restarting the cancelled `gprd-warmup` job above. ETA to complete is ~1 hour.
  - [ ] Cancel the `gprd-praefect` job.
  - [ ] Bake for 15 minutes. Check for signs of degradation, errors, and failing tests.
  - [ ] Dashboards: deployment health, gitaly overview. More dashboards in the key metrics to observe section.
  - [ ] SRE and Performance Application team DRIs from the DRI timetable have given green lights in a comment on this change request to proceed with our deploy to `gprd-praefect`.
- [ ] gprd-praefect (~~18:15~~ 20:15 UTC)
  - [ ] Start the deploy to `gprd-praefect` by restarting the cancelled `gprd-praefect` job above. ETA to complete is ~15 minutes.
  - [ ] Cancel the `gprd-prepare` and `gprd-kubernetes` (production-prepare and production-fleet) jobs. If `gprd-kubernetes` created another downstream pipeline (click into the job; it should have the URL of the triggered pipeline), make sure that no downstream jobs are running. Cancel them if they started.
  - [ ] Bake for 15 minutes. Check for signs of degradation, errors, and failing tests.
  - [ ] Dashboards: deployment health. More dashboards in the key metrics to observe section.
  - [ ] SRE and Performance Application team DRIs from the DRI timetable have given green lights in a comment on this change request to proceed with our deploy to `gprd-kubernetes`.
- [ ] gprd-kubernetes regional cluster (~~18:35~~ 20:35 UTC)
  - [ ] Start the deploy to the gprd-kubernetes regional cluster by restarting the cancelled `gprd-prepare` and `gprd-kubernetes` jobs above. If the cancelled `gprd-kubernetes` job created another downstream pipeline before cancellation (it should have the URL in the job), we want to use that initial downstream pipeline; the retry is then a no-op and only serves to make this job green. If the cancelled job did not get to create a downstream pipeline before the cancellation, this retry will create the downstream pipeline for us to use. ETA to complete is ~5 minutes.
  - [ ] In the downstream pipeline:
    - [ ] Cancel `gprd-us-east1-b:auto-deploy`.
    - [ ] Cancel `gprd-us-east1-c:auto-deploy`.
    - [ ] Cancel `gprd-us-east1-d:auto-deploy`.
    - [ ] Make sure that every job before the `gprd:deploy:alpha` stage is successful.
    - [ ] Make sure `gprd:auto_deploy` completes successfully. This makes changes to Sidekiq.
  - [ ] Bake for 30 minutes. Check for signs of degradation, errors, and failing tests. Roll back if we encounter any issues here.
  - [ ] Dashboards: Kubernetes compute resource cluster, pods health, cluster networking. More dashboards in the key metrics to observe section.
  - [ ] SRE and Performance Application team DRIs from the DRI timetable have given green lights in a comment on this change request to proceed with our deploy to the gprd zonal cluster `gprd-us-east1-b`.
- [ ] gprd zonal cluster gprd-us-east1-b (~~19:10~~ 21:25 UTC)
  - [ ] Deploy to the first zonal cluster by restarting the cancelled `gprd-us-east1-b:auto-deploy` job. This usually takes 10 minutes.
  - [ ] Bake for 3 hours. Check for signs of degradation, errors, and failing tests. Roll back if we encounter any issues here.
  - [ ] Dashboards: Kubernetes compute resource cluster, pods health, cluster networking. More dashboards in the key metrics to observe section.
  - [ ] SRE and Performance Application team DRIs from the DRI timetable have given green lights in a comment on this change request to proceed with our deploy to the next zonal cluster `gprd-us-east1-c`.
- [ ] gprd zonal cluster gprd-us-east1-c (~~22:20~~ 2022-03-08 00:40 UTC)
  - [ ] Deploy to the second zonal cluster by restarting the cancelled `gprd-us-east1-c:auto-deploy` job. This usually takes about 10 minutes.
  - [ ] Bake for 10 minutes. Check for signs of degradation, errors, and failing tests. Roll back if we encounter any issues here.
  - [ ] Dashboards: Kubernetes compute resource cluster, pods health, cluster networking. More dashboards in the key metrics to observe section.
  - [ ] SRE and Performance Application team DRIs from the DRI timetable have given green lights in a comment on this change request to proceed with our deploy to the last zonal cluster `gprd-us-east1-d`.
- [ ] gprd zonal cluster gprd-us-east1-d (~~22:40~~ 2022-03-08 01:02 UTC)
  - [ ] Deploy to the final zonal cluster by restarting the cancelled `gprd-us-east1-d:auto-deploy` job. This usually takes about 10 minutes.
  - [ ] Bake for 10 minutes. Check for signs of degradation, errors, and failing tests. Roll back if we encounter any issues here.
  - [ ] Dashboards: Kubernetes compute resource cluster, pods health, cluster networking. More dashboards in the key metrics to observe section.
- [ ] Deploy finished (~~23:00~~ 2022-03-08 01:26 UTC)
#### Post-Change Steps

Estimated Time to Complete (mins) - 1 minute

- [ ] Remove `CHANGE_LOCK_OVERRIDE` from the CI variables: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/settings/ci_cd
- [ ] Do not run post-deploy migrations. The last PDM should have been run on March 3, and the next PDM will be run on March 10, 48 hours after this change has been made to `gprd`.
- [ ] (2022-03-09 01:26 UTC) 24 hours after the deploy to `gprd`: re-evaluate our next PDM date (48 hours after the `gprd` deploy), as we cannot roll back after this PDM is run.
### Rollback

Rollback steps - steps to be taken in the event of a need to roll back this change

Estimated Time to Complete (mins) - 30 min-1 hr

The SRE or App Perf/Dev team DRI from the DRI timetable is responsible for making the decision to roll back. The following runbook is suited for all the environments: https://gitlab.com/gitlab-org/release/docs/-/blob/master/runbooks/rollback-a-deployment.md#rolling-back

- [ ] An SRE or App Perf/Dev team DRI has commented on this change request with the decision to roll back. Please note the list of environments that are getting rolled back.
- [ ] Roll back the auto-deploy package following this runbook for all affected/deployed environments.
  - The last auto-deploy package version that successfully got deployed to `gprd`: `15.10.202303060320-d244fd30a63.41707614427`
  - If production is being rolled back, make sure to drain canary: https://gitlab.com/gitlab-org/release/docs/-/blob/master/runbooks/rollback-a-deployment.md#3-perform-the-rollback
  - If gitaly & praefect are being rolled back: follow the optional steps for gitaly & praefect.
- [ ] Remove the `USE_NEXT_RUBY_VERSION_IN_AUTODEPLOY` CI variable to make the auto-deploy build with Ruby 2.
- [ ] Revert the merge requests performing the upgrade. Refer to the runbook on how to check MR deployment progress.
- [ ] Make sure the revert MRs are in the next auto-deploy package that gets deployed.
### Monitoring

#### Key metrics to observe

- Dashboards/metrics:
  - Monitor the following dashboards for unhealthy dips in service health for the environment/cluster that is being rolled out.
  - Deployment health, configurable with environment, stage, and type/service
  - Kubernetes compute resource/cluster health, configurable with clusters
  - Kubernetes compute resource/pods health, configurable with clusters and namespace
  - Kubernetes networking, configurable with clusters
  - Per-service dashboards (change `env` and `stage` to toggle between `gstg`/`gprd` and `main`/`cny`):
    - `api` (overview, containers)
    - `web` (overview, containers)
    - `websockets` (overview, containers)
    - `git` (overview, containers)
    - `sidekiq` (overview, containers)
  - Kibana - Puma (edit `json.type` to filter by service, `json.stage` for `cny` vs `main`)
  - Kibana - Sidekiq (edit `json.shard` to switch between job types)
  - Sentry
- QA runs can be observed via Slack:
  - #announcements - Besides QA messages, multiple messages are sent to this channel to account for the different deployments.
  - QA Slack channels - There is a channel per environment; for example, a failure on gstg or gstg-cny will be posted in #qa-staging, a failure on gprd-cny or gprd will be posted in #qa-production, etc.
- Dealing with deploy failures: https://gitlab.com/gitlab-org/release/docs/-/blob/master/general/deploy/failures.md
- Dedicated Slack channel for the migration: #ruby3-rollout
- DRI timetable for the teams involved in the rollout: delivery#2750 (closed)
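As a concrete example of the Kibana filters listed above, narrowing the Puma logs to a single service and stage can be done with a Lucene-style query in the search bar (the field names come from the list above; `web`/`cny` is just one combination):

```
json.type: "web" AND json.stage: "cny"
```

Swapping `"cny"` for `"main"` compares canary against the main stage, which is the quickest way to spot a regression introduced by the canary deploy.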
### Summary of infrastructure changes

- [-] Does this change introduce new compute instances? N/A
- [-] Does this change re-size any existing compute instances? N/A
- [-] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc.? N/A

No infrastructure changes during this change.
### Change Reviewer checklist

- [ ] Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
### Change Technician checklist

- [ ] Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - The change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the ~eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the ~manager_approved label on the issue.
  - Release managers have been informed (if needed; cases include DB changes) prior to the change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
  - There are currently no active ~severity::1 or ~severity::2 incidents.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.