# Switch to Ruby 3

## Production Change

### Change Summary

Upgrade `gitlab-rails` to Ruby 3.
This upgrade carries more risk than past Ruby upgrades:

- It is a major version release containing breaking changes. We do our best to detect these in CI, but some can always slip past us.
- We found two bugs in Ruby 3.0.x during this migration, which gives reason for caution. These bugs were fixed in subsequent Ruby releases, but we may find more issues once live.
- While performance testing in lab environments showed no obvious regressions (in fact, a slight improvement), we will not know how Ruby 3 performs at scale and under real user traffic until we go live with it.
- The impact of drift between the code base and what we run on production nodes could be more severe than in past updates: code written for Ruby 3 may not run on instances remaining on Ruby 2.
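To illustrate the drift risk above: Ruby 3 introduces syntax that Ruby 2 cannot parse at all, so a single merged change using it would break any node still running Ruby 2 at load time, before any code executes. A minimal sketch (the method name is illustrative):

```ruby
# "Endless" method definitions are new in Ruby 3.0.
# A file containing this line is a SyntaxError on Ruby 2 —
# it fails when the file is loaded, not when the method is called.
def double(x) = x * 2

puts double(21)
```

Keyword-argument separation (the other headline Ruby 3.0 breaking change) is subtler: that code parses on both versions but can behave differently, which is exactly the class of issue CI may not catch.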
Starting with my initial proposal in gitlab-org&5149 (comment 671192211), I think this is how we could approach a rollout that is both safe and easy to revert:
### Prerequisites

- Build images for Ruby 3 exist.
- CNG images for Ruby 3 exist.
- GitLab builds fine in CI with Ruby 3, and QA has not detected regressions.
- Initial performance test runs have not signaled any performance or resource-use regressions.

~"group::application performance", ~"group::scalability", and ~"group::delivery" own this work and make sure it is completed prior to the rollout.
### Change Details

- Services Impacted - ~"Service::Web" ~"Service::API" ~"Service::Git" ~"Service::Websockets" ~"Service::Sidekiq"
- Change Technician - (Release managers) EMEA: @nolith, AMER: @sabrams, APAC: @ggillies
- Change Reviewer - @skarbek, @jarv
- Time tracking - 12.5 hours
- Downtime Component - None

### Detailed steps for the change

Defined and discussed in delivery#2748 (closed).
#### Before the change

- [ ] Directly notify the EOC using the `@sre-oncall` alias and have the change approved by the EOC by obtaining the ~eoc_approved label on this Change Request issue.
#### Pre-Change Steps - steps to be completed on March 6, during APAC and AMER timezones

Estimated Time to Complete (mins) - 45 minutes

- [ ] (03:30 UTC) Run `/chatops run auto_deploy pause` in #inf_upcoming_release.
- [ ] Do not run post-deploy migrations. The last PDM should have been run on March 3, and the next PDM will be run on March 9, 48 hours after this change has been made to `gprd`.
- [ ] See the latest created package through to production. It should be done by around 10:30 UTC.
- [ ] Edit the rollback steps to include the last auto-deploy package that was rolled out to `gprd` successfully.
- [ ] (17:00 UTC) Notify MR authors and stakeholders that auto-deploy is now paused and the MRs are getting merged. Slack channel: #ruby3-rollout.
- [ ] Announce the start of the plan execution in the #production Slack channel and tag `@sre-oncall`.
- [ ] Set the label ~"change::in-progress" on this issue.
- [ ] Turn the `USE_NEXT_RUBY_VERSION_IN_AUTODEPLOY` CI variable to `true` to make the auto-deploy build with Ruby 3.
- [ ] Merge in the MR for gitlab and set the labels ~"Pick into auto-deploy" and ~"severity::1": gitlab-org/gitlab!113276 (merged).
- [ ] Check that this MR is the only MR with the ~"Pick into auto-deploy" label here.
- [ ] Manually trigger the `auto_deploy:pick` scheduled pipeline: https://ops.gitlab.net/gitlab-org/release/tools/-/pipelines/1772193
- [ ] Make sure that we only see the expected commits on the default branch.
- [ ] Enable the `auto_deploy_tag_latest` feature flag.
- [ ] Manually trigger the `auto_deploy:tag` scheduled pipeline: https://ops.gitlab.net/gitlab-org/release/tools/-/pipelines/1772209
- [ ] Disable the `auto_deploy_tag_latest` feature flag.
- [ ] Get ready to cancel the `prepare` job on the new auto-deploy pipeline:
  - [ ] Check that `wait:cng` and `wait:omnibus` are completed and successful.
  - [ ] Cancel the `prepare` job (and any gstg-cny/ref deploy jobs if they start).
  - [ ] Auto-deploy pipeline: https://ops.gitlab.net/gitlab-org/release/tools/-/pipelines/1772211
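The CI-variable toggles above (and their counterparts in the rollback plan) can also be performed via the GitLab REST API rather than the settings UI. A hypothetical sketch, assuming a token with API scope and the deployer project's numeric ID (both are placeholders, not values from this plan); it only builds the request without sending it:

```ruby
require "net/http"
require "uri"

# Placeholders — the real project ID and token are not in this document.
OPS_BASE   = "https://ops.gitlab.net/api/v4"
PROJECT_ID = "12345" # hypothetical numeric project ID of the deployer project
VARIABLE   = "USE_NEXT_RUBY_VERSION_IN_AUTODEPLOY"

# Build (but do not send) a PUT request for GitLab's
# `PUT /projects/:id/variables/:key` endpoint.
def build_set_variable_request(value, token)
  uri = URI("#{OPS_BASE}/projects/#{PROJECT_ID}/variables/#{VARIABLE}")
  req = Net::HTTP::Put.new(uri)
  req["PRIVATE-TOKEN"] = token
  req.set_form_data("value" => value)
  [uri, req]
end

uri, req = build_set_variable_request("true", "glpat-example")
puts "#{req.method} #{uri}"
# Actually sending it would be:
#   Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }
```

Deleting the variable in the rollback would use the same path with `Net::HTTP::Delete`; the UI remains the simpler option during the change window.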
#### Change Steps - starting at 09:00 UTC on March 7

Estimated Time to Complete (mins) - 840 minutes (14 hours)
- [ ] Add `CHANGE_LOCK_OVERRIDE` set to `true` to the CI variables: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/settings/ci_cd
- [ ] gstg-cny (09:00 UTC)
  - [ ] Restart the `prepare` job so that it will proceed with `deploy:gstg-cny` and `deploy:gstg-ref`. The deploy + smoke tests usually take just over an hour.
  - [ ] Cancel the `validate_ownership:gstg-cny` job so it will not start `deploy:gprd-cny`. Cancel the gprd-cny deploy if it automatically starts.
  - [ ] Bake for 1 hour. Check for signs of degradation, errors, and failing tests.
  - [ ] QA Full Suite Testing: notify the QA DRI (@grantyoung) to start the full test suite after the deploy and smoke/reliable auto-deploy jobs have completed successfully.
    - [ ] @grantyoung to review the QA pipeline results, investigate any failures, and provide approval if successful.
  - [ ] Dashboards: deployment health, pods health, cluster networking. More dashboards in the key metrics to observe section.
  - [ ] The Performance Application team DRI from the DRI timetable has given a green light in a comment on this change request to proceed with our deploy to `gprd-cny`.
- [ ] gprd-cny (11:15 UTC)
  - [ ] Start the deploy to `gprd-cny` by restarting the cancelled `validate_ownership:gstg-cny` job from above. ETA for the deploy and smoke tests is ~2 hours.
  - [ ] Cancel the `baking_time` job.
  - [ ] Bake for 2 hours. Check for signs of degradation, errors, and failing tests.
  - [ ] QA Full Suite Testing: notify the QA DRI (@grantyoung) to start the full test suite after the deploy and smoke/reliable auto-deploy jobs have completed successfully.
    - [ ] @grantyoung to review the QA pipeline results, investigate any failures, and provide approval if successful.
  - [ ] Dashboards: deployment health, pods health, cluster networking. More dashboards in the key metrics to observe section.
- [ ] Go/No Go decision: ensure both `gstg-cny` and `gprd-cny` are healthy before proceeding to promote to the `gstg` deploy.
  - [ ] SRE and Performance Application team DRIs from the DRI timetable have given a go/no-go in a comment on this change request to proceed with our deploy to `gstg`.
- [ ] gstg (~~15:15~~ 15:45 UTC)
  - [ ] Start the deploy to gstg by starting the `promote` job. ETA for the deploy is ~30 minutes.
  - [ ] After the `gstg` deployment is successful, cancel the `gprd-warmup` job to prevent any `gprd` deployment (production-promote/production-gitaly) jobs from getting started.
  - [ ] Bake for 1 hour. Check for signs of degradation, errors, and failing tests.
  - [ ] QA Full Suite Testing: notify the QA DRI (@grantyoung) to start the full test suite after the deploy and smoke/reliable auto-deploy jobs have completed successfully.
    - [ ] @dchevalier2 to review the QA pipeline results, investigate any failures, and provide approval if successful.
  - [ ] Dashboards: deployment health, pods health, cluster networking. More dashboards in the key metrics to observe section.
  - [ ] SRE and Performance Application team DRIs from the DRI timetable have given green lights in a comment on this change request to proceed with our deploy to `gprd`.
- [ ] gprd-gitaly (~~16:45~~ 18:00 UTC)
  - [ ] Obtain confirmation in #ruby3-rollout from the SRE and Application Performance DRIs that the `gprd` deploy can start.
  - [ ] Start the deploy to `gprd-gitaly` by restarting the cancelled `gprd-warmup` job above. ETA to complete is ~1 hour.
  - [ ] Cancel the `gprd-praefect` job.
  - [ ] Bake for 15 minutes. Check for signs of degradation, errors, and failing tests.
  - [ ] Dashboards: deployment health, gitaly overview. More dashboards in the key metrics to observe section.
  - [ ] SRE and Performance Application team DRIs from the DRI timetable have given green lights in a comment on this change request to proceed with our deploy to `gprd-praefect`.
- [ ] gprd-praefect (~~18:15~~ 20:15 UTC)
  - [ ] Start the deploy to `gprd-praefect` by restarting the cancelled `gprd-praefect` job above. ETA to complete is ~15 minutes.
  - [ ] Cancel the `gprd-prepare` and `gprd-kubernetes` (production-prepare and production-fleet) jobs. If `gprd-kubernetes` created another downstream pipeline (click into the job; it should have the URL of the triggered pipeline), make sure that no downstream jobs are running. Cancel them if they started.
  - [ ] Bake for 15 minutes. Check for signs of degradation, errors, and failing tests.
  - [ ] Dashboards: deployment health. More dashboards in the key metrics to observe section.
  - [ ] SRE and Performance Application team DRIs from the DRI timetable have given green lights in a comment on this change request to proceed with our deploy to `gprd-kubernetes`.
- [ ] gprd-kubernetes regional cluster (~~18:35~~ 20:35 UTC)
  - [ ] Start the deploy to the gprd-kubernetes regional cluster by restarting the cancelled `gprd-prepare` and `gprd-kubernetes` jobs above. If the cancelled `gprd-kubernetes` job created another downstream pipeline before cancellation (it should have the URL in the job), we want to use that initial downstream pipeline; the retry is then a no-op and only serves to make this job green. If the cancelled job did not get to create a downstream pipeline before the cancellation, this retry will create the downstream pipeline for us to use. ETA to complete is ~5 minutes.
  - [ ] In the downstream pipeline:
    - [ ] Cancel `gprd-us-east1-b:auto-deploy`.
    - [ ] Cancel `gprd-us-east1-c:auto-deploy`.
    - [ ] Cancel `gprd-us-east1-d:auto-deploy`.
    - [ ] Make sure that every job before the `gprd:deploy:alpha` stage is successful.
    - [ ] Make sure `gprd:auto_deploy` completes successfully. This makes changes to Sidekiq.
  - [ ] Bake for 30 minutes. Check for signs of degradation, errors, and failing tests. Roll back if we encounter any issues here.
  - [ ] Dashboards: Kubernetes compute resource cluster, pods health, cluster networking. More dashboards in the key metrics to observe section.
  - [ ] SRE and Performance Application team DRIs from the DRI timetable have given green lights in a comment on this change request to proceed with our deploy to the gprd zonal cluster `gprd-us-east1-b`.
- [ ] gprd zonal cluster gprd-us-east1-b (~~19:10~~ 21:25 UTC)
  - [ ] Deploy to the first zonal cluster by restarting the cancelled `gprd-us-east1-b:auto-deploy` job. This usually takes 10 minutes.
  - [ ] Bake for 3 hours. Check for signs of degradation, errors, and failing tests. Roll back if we encounter any issues here.
  - [ ] Dashboards: Kubernetes compute resource cluster, pods health, cluster networking. More dashboards in the key metrics to observe section.
  - [ ] SRE and Performance Application team DRIs from the DRI timetable have given green lights in a comment on this change request to proceed with our deploy to the next zonal cluster `gprd-us-east1-c`.
- [ ] gprd zonal cluster gprd-us-east1-c (~~22:20~~ 2022-03-08 00:40 UTC)
  - [ ] Deploy to the second zonal cluster by restarting the cancelled `gprd-us-east1-c:auto-deploy` job. This usually takes about 10 minutes.
  - [ ] Bake for 10 minutes. Check for signs of degradation, errors, and failing tests. Roll back if we encounter any issues here.
  - [ ] Dashboards: Kubernetes compute resource cluster, pods health, cluster networking. More dashboards in the key metrics to observe section.
  - [ ] SRE and Performance Application team DRIs from the DRI timetable have given green lights in a comment on this change request to proceed with our deploy to the last zonal cluster `gprd-us-east1-d`.
- [ ] gprd zonal cluster gprd-us-east1-d (~~22:40~~ 2022-03-08 01:02 UTC)
  - [ ] Deploy to the final zonal cluster by restarting the cancelled `gprd-us-east1-d:auto-deploy` job. This usually takes about 10 minutes.
  - [ ] Bake for 10 minutes. Check for signs of degradation, errors, and failing tests. Roll back if we encounter any issues here.
  - [ ] Dashboards: Kubernetes compute resource cluster, pods health, cluster networking. More dashboards in the key metrics to observe section.
- [ ] Deploy finished (~~23:00~~ 2022-03-08 01:26 UTC)
#### Post-Change Steps

Estimated Time to Complete (mins) - 1 minute

- [ ] Remove `CHANGE_LOCK_OVERRIDE` from the CI variables: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/settings/ci_cd
- [ ] Do not run post-deploy migrations. The last PDM should have been run on March 3, and the next PDM will be run on March 10, 48 hours after this change has been made to `gprd`.
- [ ] (2022-03-09 01:26 UTC) 24 hours after the deploy to `gprd`: re-evaluate our next PDM date (48 hours after the `gprd` deploy), as we cannot roll back after this PDM is run.
### Rollback

Rollback steps - steps to be taken in the event of a need to roll back this change

Estimated Time to Complete (mins) - 30 min-1 hr

The SRE or App Perf/Dev team DRI from the DRI timetable is responsible for making the decision to roll back. The following runbook is suited for all the environments: https://gitlab.com/gitlab-org/release/docs/-/blob/master/runbooks/rollback-a-deployment.md#rolling-back

- [ ] An SRE or App Perf/Dev team DRI has commented on this change request with the decision to roll back. Please note the list of environments that are getting rolled back.
- [ ] Roll back the auto-deploy package following this runbook for all affected/deployed environments.
  - The last auto-deploy package version that successfully got deployed to `gprd`: `15.10.202303060320-d244fd30a63.41707614427`
  - If production is being rolled back, make sure to drain canary: https://gitlab.com/gitlab-org/release/docs/-/blob/master/runbooks/rollback-a-deployment.md#3-perform-the-rollback
  - If gitaly & praefect are being rolled back: follow the optional steps for gitaly & praefect.
- [ ] Remove the `USE_NEXT_RUBY_VERSION_IN_AUTODEPLOY` CI variable to make the auto-deploy build with Ruby 2.
- [ ] Revert the merge requests performing the upgrade. Refer to the runbook on how to check MR deployment progress.
- [ ] Make sure the revert MRs are in the next auto-deploy package that gets deployed.
### Monitoring

#### Key metrics to observe

- Dashboards/metrics:
  - Monitor the following dashboards for unhealthy dips in service health for the environment/cluster that is being rolled out.
  - Deployment health, configurable with environment, stage, and type/service
  - Kubernetes compute resource/cluster health, configurable with clusters
  - Kubernetes compute resource/pods health, configurable with clusters and namespace
  - Kubernetes networking, configurable with clusters
  - Per-service dashboards (change `env` and `stage` to toggle between `gstg`/`gprd` and `main`/`cny`):
    - `api` (overview, containers)
    - `web` (overview, containers)
    - `websockets` (overview, containers)
    - `git` (overview, containers)
    - `sidekiq` (overview, containers)
  - Kibana - Puma (edit `json.type` to filter by service, `json.stage` for `cny` vs `main`)
  - Kibana - Sidekiq (edit `json.shard` to switch between job types)
  - Sentry
- QA runs can be observed via Slack:
  - #announcements - Besides QA messages, multiple messages are sent to this channel to account for the different deployments.
  - QA Slack channels - There is a channel per environment; for example, a failure on gstg or gstg-cny will be posted in #qa-staging, a failure on gprd-cny or gprd will be posted in #qa-production, etc.
- Dealing with deploy failures: https://gitlab.com/gitlab-org/release/docs/-/blob/master/general/deploy/failures.md
- Dedicated Slack channel for the migration: #ruby3-rollout
- DRI timetable for the teams involved in the rollout: delivery#2750 (closed)
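As a concrete example of the Kibana filters listed above, narrowing the Puma logs to a single service and stage can be done with a Lucene-style query in the search bar (the field names come from the list above; `web`/`cny` is just one combination):

```
json.type: "web" AND json.stage: "cny"
```

Swapping `"cny"` for `"main"` compares canary against the main stage, which is the quickest way to spot a regression introduced by the canary deploy.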
### Summary of infrastructure changes

- [-] Does this change introduce new compute instances? N/A
- [-] Does this change re-size any existing compute instances? N/A
- [-] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc.? N/A

No infrastructure changes during this change.
### Change Reviewer checklist

- [ ] Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
### Change Technician checklist

- [ ] Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - The change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the ~eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the ~manager_approved label on the issue.
  - Release managers have been informed (if needed; cases include DB changes) prior to the change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
  - There are currently no active ~severity::1 or ~severity::2 incidents.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.