2025-12-03: Test Rollout of Ruby 3.3 to gstg-cny, gprd-cny and gstg-ref
# Production Change

### Change Summary

The plan is to test switching auto-deploys to use Ruby 3.3 and perform the deployment of a Ruby 3.3 package to gitlab.com staging-canary (gstg-cny) and production-canary (gprd-cny). During the change duration, auto-deploys will need to be paused.

### Change Details

1. **Services Impacted** - GitLab Rails, Sidekiq, and any other service that uses Ruby
2. **Change Technician** - `@jennykim-gitlab` `@dat.tang.gitlab`
3. **Change Reviewer** - @rpereira2
4. **Scheduled Date and Time (UTC in format YYYY-MM-DD HH:MM)** - 2025-12-03 11:00
5. **Time tracking** - 420 minutes
6. **Downtime Component** - none

### Set Maintenance Mode in GitLab

~~If your change involves scheduled maintenance, add a step to set and [unset maintenance mode](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/monitoring/set_maintenance_window.md) per our runbooks. This will make sure [SLA calculations adjust](https://gitlab.com/gitlab-com/runbooks/-/merge_requests/5887) for the maintenance period.~~

No need for setting Maintenance mode.

### Engineer Helping Monitor The Test Rollout

@igor.drozdov @bmarjanovic

## Preparation

> [!note]
>
> The following checklists must be done in advance, before setting the label ~"change::scheduled"

### Change Reviewer checklist

<!--To be filled out by the reviewer.-->

~8276990 ~8276981 ~8276978 ~8276976:

- [x] Check if the following applies:
  - The **scheduled day and time** of execution of the change is appropriate.
  - The [change plan](#detailed-steps-for-the-change) is technically accurate.
  - The change plan includes **estimated timing values** based on previous testing.
  - The change plan includes a viable [rollback plan](#rollback).
  - The specified [metrics/monitoring dashboards](#key-metrics-to-observe) provide sufficient visibility for the change.

~8276978 ~8276976:

- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change.
    (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.

### Change Technician checklist

<!--Search [the incident.io schedule](https://app.incident.io/gitlab/on-call/schedules/01K5YWAGZ7YCQGAG7ATQ9XQWHW) to find who will be on-call at the scheduled day and time. SREs on-call must be informed of weekend C1 changes at least 2 weeks in advance. You can also use the `@sre-oncall` handle in slack to find the current on-call team member.-->

- [x] The [Change Criticality](https://handbook.gitlab.com/handbook/engineering/infrastructure-platforms/change-management/#change-criticalities) has been set appropriately and requirements have been reviewed.
- [x] The [change plan](#detailed-steps-for-the-change) is technically accurate.
- [x] The [rollback plan](#rollback) is technically accurate and detailed enough to be executed by anyone with access.
- [x] This Change Issue is linked to the appropriate Issue and/or Epic
- [ ] ~~Change has been tested in staging and results noted in a comment on this issue.~~
- [ ] ~~A dry-run has been conducted and results noted in a comment on this issue.~~
- [x] Dry-run and testing on staging don't apply here because this is the test run: we apply the change to Canary for a short amount of time (6-7 hours).
- [x] The change execution window respects the [Production Change Lock periods](https://about.gitlab.com/handbook/engineering/infrastructure/change-management/#production-change-lock-pcl).
- [x] Once all boxes above are checked, mark the change request as scheduled: `/label ~"change::scheduled"`
- [ ] For ~8276976 and ~8276978 change issues, the change event is added to the [GitLab Production](https://calendar.google.com/calendar/embed?src=gitlab.com_si2ach70eb1j65cnu040m3alq0%40group.calendar.google.com) calendar by the [change-scheduler bot](https://gitlab.com/gitlab-com/gl-infra/ops-team/toolkit/change-scheduler). It is scheduled to run every 2 hours.
- [ ] For ~8276976 change issues, a Senior Infrastructure Manager has provided approval with the ~14866676 label on the issue.
- [ ] For ~8276978 change issues, an Infrastructure Manager provided approval with the ~14866676 label on the issue.
- [ ] For ~8276976 and ~8276978 changes, mention `@gitlab-org/saas-platforms/inframanagers` in this issue to provide visibility to all infrastructure managers.
- [x] For ~8276976, ~8276978, or ~"blocks deployments" change issues, confirm with Release managers that the change does not overlap or hinder any release process (In `#production` channel, mention `@release-managers` and this issue and await their acknowledgment.)

## Detailed steps for the change

### Pre-execution steps

> [!note]
>
> The following steps should be done right at the scheduled time of the change request. The [preparation steps](#preparation) are listed above.

- [x] Make sure all tasks in [Change Technician checklist](#change-technician-checklist) are done
- [ ] For ~8276976 and ~8276978 change issues, the SRE on-call has been informed prior to change being rolled out. (Check [the incident.io GitLab.com Production EOC schedule](https://app.incident.io/gitlab/on-call/schedules/01K5YWAGZ7YCQGAG7ATQ9XQWHW) to find who will be on-call at the scheduled day and time. SREs on-call must be informed of [plannable C1 changes](https://handbook.gitlab.com/handbook/engineering/infrastructure-platforms/change-management/#approval) at least 2 weeks in advance.)
  - [ ] The SRE on-call provided approval with the ~25771657 label on the issue.
- [x] For ~8276976, ~8276978, or ~"blocks deployments" change issues, Release managers have been informed prior to change being rolled out. (In `#production` channel, mention `@release-managers` and this issue and await their acknowledgment.)
- [x] There are currently no [active incidents](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/?sort=created_date&state=opened&label_name%5B%5D=Incident%3A%3AActive&or%5Blabel_name%5D%5B%5D=severity%3A%3A1&or%5Blabel_name%5D%5B%5D=severity%3A%3A2&first_page_size=20) that are ~3760139 or ~3760140
- [ ] If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.

### Change Steps - steps to take to execute the change

_Estimated Time to Complete (mins)_ - 420 minutes

- [x] Set label ~15057610 `/label ~change::in-progress`

#### Rollout

We will follow the steps documented under https://gitlab.com/gitlab-org/release/docs/-/blob/master/general/ruby-upgrades.md#test-rollouts to perform the test rollout

##### Pause auto deploys

- [x] Let `@sre-oncall` and `@release-managers` know in `#production` that the CR is starting and auto deployments will be paused.
- [x] Pause auto deploys: `/chatops run auto_deploy pause`

##### Start a deployment pipeline containing the new Ruby version

- [x] Check out either CNG or Omnibus repository on your local machine
- [x] Find the latest auto-deploy branch: `18-7-auto-deploy-2025120311`
- [x] Run `git commit --allow-empty -m "Empty commit to trigger a new auto-deploy pkg"` to add an extra commit to the Omnibus or CNG auto deploy branch.
  * [x] https://gitlab.com/gitlab-org/security/omnibus-gitlab/-/commit/4d60ff68445a61a0670276481a966b0ffb1ac64e
- [x] Trigger new packager pipelines (Omnibus + CNG) by running the `MANUAL auto-deploy pick&tag` inactive manual scheduled pipeline: https://ops.gitlab.net/gitlab-org/release/tools/-/pipeline_schedules/. Make a note of the tag of the created pipeline.
  * [x] https://ops.gitlab.net/gitlab-org/release/tools/-/pipelines/5265832
  * [x] Tag:
    * [x] CNG: 18.7.202512031113+b6c5d38b1d1
    * [x] Omnibus: 18.7.202512031113+b6c5d38b1d1.4d60ff68445
- [x] Temporarily gain release-manager permissions for sufficient privilege in the packager pipelines: https://gitlab.com/gitlab-org/release/docs/-/blob/master/release_manager/index.md#temporary-permissions
- [x] Cancel the Omnibus and CNG packager pipelines created for the tag noted in the previous step
  * https://dev.gitlab.org/gitlab/omnibus-gitlab/-/pipelines?scope=tags&page=1
  * https://dev.gitlab.org/gitlab/charts/components/images/-/pipelines?scope=tags&page=1
- [x] Make sure `NEXT_RUBY_VERSION` is updated to 3.3 in https://gitlab.com/gitlab-org/omnibus-gitlab/-/blob/master/config/software/ruby.rb. If not, please ping the Build team
  * [x] https://gitlab.com/gitlab-org/omnibus-gitlab/-/blob/dd1cfcb6f6f39dccd7c775de6c25f3a64dcd7009/config/software/ruby.rb#L31
- [x] Use the following links to start the new packager pipelines. Make sure that the tag is the same as the cancelled pipelines above.
  * https://dev.gitlab.org/gitlab/omnibus-gitlab/-/pipelines/new?var%5BUSE_NEXT_RUBY_VERSION_IN_AUTODEPLOY%5D=true
    * Result: https://dev.gitlab.org/gitlab/omnibus-gitlab/-/pipelines/413611
  * https://dev.gitlab.org/gitlab/charts/components/images/-/pipelines/new?var%5BUSE_NEXT_RUBY_VERSION_IN_AUTODEPLOY%5D=true
    * Result: https://dev.gitlab.org/gitlab/charts/components/images/-/pipelines/413610
  * [x] Start a new deployment pipeline for the above package with `/chatops run auto_deploy pipeline 18.7.202512031113-b6c5d38b1d1.4d60ff68445`

##### Rollout to gstg-cny

- [x] Notify Slack channels `#development`, `#backend`, `#frontend` and `#staging-ref` that the deployment has started.
- [x] Keep an eye on the auto-deploy pipeline and do not allow it to be deployed beyond staging-canary. Cancel the `validate_ownership:gstg-cny` job to prevent a deployment to gprd-cny from starting.
- [x] Notify [engineers helping with monitoring](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20888#engineer-helping-monitor-the-test-rollout) in `#f_ruby3` when the deployment to `gstg-cny` is done, so they can proceed with [monitoring](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20888#monitoring) `gstg-cny`. Ask them to write their findings in the [Monitoring thread](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20918#note_2930858956).

##### Baking in gstg-cny

- [x] Bake in gstg-cny until monitoring is over. We expect this to take about 2 hours.
- [x] Green light from engineers helping with monitoring to proceed with deployment in gprd-cny.

##### Rollout to gprd-cny

- [x] Unlock `gprd-cny`: `/chatops run deploy unlock gprd-cny`
- [x] Restart the previously cancelled auto-deploy job to proceed with the deployment to `gprd-cny`.
- [x] Notify [engineers helping with monitoring](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20888#engineer-helping-monitor-the-test-rollout) in `#f_ruby3` when the deployment to `gprd-cny` is done, so they can proceed with [monitoring](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20888#monitoring) `gprd-cny`. Ask them to write their findings in the [Monitoring thread](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20918#note_2930858956).

##### Baking in gprd-cny

- [x] Bake in gprd-cny until monitoring is over. We expect this to take about 2 hours.
- [x] Green light from engineers helping with monitoring to proceed with auto-deployment

##### Continuing with auto-deployment

Since the previous pipelines were manually created with local pipeline variables, all other auto-deploy packages will be created with Ruby 3.2. Once the monitoring is over, we can proceed with auto-deployments.

Note: we will **not** promote/deploy the same auto-deploy package with Ruby 3.3 to staging and production.

- [x] Let `@sre-oncall` and `@release-managers` know in `#production` that the CR is completed and auto deployments will resume.
- [x] Resume auto-deployments: `/chatops run auto_deploy unpause`
- [x] Set label ~22757328 `/label ~change::complete`

## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

_Estimated Time to Complete (mins)_ - N/A

If at any point during the test rollout to `gstg-cny` the monitoring engineer deems it unsuccessful, we will not deploy to `gprd-cny`. This package is not meant to be deployed beyond `gprd-cny`.

Whether or not the test rollout to `gprd-cny` is deemed successful, we will continue the auto-deployment process with a package built with Ruby 3.2 (that is, any package other than the one we manually create during these rollout steps).

In either case, we will not be rolling back the deployment of the Ruby 3.3 package.

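The decision rules above can be summarised in a short sketch. The `Outcome` enum, the `next_steps` function, and the step strings are hypothetical names used only for illustration; they are not part of any GitLab tooling.

```python
from enum import Enum
from typing import List, Optional


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"


def next_steps(gstg_cny: Outcome, gprd_cny: Optional[Outcome] = None) -> List[str]:
    """Return follow-up actions for the test rollout (illustrative only).

    gprd_cny is None while the gprd-cny rollout has not been evaluated.
    """
    steps: List[str] = []
    if gstg_cny is Outcome.FAILURE:
        # An unsuccessful gstg-cny rollout stops the test before gprd-cny.
        steps.append("do not deploy the Ruby 3.3 package to gprd-cny")
    elif gprd_cny is None:
        # gstg-cny looked healthy, so the test continues on gprd-cny.
        steps.append("deploy the Ruby 3.3 package to gprd-cny and monitor")
        return steps
    # Regardless of the outcome, the Ruby 3.3 package is neither promoted
    # further nor rolled back; auto-deploys continue with Ruby 3.2 packages.
    steps.append("do not promote the Ruby 3.3 package beyond gprd-cny")
    steps.append("resume auto-deploys with Ruby 3.2 packages")
    return steps
```

Note that the result is the same for a successful and a failed `gprd-cny` rollout, which mirrors the statement that auto-deployment continues with Ruby 3.2 packages in either case.
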
In case of a broken production canary deploy, make sure to [drain the canary traffic](https://gitlab.com/gitlab-org/release/docs/-/blob/master/general/deploy/canary.md#how-to-stop-all-production-traffic-to-canary) by running the following command in `#production`:

```
/chatops run canary --disable --production
```

Once a successful production canary deployment has been completed, re-enable canary traffic by running:

```
/chatops run canary --enable --production
```

## Monitoring

### Key metrics to observe

* Dashboards/metrics:
  * Monitor the following dashboards for an unhealthy dip in service health for the environment/cluster that is being rolled out.
    * [Deployment health](https://dashboards.gitlab.net/d/delivery-deployment_health/delivery-deployment-health?orgId=1), configurable with environment, stage, and type/service
    * [Kubernetes compute resource/cluster health](https://dashboards.gitlab.net/d/kubernetes-resources-cluster/kubernetes-compute-resources-cluster?orgId=1&refresh=5m), configurable with clusters
    * [Kubernetes compute resource/pods health](https://dashboards.gitlab.net/d/kubernetes-resources-namespace/kubernetes-compute-resources-namespace-pods?orgId=1&refresh=5m), configurable with clusters and namespace
    * [Kubernetes networking](https://dashboards.gitlab.net/d/kubernetes-cluster-total/kubernetes-networking-cluster?orgId=1&refresh=5m), configurable with clusters
  * Per-service dashboards (change `env` and `stage` to toggle between `gstg`/`gprd` and `main`/`cny`):
    * `api` ([overview](https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&from=now-1h&to=now), [containers](https://dashboards.gitlab.net/d/api-kube-containers/api-kube-containers-detail?from=now-1h&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&orgId=1))
    * `web` ([overview](https://dashboards.gitlab.net/d/web-main/web-overview?orgId=1&from=now-1h&to=now),
      [containers](https://dashboards.gitlab.net/d/web-kube-containers/web-kube-containers-detail?from=now-1h&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&orgId=1))
    * `websockets` ([overview](https://dashboards.gitlab.net/d/websockets-main/websockets-overview?orgId=1&from=now-1h&to=now), [containers](https://dashboards.gitlab.net/d/websockets-kube-containers/websockets-kube-containers-detail?from=now-1h&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&orgId=1))
    * `git` ([overview](https://dashboards.gitlab.net/d/git-main/git-overview?orgId=1&from=now-1h&to=now), [containers](https://dashboards.gitlab.net/d/git-kube-containers/git-kube-containers-detail?from=now-1h&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&orgId=1))
    * `sidekiq` ([overview](https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1&from=now-1h&to=now), [containers](https://dashboards.gitlab.net/d/sidekiq-kube-containers/sidekiq-kube-containers-detail?from=now-1h&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&orgId=1))
  * Kibana - Puma (edit `json.type` to filter by service, `json.stage` for `cny` vs `main`)
    * [Production 5xx responses](https://log.gprd.gitlab.net/goto/e0d9a290-b8c9-11ed-85ed-e7557b0a598c)
    * [Staging 5xx responses](https://nonprod-log.gitlab.net/goto/040ba510-b8ca-11ed-9af2-6131f0ee4ce6)
  * Kibana - Sidekiq (edit `json.shard` to switch between job types)
    * [Failed production jobs](https://log.gprd.gitlab.net/goto/89320700-b813-11ed-9f43-e3784d7fe3ca)
    * [Failed staging jobs](https://nonprod-log.gitlab.net/goto/e2744200-b814-11ed-9af2-6131f0ee4ce6)
  * Sentry
    * [Production overview](https://sentry.gitlab.net/gitlab/gitlabcom/dashboard/?statsPeriod=1h)
    * [Staging overview](https://sentry.gitlab.net/gitlab/staginggitlabcom/dashboard/?statsPeriod=1h)
* QA runs can be observed via Slack:
  * `#announcements` - Besides QA messages, multiple messages are sent to this channel to account for the different
    deployments.
  * QA Slack channels - There is a channel per environment. For example, a failure on gstg or gstg-cny is posted in `#qa-staging`, and a failure on gprd-cny or gprd is posted in `#qa-production`.
* Dealing with deploy failures: https://gitlab.com/gitlab-org/release/docs/-/blob/master/general/deploy/failures.md
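
The rollout steps require the restarted packager pipelines to use the same tag as the cancelled ones, and the deployment pipeline is then started with a version derived from the Omnibus tag. A minimal sketch of that cross-check follows. It assumes a tag layout of `MAJOR.MINOR.TIMESTAMP+GITLAB_SHA[.PACKAGER_SHA]` and a package version that replaces `+` with `-`; both are inferred only from the values recorded in this issue, and the function names are hypothetical rather than part of release-tools.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, inferred from the CNG and Omnibus tags noted in this issue:
#   MAJOR.MINOR.TIMESTAMP+GITLAB_SHA[.PACKAGER_SHA]
TAG_PATTERN = re.compile(
    r"(?P<major>\d+)\.(?P<minor>\d+)\.(?P<timestamp>\d{12})"
    r"\+(?P<gitlab_sha>[0-9a-f]+)(?:\.(?P<packager_sha>[0-9a-f]+))?"
)


class AutoDeployTag(NamedTuple):
    major: int
    minor: int
    timestamp: str
    gitlab_sha: str
    packager_sha: Optional[str]  # absent on the CNG tag in this issue


def parse_tag(tag: str) -> AutoDeployTag:
    """Split a tag into its parts, raising ValueError on an unexpected format."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"unexpected auto-deploy tag format: {tag!r}")
    return AutoDeployTag(
        int(match["major"]),
        int(match["minor"]),
        match["timestamp"],
        match["gitlab_sha"],
        match["packager_sha"],
    )


def same_build(a: AutoDeployTag, b: AutoDeployTag) -> bool:
    """True when both tags share version, timestamp, and GitLab SHA."""
    return a[:4] == b[:4]


def package_version(tag: str) -> str:
    """Derive the version passed to the deploy command (assumed '+' -> '-')."""
    parse_tag(tag)  # validate before rewriting
    return tag.replace("+", "-", 1)
```

With the tags from this rollout, `same_build(parse_tag("18.7.202512031113+b6c5d38b1d1"), parse_tag("18.7.202512031113+b6c5d38b1d1.4d60ff68445"))` returns `True`, and `package_version("18.7.202512031113+b6c5d38b1d1.4d60ff68445")` returns `18.7.202512031113-b6c5d38b1d1.4d60ff68445`, which matches the version used in the `auto_deploy pipeline` command above.
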