2024-09-04: Test Rollout of Ruby 3.2 to gstg-cny and gprd-cny

Production Change

Change Summary

This change tests switching auto-deploys to Ruby 3.2 by deploying a Ruby 3.2 package to the gitlab.com staging-canary (gstg-cny) and production-canary (gprd-cny) environments on 3rd September 2024, AMER time.

Auto-deploys will need to be paused for the duration of the change.

Change Details

  1. Services Impacted - GitLab Rails, Sidekiq, and any other service that uses Ruby
  2. Change Technician - @jennykim-gitlab
  3. Change Reviewer - @rpereira2
  4. Time tracking - 420 minutes (7 hours)
  5. Downtime Component - none
  6. Start time - 15:30 UTC
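
As a rough check of the change window, the figures above (start at 15:30 UTC, 420 minutes tracked) and the two bake periods of about 2 hours each from the rollout steps can be worked through as below. This is only a sketch; the helper name `hhmm` is made up for illustration, and the real end time depends on how long monitoring takes.

```python
from datetime import timedelta

# Figures taken from the change details: start 15:30 UTC, 420 minutes total.
start = timedelta(hours=15, minutes=30)
duration = timedelta(minutes=420)
end = start + duration

# Two bake periods of roughly 2 hours each (gstg-cny, then gprd-cny).
bake_total = timedelta(hours=2) * 2
remaining = duration - bake_total  # time left for deploys and coordination


def hhmm(offset: timedelta) -> str:
    """Format a time-of-day offset as HH:MM (hypothetical helper)."""
    minutes = int(offset.total_seconds()) // 60
    return f"{minutes // 60 % 24:02d}:{minutes % 60:02d}"


print("window ends at", hhmm(end), "UTC")
print("minutes outside baking:", int(remaining.total_seconds()) // 60)
```

Under these assumptions the window closes at 22:30 UTC, with about 180 minutes available outside the bake periods.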

Set Maintenance Mode in GitLab

If your change involves scheduled maintenance, add a step to set and unset maintenance mode per our runbooks. This will make sure SLA calculations adjust for the maintenance period.

No need for setting Maintenance mode.

Detailed steps for the change

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 420 minutes

Rollout

We will follow the steps documented under https://gitlab.com/gitlab-org/release/docs/-/blob/master/general/ruby-upgrades.md#test-rollouts to perform the test rollout

Pause auto deploys
  • Let @sre-oncall and @release-managers know in #production that the CR is starting and auto deployments will be paused.
Start a deployment pipeline containing the new Ruby version
Rollout to gstg-cny
  • Notify Slack channels #development, #backend, #frontend and #staging-ref that the deployment has started.
  • Keep an eye on the auto-deploy pipeline and do not allow it to deploy beyond staging-canary. Cancel the validate_ownership:gstg-cny job to prevent a deployment to gprd-cny from starting.
    • I've run /chatops run deploy lock gprd-cny in #f_upcoming_release just to make sure
  • Notify @mokhax and engineers helping with monitoring in #f_ruby3 when the deployment to gstg-cny is done, so they can proceed with Staging Canary: Monitor experimental Ruby 3.2 p... (gitlab-org/gitlab#481752 - closed)
Baking in gstg-cny
  • Bake in gstg-cny until monitoring is over. We expect this to take about 2 hours.
  • Green light from engineers helping with monitoring to proceed with deployment in gprd-cny.
Rollout to gprd-cny
Baking in gprd-cny
  • Bake in gprd-cny until monitoring is over. We expect this to take about 2 hours.
  • Green light from engineers helping with monitoring to proceed with auto-deployment
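
The ordering and the two green-light gates in the checklist above can be sketched as a small state machine. This is an illustrative model only, not tooling used by the release process: the stage names paraphrase the checklist, the `Rollout` class is hypothetical, and it covers only the happy path (it does not model skipping gprd-cny after an unsuccessful gstg-cny rollout).

```python
from dataclasses import dataclass, field

# Stage names paraphrase the checklist; this list is an illustrative model.
STAGES = [
    "pause auto-deploys",
    "start Ruby 3.2 pipeline",
    "rollout to gstg-cny",
    "bake in gstg-cny",
    "rollout to gprd-cny",
    "bake in gprd-cny",
    "resume auto-deploys",
]

# Stages that may only begin after monitoring engineers give a green light.
NEEDS_GREEN_LIGHT = {"rollout to gprd-cny", "resume auto-deploys"}


@dataclass
class Rollout:
    completed: list = field(default_factory=list)
    green_lights: set = field(default_factory=set)

    def next_stage(self):
        """Return the next stage name, or None when every stage is done."""
        if len(self.completed) == len(STAGES):
            return None
        return STAGES[len(self.completed)]

    def approve(self, stage):
        """Record a green light for a gated stage."""
        if stage not in NEEDS_GREEN_LIGHT:
            raise ValueError(f"{stage!r} is not a gated stage")
        self.green_lights.add(stage)

    def advance(self):
        """Complete the next stage, refusing if its gate is still closed."""
        stage = self.next_stage()
        if stage is None:
            raise RuntimeError("rollout already finished")
        if stage in NEEDS_GREEN_LIGHT and stage not in self.green_lights:
            raise PermissionError(f"waiting for green light before {stage!r}")
        self.completed.append(stage)
        return stage
```

For example, after four calls to `advance()` the next stage is the gprd-cny rollout, and a further `advance()` raises `PermissionError` until `approve("rollout to gprd-cny")` has been called.
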
Continuing with auto-deployment

Because the test pipelines above are created manually with local pipeline variables, all other auto-deploy packages will still be built with Ruby 3.1. Once the monitoring is over, we can proceed with auto-deployments.

Note: we will not promote/deploy the same auto-deploy package with Ruby 3.2 to staging and production.

  • Let @sre-oncall and @release-managers know in #production that the CR is completed and auto deployments will resume.
  • Resume auto-deployments: /chatops run auto_deploy unpause
  • Set the change::complete label: /label ~change::complete

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - N/A

Refer to #18470 (closed)

If group::package registry deems the test rollout to gstg-cny unsuccessful at any point, we will not deploy to gprd-cny.

This package is not meant to be deployed beyond gprd-cny. Whether or not the test rollout to gprd-cny is deemed successful, we will continue the auto-deployment process with a package built with Ruby 3.1 (that is, any package other than the one we manually create during these rollout steps).

In all of these cases, we will not roll back the deployment of the Ruby 3.2 package.

In case of a broken production canary deploy, make sure to drain the canary traffic by running the following command in #production:

/chatops run canary --disable --production

Once a successful production canary deployment has been completed, re-enable canary traffic by running:

/chatops run canary --enable --production
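
The two canary commands above differ only in one flag, so the decision can be summarised in a tiny helper. This is a sketch for illustration: the function name `canary_command` is hypothetical, and it only builds the command text to paste into #production; it does not call ChatOps.

```python
def canary_command(deploy_healthy: bool) -> str:
    """Return the ChatOps command to run in #production.

    A broken production canary deploy means draining canary traffic
    (--disable); a successful one means re-enabling it (--enable).
    """
    action = "--enable" if deploy_healthy else "--disable"
    return f"/chatops run canary {action} --production"


print(canary_command(deploy_healthy=False))
print(canary_command(deploy_healthy=True))
```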

Monitoring

Monitoring will be done mainly on gitlab-org/gitlab#481752 (closed) by group::package registry.

Key metrics to observe

Change Reviewer checklist

C4 C3 C2 C1:

  • Check if the following applies:
    • The scheduled day and time of execution of the change is appropriate.
    • The change plan is technically accurate.
    • The change plan includes estimated timing values based on previous testing.
    • The change plan includes a viable rollback plan.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • Check if the following applies:
    • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
    • The change plan includes success measures for all steps/milestones during the execution.
    • The change adequately minimizes risk within the environment/service.
    • The performance implications of executing the change are well-understood and documented.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.
      • If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
    • The change has a primary and secondary SRE with knowledge of the details available during the change window.
    • The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
    • The labels blocks deployments and/or blocks feature-flags are applied as necessary.

Change Technician checklist

  • Check if all items below are complete:
    • The change plan is technically accurate.
    • This Change Issue is linked to the appropriate Issue and/or Epic
    • Change has been tested in staging and results noted in a comment on this issue.
    • A dry-run has been conducted and results noted in a comment on this issue.
    • The change execution window respects the Production Change Lock periods.
    • For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
    • For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
    • For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
    • For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
    • Release managers have been informed prior to any C1, C2, or blocks deployments change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
    • There are currently no active incidents that are severity::1 or severity::2.
    • If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.
Edited by Jenny Kim