2024-09-04: Test Rollout of Ruby 3.2 to gstg-cny and gprd-cny

Production Change

Change Summary

The plan is to test switching auto-deploys to use Ruby 3.2 and perform the deployment of a Ruby 3.2 package to gitlab.com staging-canary (gstg-cny) and production-canary (gprd-cny) on 3rd September 2024 AMER time.

During the change duration, auto-deploys will need to be paused.

Change Details

Services Impacted - GitLab Rails, Sidekiq, and any other service that uses Ruby
Change Technician - @jennykim-gitlab
Change Reviewer - @rpereira2
Time tracking - 420 minutes (7 hours)
Downtime Component - none
Start time - 15:30 UTC

Set Maintenance Mode in GitLab

~~If your change involves scheduled maintenance, add a step to set and unset maintenance mode per our runbooks. This will make sure SLA calculations adjust for the maintenance period.~~

Maintenance mode is not needed for this change.

Detailed steps for the change

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 420 minutes

Set label change::in-progress: /label ~change::in-progress

Rollout

We will follow the steps documented under https://gitlab.com/gitlab-org/release/docs/-/blob/master/general/ruby-upgrades.md#test-rollouts to perform the test rollout.

Pause auto deploys

Let @sre-oncall and @release-managers know in #production that the CR is starting and auto deployments will be paused.

Start a deployment pipeline containing the new Ruby version

Run git commit --allow-empty -m "Empty commit to trigger a new auto-deploy pkg" to add an extra commit to the Omnibus or CNG auto-deploy branch. Commit: https://gitlab.com/gitlab-org/security/omnibus-gitlab/-/commit/57c86ab9474ca347ad84299803db213efea97611
Trigger a deployment pipeline by running the MANUAL auto-deploy pick&tag inactive manual scheduled pipeline: https://ops.gitlab.net/gitlab-org/release/tools/-/pipeline_schedules/. Make a note of the tag of the created pipeline.
- tag: https://ops.gitlab.net/gitlab-org/release/tools/-/tags/17.4.202409051530
- auto-deploy pipeline: https://ops.gitlab.net/gitlab-org/release/tools/-/pipelines/3672190
Temporarily gain release-manager permissions for sufficient privilege in the packager pipelines: https://gitlab.com/gitlab-org/release/docs/-/blob/master/release_manager/index.md#temporary-permissions
Pause auto deploys: /chatops run auto_deploy pause
Cancel the Omnibus and CNG packager pipelines created for the tag noted in the previous step
- https://dev.gitlab.org/gitlab/omnibus-gitlab/-/pipelines?scope=tags&page=1
  - https://dev.gitlab.org/gitlab/omnibus-gitlab/-/pipelines/343366
- https://dev.gitlab.org/gitlab/charts/components/images/-/pipelines?scope=tags&page=1
  - https://dev.gitlab.org/gitlab/charts/components/images/-/pipelines/343365
Use the following links to start the new packager pipelines. Before starting them, confirm that gitlab-org/omnibus-gitlab!7898 (merged) has landed and that the new pipelines run on the same tag as the cancelled pipelines above.
- https://dev.gitlab.org/gitlab/omnibus-gitlab/-/pipelines/new?var%5BUSE_NEXT_RUBY_VERSION_IN_AUTODEPLOY%5D=true
  - https://dev.gitlab.org/gitlab/omnibus-gitlab/-/pipelines/343370
- https://dev.gitlab.org/gitlab/charts/components/images/-/pipelines/new?var%5BUSE_NEXT_RUBY_VERSION_IN_AUTODEPLOY%5D=true
  - https://dev.gitlab.org/gitlab/charts/components/images/-/pipelines/343371
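
The "new pipeline" links above prefill a CI variable through the query string: `var%5B...%5D` is the percent-encoded form of `var[...]`. The following is a minimal Python sketch of how such a link can be assembled, which may help when checking that a hand-edited URL still carries the variable. The helper name `new_pipeline_url` is hypothetical; the project URLs and the variable name are copied from the links in this issue.

```python
from urllib.parse import quote, unquote

# Variable name copied from the packager pipeline links in this issue.
RUBY_FLAG = "USE_NEXT_RUBY_VERSION_IN_AUTODEPLOY"


def new_pipeline_url(project_url: str, variables: dict) -> str:
    """Hypothetical helper: build a 'new pipeline' URL with prefilled variables."""
    parts = []
    for name, value in variables.items():
        # Each variable is passed as var[NAME]=VALUE; the brackets are
        # percent-encoded as %5B and %5D, matching the links above.
        key = quote(f"var[{name}]", safe="")
        parts.append(f"{key}={quote(value, safe='')}")
    return f"{project_url}/-/pipelines/new?" + "&".join(parts)


omnibus_url = new_pipeline_url(
    "https://dev.gitlab.org/gitlab/omnibus-gitlab", {RUBY_FLAG: "true"}
)
cng_url = new_pipeline_url(
    "https://dev.gitlab.org/gitlab/charts/components/images", {RUBY_FLAG: "true"}
)

print(omnibus_url)
print(cng_url)
# Decoding shows the human-readable form of the query string.
print(unquote(omnibus_url))
```

The tag still has to be selected on the pipeline form itself; this sketch only covers the variable prefill.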

Rollout to gstg-cny

Notify Slack channels #development, #backend, #frontend and #staging-ref that the deployment has started.
Keep an eye on the auto-deploy pipeline and do not allow it to be deployed beyond staging-canary. Cancel the validate_ownership:gstg-cny job to prevent a deployment to gprd-cny from starting.
- As an extra safeguard, I've run /chatops run deploy lock gprd-cny in #f_upcoming_release.
Notify @mokhax and engineers helping with monitoring in #f_ruby3 when the deployment to gstg-cny is done, so they can proceed with Staging Canary: Monitor experimental Ruby 3.2 p... (gitlab-org/gitlab#481752 - closed)

Baking in gstg-cny

Bake in gstg-cny until monitoring is over. We expect this to take about 2 hours.
Wait for the green light from the engineers helping with monitoring before proceeding with the deployment to gprd-cny.

Rollout to gprd-cny

Unlock gprd-cny: /chatops run deploy unlock gprd-cny
Restart the previously cancelled auto-deploy job to proceed with the deployment to gprd-cny.
Notify @mokhax and engineers helping with monitoring in #f_ruby3 when the deployment to gprd-cny is done, so they can proceed with Prod Canary: Monitor experimental Ruby 3.2 pack... (gitlab-org/gitlab#482005 - closed)

Baking in gprd-cny

Bake in gprd-cny until monitoring is over. We expect this to take about 2 hours.
Wait for the green light from the engineers helping with monitoring before proceeding with auto-deployment.

Continuing with auto-deployment

Because the previous pipelines were created manually with variables set only on those pipelines, all other auto-deploy packages will still be built with Ruby 3.1. Once the monitoring is over, we can proceed with auto-deployments.

Note: we will not promote/deploy the same auto-deploy package with Ruby 3.2 to staging and production.

Let @sre-oncall and @release-managers know in #production that the CR is completed and auto deployments will resume.
Resume auto-deployments: /chatops run auto_deploy unpause
Set label change::complete: /label ~change::complete
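
The rollout steps issue two blocking ChatOps commands (pausing auto-deploys and locking gprd-cny), and each one needs its counterpart to run before the change is complete. As a rough sanity check, the sketch below lists the commands in the order they appear in this issue and reports any block that is never released. The commands are copied verbatim from the steps; the `UNDO` pairing table and the `unreleased_blocks` helper are illustrative assumptions, not part of any release tooling.

```python
# ChatOps commands copied from the steps in this issue, in execution order.
ROLLOUT_COMMANDS = [
    "/chatops run auto_deploy pause",
    "/chatops run deploy lock gprd-cny",
    "/chatops run deploy unlock gprd-cny",
    "/chatops run auto_deploy unpause",
]

# Illustrative pairing: each blocking command and the command that undoes it.
UNDO = {
    "/chatops run auto_deploy pause": "/chatops run auto_deploy unpause",
    "/chatops run deploy lock gprd-cny": "/chatops run deploy unlock gprd-cny",
}


def unreleased_blocks(commands):
    """Return blocking commands that are never undone later in the list."""
    missing = []
    for i, cmd in enumerate(commands):
        undo = UNDO.get(cmd)
        if undo is not None and undo not in commands[i + 1:]:
            missing.append(cmd)
    return missing


print(unreleased_blocks(ROLLOUT_COMMANDS))      # full plan: nothing left blocked
print(unreleased_blocks(ROLLOUT_COMMANDS[:2]))  # aborted after the lock: both still active
```

If the change is abandoned partway through, the second call shows which blocks would still need to be lifted by hand.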

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - N/A

Refer to #18470 (closed)

If group::package registry deems the test rollout to gstg-cny unsuccessful at any point, we will not deploy to gprd-cny.

This package is not meant to be deployed beyond gprd-cny. Whether or not the test rollout to gprd-cny is deemed successful, we will continue the auto-deployment process with a package built with Ruby 3.1 (any package other than the one we manually create during these rollout steps).

In either case, we will not roll back the deployment of the Ruby 3.2 package.

In case of a broken production canary deploy, make sure to drain the canary traffic by running the following command in #production:

/chatops run canary --disable --production

Once a successful production canary deployment has been completed, re-enable canary traffic by running:

/chatops run canary --enable --production
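
The rollback rules above amount to a small decision table: a failed gstg-cny rollout stops the change before gprd-cny, a broken gprd-cny deploy is handled by draining canary traffic, and the package is never rolled back. The Python sketch below restates that table under those assumptions. The function name `rollback_action` and the returned wording are hypothetical; only the environment names and the ChatOps command come from this issue.

```python
def rollback_action(environment: str, healthy: bool) -> str:
    """Illustrative restatement of the rollback rules in this issue."""
    if environment not in ("gstg-cny", "gprd-cny"):
        raise ValueError(f"unexpected environment: {environment}")
    if healthy:
        # Nothing to do; keep baking until monitoring is over.
        return "continue monitoring"
    if environment == "gstg-cny":
        # A failed staging-canary rollout stops the change before production-canary.
        return "do not deploy to gprd-cny"
    # Broken production canary: drain canary traffic instead of rolling back.
    return "/chatops run canary --disable --production"


for env in ("gstg-cny", "gprd-cny"):
    for healthy in (True, False):
        print(env, healthy, "->", rollback_action(env, healthy))
```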

Monitoring

Monitoring will be done mainly in gitlab-org/gitlab#481752 (closed) by group::package registry.

Key metrics to observe

Dashboards/metrics:
- Monitor the following dashboards for an unhealthy dip in service health in the environment/cluster that is being rolled out.
- Deployment health, configurable with environment, stage, and type/service
- Kubernetes compute resource/cluster health, configurable with clusters
- Kubernetes compute resource/pods health, configurable with clusters and namespace
- Kubernetes networking, configurable with clusters
- Per-service dashboards (change env and stage to toggle between gstg/gprd and main/cny):
  - api (overview, containers)
  - web (overview, containers)
  - websockets (overview, containers)
  - git (overview, containers)
  - sidekiq (overview, containers)
- Kibana - Puma (edit json.type to filter by service, json.stage for cny vs main)
  - Production 5xx responses
  - Staging 5xx responses
- Kibana - Sidekiq (edit json.shard to switch between job types)
  - Failed production jobs
  - Failed staging jobs
- Sentry
  - Production overview
  - Staging overview
QA runs can be observed via Slack:
- #announcements - Besides QA messages, multiple messages are sent to this channel to account for the different deployments.
- QA Slack channels - There is one channel per environment. For example, a failure on gstg or gstg-cny is posted in #qa-staging, and a failure on gprd-cny or gprd is posted in #qa-production.
Dealing with deploy failures: https://gitlab.com/gitlab-org/release/docs/-/blob/master/general/deploy/failures.md

Change Reviewer checklist

C4 C3 C2 C1:

Check if the following applies:
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

Check if the following applies:
- The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
  - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
- The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
- The labels blocks deployments and/or blocks feature-flags are applied as necessary.

Change Technician checklist

Check if all items below are complete:
- The change plan is technically accurate.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- The change execution window respects the Production Change Lock periods.
- For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
- For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
- For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
- Release managers have been informed prior to any C1, C2, or blocks deployments change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents that are severity::1 or severity::2.
- If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.

Edited Sep 06, 2024 by Jenny Kim