2024-09-04: Test Rollout of Ruby 3.2 to gstg-cny and gprd-cny
Production Change
Change Summary
The plan is to test switching auto-deploys to use Ruby 3.2 and perform the deployment of a Ruby 3.2 package to gitlab.com staging-canary (gstg-cny) and production-canary (gprd-cny) on 3rd September 2024 AMER time.
During the change duration, auto-deploys will need to be paused.
Change Details
- Services Impacted - GitLab Rails, Sidekiq, and any other service that uses Ruby
- Change Technician - @jennykim-gitlab
- Change Reviewer - @rpereira2
- Time tracking - 420 minutes (7 hours)
- Downtime Component - none
- Start time - 15:30 UTC
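The timing figures in this plan can be sanity-checked with a little arithmetic. The following is a minimal sketch, assuming the full 420 minutes are used and that each of the two bake periods takes the expected 2 hours; the helper name window_end is illustrative and not part of any GitLab tooling.

```python
def window_end(start_hhmm: str, duration_minutes: int) -> tuple[str, int]:
    """Return the end time (HH:MM) and how many days after the start it falls."""
    hours, minutes = map(int, start_hhmm.split(":"))
    total = hours * 60 + minutes + duration_minutes
    day_offset, minute_of_day = divmod(total, 24 * 60)
    return f"{minute_of_day // 60:02d}:{minute_of_day % 60:02d}", day_offset


# Values from the change details: start 15:30 UTC, 420 minutes of time tracking.
end_time, day_offset = window_end("15:30", 420)
print(f"Window ends at {end_time} UTC (+{day_offset} days)")

# Assumption: two bake periods (gstg-cny and gprd-cny) of about 2 hours each.
bake_minutes = 2 * 120
remaining_minutes = 420 - bake_minutes
print(f"{remaining_minutes} minutes remain for pipelines and coordination")
```

Under these assumptions the window closes at 22:30 UTC on the same day, with 180 minutes left outside the two bake periods.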
Set Maintenance Mode in GitLab
If your change involves scheduled maintenance, add a step to set and unset maintenance mode per our runbooks. This will make sure SLA calculations adjust for the maintenance period.
No need for setting Maintenance mode.
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 420 minutes
- Set label ~change::in-progress: /label ~change::in-progress
Rollout
We will follow the steps documented under https://gitlab.com/gitlab-org/release/docs/-/blob/master/general/ruby-upgrades.md#test-rollouts to perform the test rollout
Pause auto deploys
- Let @sre-oncall and @release-managers know in #production that the CR is starting and auto deployments will be paused.
Start a deployment pipeline containing the new Ruby version
- Run git commit --allow-empty -m "Empty commit to trigger a new auto-deploy pkg" to add an extra commit to the Omnibus or CNG auto deploy branch. https://gitlab.com/gitlab-org/security/omnibus-gitlab/-/commit/57c86ab9474ca347ad84299803db213efea97611
- Trigger a deployment pipeline by running the "MANUAL auto-deploy pick&tag" inactive manual scheduled pipeline: https://ops.gitlab.net/gitlab-org/release/tools/-/pipeline_schedules/. Make a note of the tag of the created pipeline.
- Temporarily gain release-manager permissions for sufficient privilege in the packager pipelines: https://gitlab.com/gitlab-org/release/docs/-/blob/master/release_manager/index.md#temporary-permissions
- Pause auto deploys: /chatops run auto_deploy pause
- Cancel the Omnibus and CNG packager pipelines created for the tag noted earlier.
- Use the following links to start the new packager pipelines. Make sure gitlab-org/omnibus-gitlab!7898 (merged) is merged. Make sure that the tag is the same as that of the cancelled pipelines above.
Rollout to gstg-cny
- Notify Slack channels #development, #backend, #frontend and #staging-ref that the deployment has started.
- Keep an eye on the auto-deploy pipeline and do not allow it to be deployed beyond staging-canary. Cancel the validate_ownership:gstg-cny job to prevent a deployment to gprd-cny from starting.
  - I've run /chatops run deploy lock gprd-cny in #f_upcoming_release just to make sure
- Notify @mokhax and engineers helping with monitoring in #f_ruby3 when the deployment to gstg-cny is done, so they can proceed with Staging Canary: Monitor experimental Ruby 3.2 p... (gitlab-org/gitlab#481752 - closed)
Baking in gstg-cny
- Bake in gstg-cny until monitoring is over. We expect this to take about 2 hours.
- Green light from engineers helping with monitoring to proceed with deployment in gprd-cny.
Rollout to gprd-cny
- Unlock gprd-cny: /chatops run deploy unlock gprd-cny
- Restart the previously cancelled auto-deploy job to proceed with the deployment to gprd-cny.
- Notify @mokhax and engineers helping with monitoring in #f_ruby3 when the deployment to gprd-cny is done, so they can proceed with Prod Canary: Monitor experimental Ruby 3.2 pack... (gitlab-org/gitlab#482005 - closed)
Baking in gprd-cny
- Bake in gprd-cny until monitoring is over. We expect this to take about 2 hours.
- Green light from engineers helping with monitoring to proceed with auto-deployment.
Continuing with auto-deployment
Since the previous pipelines were manually created with local pipeline variables, all other auto-deploy packages will still be built with Ruby 3.1. Once the monitoring is over, we can proceed with auto-deployments.
Note: we will not promote/deploy the same auto-deploy package with Ruby 3.2 to staging and production.
- Let @sre-oncall and @release-managers know in #production that the CR is completed and auto deployments will resume.
- Resume auto-deployments: /chatops run auto_deploy unpause
- Set label ~change::complete: /label ~change::complete
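The ChatOps commands used across the steps above can be summarised as an ordered checklist. This is a minimal sketch: only the command strings come from this plan, while the names CHATOPS_STEPS, index_of and order_is_consistent are illustrative. It prints the commands and checks their order; it does not talk to Slack or ChatOps.

```python
# Command strings are copied from the steps above; labels and structure are illustrative.
CHATOPS_STEPS = [
    ("Pause auto deploys", "/chatops run auto_deploy pause"),
    ("Lock gprd-cny while gstg-cny bakes", "/chatops run deploy lock gprd-cny"),
    ("Unlock gprd-cny after the green light", "/chatops run deploy unlock gprd-cny"),
    ("Resume auto-deployments", "/chatops run auto_deploy unpause"),
]


def index_of(command: str) -> int:
    """Return the position of a command in the planned order."""
    return [cmd for _, cmd in CHATOPS_STEPS].index(command)


def order_is_consistent() -> bool:
    """Check that each undo command comes after the command it reverses."""
    pause_first = index_of("/chatops run auto_deploy pause") < index_of(
        "/chatops run auto_deploy unpause"
    )
    lock_first = index_of("/chatops run deploy lock gprd-cny") < index_of(
        "/chatops run deploy unlock gprd-cny"
    )
    return pause_first and lock_first


for number, (label, command) in enumerate(CHATOPS_STEPS, start=1):
    print(f"{number}. {label}: {command}")
```

Keeping the pairs (pause/unpause, lock/unlock) in one list makes it easier to confirm at the end of the change that nothing was left paused or locked.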
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - N/A
Refer to #18470 (closed)
At any point during the test rollout to gstg-cny, if it is deemed unsuccessful by group::package registry, we will not deploy to gprd-cny.
This package is not meant to be deployed beyond gprd-cny. Whether or not the test rollout to gprd-cny is deemed successful, we will continue the auto-deployment process with a package built with Ruby 3.1 (that is, any package other than the one we manually create during these rollout steps).
In either case, we will not be rolling back the deployment of the Ruby 3.2 package.
In case of a broken production canary deploy, make sure to drain the canary traffic by running the following command in #production:
/chatops run canary --disable --production
Once a successful production canary deployment has been completed, re-enable canary traffic by:
/chatops run canary --enable --production
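The rollback policy above amounts to a small decision table. The sketch below encodes it under the assumption that each stage has a single pass/fail outcome; the function name next_steps and the wording of the returned steps are illustrative, and only the two canary commands are taken from this plan.

```python
# Sketch of the rollback policy described above; nothing here executes ChatOps.
DRAIN_CANARY = "/chatops run canary --disable --production"
ENABLE_CANARY = "/chatops run canary --enable --production"
CONTINUE = (
    "Continue auto-deployments with a Ruby 3.1 package; "
    "do not roll back the Ruby 3.2 package"
)


def next_steps(stage: str, successful: bool) -> list[str]:
    """Return the follow-up actions after a test rollout to the given stage."""
    if stage == "gstg-cny":
        if successful:
            return ["Proceed with the rollout to gprd-cny"]
        return ["Do not deploy to gprd-cny", CONTINUE]
    if stage == "gprd-cny":
        if successful:
            return [CONTINUE]
        return [
            f"If the canary deploy is broken, drain canary traffic: {DRAIN_CANARY}",
            CONTINUE,
            f"After a successful production canary deployment, re-enable traffic: {ENABLE_CANARY}",
        ]
    raise ValueError(f"unknown stage: {stage}")


for step in next_steps("gprd-cny", successful=False):
    print("-", step)
```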
Monitoring
Monitoring will be done mainly on gitlab-org/gitlab#481752 (closed) by group::package registry.
Key metrics to observe
- Dashboards/metrics:
  - Monitor the following dashboards for an unhealthy dip in service health for the environment/cluster that is being rolled out.
    - Deployment health, configurable with environment, stage, and type/service
    - Kubernetes compute resource/cluster health, configurable with clusters
    - Kubernetes compute resource/pods health, configurable with clusters and namespace
    - Kubernetes networking, configurable with clusters
  - Per-service dashboards (change env and stage to toggle between gstg/gprd and main/cny):
    - api (overview, containers)
    - web (overview, containers)
    - websockets (overview, containers)
    - git (overview, containers)
    - sidekiq (overview, containers)
- Kibana - Puma (edit json.type to filter by service, json.stage for cny vs main)
- Kibana - Sidekiq (edit json.shard to switch between job types)
- Sentry
- QA runs can be observed via Slack:
  - #announcements - Besides QA messages, multiple messages are sent to this channel to account for the different deployments.
  - QA Slack channels - There is a channel per environment, for example, a failure on gstg and gstg-cny will be posted in #qa-staging, a failure on gprd-cny and gprd will be posted in #qa-production, etc.
- Dealing with deploy failures: https://gitlab.com/gitlab-org/release/docs/-/blob/master/general/deploy/failures.md
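The QA channel convention described in the list above can be written as a simple lookup. This is a minimal sketch covering only the four environments named in this plan; the names QA_CHANNELS and qa_channel are illustrative, and other environments (the "etc." above) are deliberately left unmapped.

```python
from typing import Optional

# Environment-to-channel mapping as stated in this plan; other environments are not listed here.
QA_CHANNELS = {
    "gstg": "#qa-staging",
    "gstg-cny": "#qa-staging",
    "gprd-cny": "#qa-production",
    "gprd": "#qa-production",
}


def qa_channel(environment: str) -> Optional[str]:
    """Return the Slack channel where QA failures for an environment are posted, if known."""
    return QA_CHANNELS.get(environment)


for env in ("gstg-cny", "gprd-cny"):
    print(f"{env}: watch {qa_channel(env)}")
```

For this change, that means watching #qa-staging during the gstg-cny rollout and #qa-production during the gprd-cny rollout.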
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels blocks deployments and/or blocks feature-flags are applied as necessary.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
  - Release managers have been informed prior to any C1, C2, or blocks deployments change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents that are severity::1 or severity::2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.