Phased rollout of uptycs to production
C2
Production Change - Criticality 2

Change Objective | Describe the objective of the change |
---|---|
Change Type | Rollout of the uptycs client to all production hosts. This change tracks all steps, starting with a minimal set of hosts and ramping up after observation time. |
Services Impacted | All services. |
Change Team Members | @dawsmith, @pharrison |
Change Severity | C2 - since this touches all hosts, but earlier testing has shown minimal impact. |
Buddy check | @Finotto, @sdval, @glopezfernandez |
Tested in staging | yes |
Schedule of the change | Starting week of March 4. Further details below. |
Duration of the change | 2 weeks. |
Detailed steps for the change. Each step must include: | - pre-conditions for execution of the step, - execution commands for the step, - post-execution validation for the step , - rollback of the step |

Related issues:
- Infra issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6272
- Staging testing: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5784

Background:
To roll out the change, we need to add the gitlab-uptycs cookbook to the runlist of each set of hosts that should run uptycs. This was done in staging by adding "recipe[gitlab-uptycs]" to the runlists of various roles.
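
As a minimal sketch of that runlist edit, assuming the roles are stored as Chef role JSON with a top-level `run_list` array: the function name `add_recipe` and the sample role content below are hypothetical and are not taken from the chef-repo.

```python
import json

UPTYCS_RECIPE = "recipe[gitlab-uptycs]"


def add_recipe(role_json, recipe=UPTYCS_RECIPE):
    """Return the role JSON with `recipe` appended to run_list.

    The edit is idempotent: the recipe is not added a second time.
    """
    role = json.loads(role_json)
    run_list = role.setdefault("run_list", [])
    if recipe not in run_list:
        run_list.append(recipe)
    return json.dumps(role, indent=2)


# Hypothetical role content, for illustration only.
example_role = json.dumps(
    {"name": "gstg-base", "run_list": ["recipe[example-cookbook]"]}
)
updated_once = add_recipe(example_role)
updated_twice = add_recipe(updated_once)
```

Running the edit twice leaves the role unchanged after the first pass, which makes it safe to re-apply if a phase is repeated.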
Phase 0.1 - roll out to all of staging
- Remove gitlab-uptycs from roles (gstg-base-be-sidekiq-import)
- Add recipe[gitlab-uptycs] to the gstg-base role and gstg-infra role
- Let this bake for a day or two. During that time, compare graphs for the environment and check Grafana for load/CPU usage: https://dashboards.gitlab.net/d/llfd4b2ik/canary?orgId=1
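
One way to make the bake-time comparison concrete is to compare CPU usage samples from before and after the rollout. The sketch below assumes the samples were exported by hand from Grafana as percentages. The function name, the sample values, and the 5-point threshold are all illustrative assumptions, not agreed limits.

```python
from statistics import mean


def cpu_regressed(before, after, max_increase_pct_points=5.0):
    """Return True if mean CPU usage rose by more than the allowed percentage points."""
    return mean(after) - mean(before) > max_increase_pct_points


# Made-up samples (percent CPU), for illustration only.
baseline = [20.0, 22.0, 21.0]
with_uptycs = [23.0, 24.0, 22.0]
regressed = cpu_regressed(baseline, with_uptycs)
```

A result of `True` would be a signal to pause the ramp-up and look at the dashboards in more detail, not an automatic trigger for the backout plan.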
Phase 1 -- roll out to ops
- Add recipe[gitlab-uptycs] to the runlist for the role ops-base
Phase 2 -- roll out to infra
- Add recipe[gitlab-uptycs] to the runlist for the role gprd-infra
Phase 3 -- add recipe[gitlab-uptycs] to the runlist of the canary and sidekiq roles to test further, part 1:
- gprd-base-be-sidekiq
- gprd-base-fe-git-cny
- gprd-base-fe-api-cny
- gprd-base-fe-web-cny
Phase 4 -- add recipe[gitlab-uptycs] to the runlist of the DB role and specifically check DB:
- gprd-base-db
Watch dashboards on: https://dashboards.gitlab.net/d/000000144/postgresql-overview?orgId=1&var-environment=gprd&var-prometheus=Global&var-type=patroni
Phase 5 -- remove the runlist override from the roles in Phase 3 and 4.
Phase 6 -- prod:
- Add recipe[gitlab-uptycs] to the runlist for the role gprd-base
- Add recipe[gitlab-uptycs] to the runlist for the role dr-base
Let prod bake for 1-2 days before rolling out to DR, again watching CPU usage.
Backout plan - in each phase:
- Remove recipe[gitlab-uptycs] from the node or role for which it was edited.
- Uninstall uptycs. cc @pharrison: do we have a how-to for this?
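
The runlist half of the backout can be sketched as a simple removal, assuming the roles are stored as Chef role JSON with a top-level `run_list` array. The function name `remove_recipe` and the sample role are hypothetical. This only edits the role. It does not uninstall the uptycs client from hosts, which is the open question in the list above.

```python
import json


def remove_recipe(role_json, recipe="recipe[gitlab-uptycs]"):
    """Return the role JSON with every occurrence of `recipe` removed from run_list."""
    role = json.loads(role_json)
    role["run_list"] = [item for item in role.get("run_list", []) if item != recipe]
    return json.dumps(role, indent=2)


# Hypothetical role content, for illustration only.
role_with_uptycs = json.dumps(
    {"name": "ops-base", "run_list": ["recipe[example-cookbook]", "recipe[gitlab-uptycs]"]}
)
rolled_back = json.loads(remove_recipe(role_with_uptycs))
```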