Migrate the urgent-other sidekiq shard to Kubernetes
Production Change - Criticality 3

Change Component | Description |
---|---|
Change Objective | Describe the objective of the change |
Change Type | ConfigurationChange|HotFix|DeploymentNewFeature|Operation |
Services Impacted | List services |
Change Team Members | Name of the engineers involved in the change |
Change Criticality | C3 |
Change Reviewer or tested in staging | A colleague who will review the change, or evidence that the change was tested in the staging environment |
Dry-run output | If the change is done through a script, the script must have a dry-run capability; run the change in dry-run mode and output the result |
Due Date | Date and time in UTC for the execution of the change; if possible, add the local timezone of the engineer executing the change |
Time tracking | To estimate and record times associated with changes (including a possible rollback) |
Enables the urgent-other shard in production:
Monitoring
- Dashboard: https://dashboards.gitlab.net/d/sidekiq-shard-detail/sidekiq-shard-detail?orgId=1&from=now-1h&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=urgent-other
- Logs: https://log.gprd.gitlab.net/goto/98232115616824f596325016fcc6a041
- Exceptions in logs: https://log.gprd.gitlab.net/goto/7ddc6367a345719d7243b24da9f74296
Detailed steps for the change
- 2020-06-09 Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!261 (merged)
- Monitor logs and the shard dashboard (above)
- 2020-06-10 Stop sidekiq-cluster on 50% of the VM infrastructure for this shard
- 2020-06-10 Stop sidekiq-cluster on 100% of the VM infrastructure for this shard
- Remove VM infrastructure for this shard https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1832
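The 50% and 100% stop steps above could be scripted roughly as follows. This is a minimal sketch under stated assumptions, not the procedure that was actually run: it assumes `sudo gitlab-ctl stop sidekiq-cluster` is the counterpart of the start command used in the rollback steps, that the node list is supplied one name per line, and that those names are reachable over SSH. The function names `take_percent` and `stop_on_hosts` and the `DRY_RUN` variable are hypothetical.

```shell
# Sketch only: staged stop of sidekiq-cluster on the urgent-other VMs.
# take_percent, stop_on_hosts and DRY_RUN are illustrative names, not existing tooling.

ROLE_QUERY='roles:gprd-base-be-sidekiq-urgent-other'

# Read host names from stdin (one per line, blank lines ignored) and print
# the first N percent of them, rounded up.
take_percent() {
  awk -v p="$1" '
    NF { h[++n] = $0 }
    END {
      c = int((n * p + 99) / 100)
      for (i = 1; i <= c; i++) print h[i]
    }'
}

# Stop sidekiq-cluster on the hosts given as arguments.
# DRY_RUN defaults to 1, in which case the command is printed, not executed.
stop_on_hosts() {
  host_list="$*"
  [ -z "$host_list" ] && return 0
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "DRY RUN: knife ssh -m '$host_list' 'sudo gitlab-ctl stop sidekiq-cluster'"
  else
    knife ssh -m "$host_list" 'sudo gitlab-ctl stop sidekiq-cluster'
  fi
}

# Possible usage (not executed here), assuming the search prints one node per line:
#   hosts=$(knife search node "$ROLE_QUERY" -i | take_percent 50)
#   DRY_RUN=1 stop_on_hosts $hosts
```

Defaulting to dry-run mode means the exact `knife ssh` invocation can be pasted into the change issue as the dry-run output before anything is stopped.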
Rollback steps
- Start sidekiq-cluster on the VMs
  knife ssh 'roles:gprd-base-be-sidekiq-urgent-other' 'sudo gitlab-ctl start sidekiq-cluster'
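If the rollback is ever needed, it may help to follow the start command with a status check. The sketch below is an assumption about how that could look, not part of the original plan: `rollback_shard` and `DRY_RUN` are hypothetical names, and the status check simply reuses `gitlab-ctl status` for the same service.

```shell
# Sketch only: start sidekiq-cluster on every VM in a role, then check its status.
# rollback_shard and DRY_RUN are illustrative names, not existing tooling.
rollback_shard() {
  query="roles:$1"
  for action in start status; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      # Dry-run mode (the default): print the command instead of running it.
      echo "DRY RUN: knife ssh '$query' 'sudo gitlab-ctl $action sidekiq-cluster'"
    else
      knife ssh "$query" "sudo gitlab-ctl $action sidekiq-cluster"
    fi
  done
}

# Possible usage (not executed here):
#   DRY_RUN=1 rollback_shard gprd-base-be-sidekiq-urgent-other
```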
Changes checklist
- Detailed steps and rollback steps have been filled prior to commencing work
- SRE on-call has been informed prior to change being rolled out
- There are currently no active incidents
Edited by John Jarvis