Migrate the urgent-cpu-bound sidekiq shard to Kubernetes
Production Change - Criticality 3 (C3)
| Change Component | Description |
|---|---|
| Change Objective | Migrate the urgent-cpu-bound sidekiq shard to Kubernetes |
| Change Type | ConfigurationChange |
| Services Impacted | Service::Sidekiq, sidekiq_shard::UrgentCpuBound |
| Change Team Members | @skarbek |
| Change Criticality | C3 |
| Change Reviewer or tested in staging | @jarv, STAGING |
| Dry-run output | n/a, CI automation and shell one-liners will complete most tasks |
| Due Date | 2020-07-14 16:00 UTC |
| Time tracking | 2.5 hours |
Moves the urgent-cpu-bound shard in production from VMs to Kubernetes. We'll do this by first merging the configuration that enables this shard on Kubernetes. This will immediately spin up Pods which will start to process items in the queues for which this shard is responsible. We'll let this bake for a bit, monitoring the Pods for erroneous behavior. After some time has passed, we'll stop Sidekiq on the VMs in a controlled manner.
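As a rough reference, one way we might watch the new Pods come up and confirm they are picking up jobs (the label selector is an assumption; the Deployment name is taken from the emergency rollback procedure below):

```shell
# Watch the shard's Pods come online (label selector is a guess)
kubectl get pods -n gitlab -l app=sidekiq -w

# Tail logs from the shard's Deployment to confirm jobs are being processed
kubectl logs -n gitlab deploy/gitlab-sidekiq-urgent-cpu-bound-v1 --tail=50 -f
```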
Monitoring
- Shard Dashboard: https://dashboards.gitlab.net/d/sidekiq-shard-detail/sidekiq-shard-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=urgent-cpu-bound
- Sidekiq Overview Dashboard: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2
- Logs: https://log.gprd.gitlab.net/goto/7d367df9d7cb86e0fcc91b1b98a82aee
- Exceptions in logs: https://log.gprd.gitlab.net/goto/392cc20efcb784835b08b47cdcf93230
Detailed steps for the change
- Merge: gitlab-com/gl-infra/k8s-workloads/gitlab-com!282 (merged)
- Ensure the above change is propagated to production
- Check the above monitoring for any drop in apdex, increase in errors, or exceptions, specifically on the Kubernetes infrastructure
  - If problems do occur, immediately roll back using Procedure 1 below
- Wait 1 hour
- Move forward only when we've validated that we have not negatively impacted apdex or error ratios and have not seen an increase in errors in the logs
- Stop Sidekiq on the VMs for this shard incrementally: `knife ssh 'roles:gprd-base-be-sidekiq-urgent-cpu-bound' 'sleep 600; sudo gitlab-ctl stop sidekiq-cluster' -C 1`
  - While the shutdowns occur, watch the metrics and logs above (a verification one-liner is sketched after this list)
  - We expect Sidekiq to start processing more jobs in Kubernetes
  - If at any point apdex or error rates breach thresholds, immediately stop the above and start the VMs back up (Rollback Procedure 2 below)
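To confirm each VM has actually stopped processing, a quick status check (this reuses the same knife targeting as the stop command; exact output depends on the omnibus services installed on these nodes):

```shell
# Confirm sidekiq-cluster is down on each VM in the shard
knife ssh 'roles:gprd-base-be-sidekiq-urgent-cpu-bound' 'sudo gitlab-ctl status sidekiq-cluster'
```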
Rollback steps
Procedure 1
- Revert: gitlab-com/gl-infra/k8s-workloads/gitlab-com!282 (merged)
- Ensure the revert is applied to the production infrastructure
  - Doing so will terminate the Deployment of this shard (a quick check is sketched below)
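A minimal check that the revert has taken effect, assuming the Deployment name used in the emergency procedure below:

```shell
# After the revert is applied, the shard's Deployment should be gone
kubectl get deploy gitlab-sidekiq-urgent-cpu-bound-v1 -n gitlab
# Expected: Error from server (NotFound) once the revert has rolled out
```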
Procedure 2
- Start up Sidekiq on the virtual machines: `knife ssh 'roles:gprd-base-be-sidekiq-urgent-cpu-bound' 'sudo gitlab-ctl start'`
- Determine whether Rollback Procedure 1 should also be executed
  - This can be determined by continuing to observe the same metrics and logs as mentioned above
  - Mildly subjective, but if error rates or apdex do not improve, proceed with the revert; doing so will allow a calm, collected mind to perform a mindful retrospective and determine future actions
Emergency Rollback Procedure
- Ensure all VMs are online as outlined in Procedure 2
- Stop Pods immediately: `ssh console-01-sv-gprd.c.gitlab-production.internal`, then run `kubectl scale deploy gitlab-sidekiq-urgent-cpu-bound-v1 -n gitlab --replicas 0`
  - This will immediately start to terminate the running Pods; validate this via `kubectl get pods -n gitlab` (see the sketch after this list)
- Proceed with Procedure 1
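For reference, one way to watch the Pods drain and confirm the Deployment has been scaled down (the Deployment name comes from the scale command above):

```shell
# Watch the shard's Pods terminate
kubectl get pods -n gitlab -w

# Confirm the Deployment now has zero desired replicas
kubectl get deploy gitlab-sidekiq-urgent-cpu-bound-v1 -n gitlab
```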
Summary of infrastructure changes
- Does this change introduce new compute instances? Yes
- Does this change re-size any existing compute instances? No
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? Yes
We are moving the sidekiq shard from our virtual machines using omnibus installations into Kubernetes, using our Cloud Native installation. This migration removes virtual machines from processing jobs, and they are also removed from chef when they are shut down. As Pods start to pick up work, we should observe no change in apdex or error ratios, as Sidekiq should already be configured to process the same set of work as the VMs.
During the migration, the tooling used for troubleshooting changes. Previously we would SSH into machines to change the behavior of this shard; after we've migrated to Kubernetes, we'll need to use kubectl for any investigative work. Outside of troubleshooting/investigation, our CI tooling should suffice.
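As a rough equivalent of the old SSH workflow, these are the kubectl commands we'd likely reach for (the pod name is a placeholder; the namespace matches the commands above):

```shell
# List the Pods running in the gitlab namespace
kubectl get pods -n gitlab

# Tail logs from a specific Pod (replace <pod-name> with one from the listing)
kubectl logs -n gitlab <pod-name> -f

# Open a shell inside a Pod for deeper investigation
kubectl exec -it -n gitlab <pod-name> -- /bin/sh
```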
VMs will be cleaned up as part of issue: delivery#971 (closed)
Changes checklist
- Detailed steps and rollback steps have been filled in prior to commencing work
- SRE on-call has been informed prior to the change being rolled out
- There are currently no active incidents