
Migrate the urgent-cpu-bound sidekiq shard to Kubernetes

Production Change - Criticality 3 (C3)

| Change Component | Description |
| --- | --- |
| Change Objective | Migrate the urgent-cpu-bound sidekiq shard to Kubernetes |
| Change Type | ConfigurationChange |
| Services Impacted | Service::Sidekiq, sidekiq_shard::UrgentCpuBound |
| Change Team Members | @skarbek |
| Change Criticality | C3 |
| Change Reviewer or tested in staging | @jarv, STAGING |
| Dry-run output | n/a, CI automation and shell one-liners will complete most tasks |
| Due Date | 2020-07-14 16:00 UTC |
| Time tracking | 2.5 hours |

Moves the urgent-cpu-bound shard in production from VMs to Kubernetes. We'll do this by first merging the configuration that enables this shard on Kubernetes. This will immediately spin up Pods, which will start to process items in the queues for which this shard is responsible. We'll let this bake in for a while, monitoring the Pods for erroneous behavior. After some time has passed, we'll stop sidekiq on the VMs in a controlled manner.
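
As a quick sanity check once the configuration is deployed, something along these lines (run from the console node listed in the Emergency Rollback Procedure; deployment name and namespace taken from the rollback steps) should show the new Pods coming up:

```shell
# Confirm the new deployment exists and watch its Pods reach Running
kubectl get deploy gitlab-sidekiq-urgent-cpu-bound-v1 -n gitlab
kubectl get pods -n gitlab -w | grep urgent-cpu-bound
```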

Monitoring

Detailed steps for the change

  • Merge: gitlab-com/gl-infra/k8s-workloads/gitlab-com!282 (merged)
  • Ensure the above change is propagated to production
  • Check the above monitoring for any drop in apdex, increase in errors, or exceptions specifically on the Kubernetes Infrastructure
    • If problems do occur, immediately execute Rollback Procedure 1 below
  • Wait 1 hour
  • Move forward only when we've validated that error ratios and apdex have not been negatively impacted and that errors in the logs have not increased
  • Stop Sidekiq on the VMs for this shard incrementally: `knife ssh 'roles:gprd-base-be-sidekiq-urgent-cpu-bound' 'sleep 600; sudo gitlab-ctl stop sidekiq-cluster' -C 1`
    • While the shutdowns occur, watch the metrics and logs above (a monitoring sketch follows this list)
    • We expect sidekiq to start processing more jobs in Kubernetes
    • If at any point apdex or error rates breach thresholds, immediately stop the above and start sidekiq back up on the VMs (Rollback Procedure 2 below)
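
A rough sketch of what to watch on the Kubernetes side while the VM shutdowns proceed, assuming the console node and deployment name from the rollback steps below:

```shell
# Confirm Pods are Running and skim their sidekiq output for errors
kubectl get pods -n gitlab -o wide | grep urgent-cpu-bound
kubectl logs -n gitlab deploy/gitlab-sidekiq-urgent-cpu-bound-v1 --tail=20
```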

Rollback steps

Procedure 1

  • Revert the configuration change merged above (gitlab-com/gl-infra/k8s-workloads/gitlab-com!282) and allow CI to propagate the revert to production
  • If for any reason we cannot wait for the revert to complete, see the Emergency Rollback Procedure below

Procedure 2

  • Start up sidekiq on the Virtual Machines: `knife ssh 'roles:gprd-base-be-sidekiq-urgent-cpu-bound' 'sudo gitlab-ctl start'` (a verification sketch follows this list)
  • Determine whether Rollback Procedure 1 should also be executed
    • This can be determined by continuing to observe the same metrics and logs mentioned above
    • This is mildly subjective, but if error rates or apdex do not improve, proceed with the revert; doing so allows a calm, collected retrospective to determine future actions
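
A minimal check that sidekiq actually came back up on the VMs (assuming `gitlab-ctl status` is an adequate signal here):

```shell
# Every node in the shard should report sidekiq-cluster as running
knife ssh 'roles:gprd-base-be-sidekiq-urgent-cpu-bound' 'sudo gitlab-ctl status sidekiq-cluster'
```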

Emergency Rollback Procedure

  • Ensure all VMs are online as outlined in Procedure 2
  • Stop Pods immediately:
    • `ssh console-01-sv-gprd.c.gitlab-production.internal`
    • `kubectl scale deploy gitlab-sidekiq-urgent-cpu-bound-v1 -n gitlab --replicas 0`
    • This will immediately start to terminate the running Pods; validate this via `kubectl get pods -n gitlab` (see the sketch after this list)
  • Proceed with Procedure 1
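
A sketch for confirming the scale-down took effect, using the same deployment and namespace as above:

```shell
# The deployment should show 0/0 replicas and no urgent-cpu-bound Pods should remain
kubectl get deploy gitlab-sidekiq-urgent-cpu-bound-v1 -n gitlab
watch -n 5 'kubectl get pods -n gitlab | grep urgent-cpu-bound'
```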

Summary of infrastructure changes

  • Does this change introduce new compute instances? yes
  • Does this change re-size any existing compute instances? no
  • Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc? yes

We are moving the sidekiq shard from our Virtual Machines running Omnibus installations into Kubernetes, using our Cloud Native installation. This migration removes the Virtual Machines from processing jobs and also removes them from Chef when they are shut down. As Pods start to pick up work, we should observe no change in apdex or error ratios, as sidekiq should already be configured to process the same set of work as the VMs.
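
Once the VMs are shut down and removed from Chef, a knife search along these lines (a sketch; the actual cleanup is tracked in delivery#971) should eventually return no nodes:

```shell
# List nodes still registered in Chef under this shard's role
knife search node 'roles:gprd-base-be-sidekiq-urgent-cpu-bound' -i
```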

During the migration, the tooling used for troubleshooting changes. Previously we'd SSH into machines to change the behavior of this shard; after we've migrated to Kubernetes, we'll need to use kubectl for any investigative work. Outside of troubleshooting and investigation, our CI tooling should suffice.
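
For reference, the rough kubectl equivalents of the old ssh-based investigation, run from the console node (pod names below are placeholders):

```shell
kubectl get pods -n gitlab | grep urgent-cpu-bound        # list the shard's Pods
kubectl logs -f <pod-name> -n gitlab                      # tail a Pod's sidekiq logs
kubectl describe pod <pod-name> -n gitlab                 # events, restarts, resource limits
kubectl exec -it <pod-name> -n gitlab -- /bin/bash        # shell into a Pod if needed
```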

VMs will be cleaned up as part of issue delivery#971 (closed)

Changes checklist

  • Detailed steps and rollback steps have been filled prior to commencing work
  • SRE on-call has been informed prior to change being rolled out
  • There are currently no active incidents