
Migrate the urgent-cpu-bound sidekiq shard to Kubernetes

Production Change - Criticality 3 (C3)

| Change Component | Description |
| --- | --- |
| Change Objective | Migrate the urgent-cpu-bound sidekiq shard to Kubernetes |
| Change Type | ConfigurationChange |
| Services Impacted | Service::Sidekiq, sidekiq_shard::UrgentCpuBound |
| Change Team Members | @skarbek |
| Change Criticality | C3 |
| Change Reviewer or tested in staging | @jarv, STAGING |
| Dry-run output | n/a, CI automation and shell one-liners will complete most tasks |
| Due Date | 2020-07-14 16:00 UTC |
| Time tracking | 2.5 hours |

Moves the urgent-cpu-bound shard in production from VMs to Kubernetes. We'll do this by first merging the configuration that enables this shard on Kubernetes. This will immediately spin up Pods, which will start to process items in the queues for which this shard is responsible. We'll let this bake in for a while, monitoring the Pods for erroneous behavior. After some time has passed, we'll stop sidekiq on the VMs in a controlled manner.
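
As a quick sanity check once the configuration is deployed, something along these lines (run from the console node listed in the Emergency Rollback Procedure; deployment name and namespace taken from the rollback steps) should show the new Pods coming up:

```shell
# Confirm the new deployment exists and watch its Pods reach Running
kubectl get deploy gitlab-sidekiq-urgent-cpu-bound-v1 -n gitlab
kubectl get pods -n gitlab -w | grep urgent-cpu-bound
```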

Monitoring

Detailed steps for the change

  • Merge: gitlab-com/gl-infra/k8s-workloads/gitlab-com!282 (merged)
  • Ensure the above change is propagated to production
  • Check the above monitoring for any drop in apdex, increase in errors, or exceptions specifically on the Kubernetes Infrastructure
    • If problems do occur, immediately execute Rollback Procedure 1 below
  • Wait 1 hour
  • Move forward only when we've validated that error ratios and apdex have not been negatively impacted and that errors in the logs have not increased
  • Stop Sidekiq on the VMs for this shard incrementally: `knife ssh 'roles:gprd-base-be-sidekiq-urgent-cpu-bound' 'sleep 600; sudo gitlab-ctl stop sidekiq-cluster' -C 1`
    • While the shutdowns occur, watch the metrics and logs above (a monitoring sketch follows this list)
    • We expect sidekiq to start processing more jobs in Kubernetes
    • If at any point apdex or error rates breach thresholds, immediately stop the above and start sidekiq back up on the VMs (Rollback Procedure 2 below)
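
A rough sketch of what to watch on the Kubernetes side while the VM shutdowns proceed, assuming the console node and deployment name from the rollback steps below:

```shell
# Confirm Pods are Running and skim their sidekiq output for errors
kubectl get pods -n gitlab -o wide | grep urgent-cpu-bound
kubectl logs -n gitlab deploy/gitlab-sidekiq-urgent-cpu-bound-v1 --tail=20
```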

Rollback steps

Procedure 1

  • Revert the configuration change merged above (gitlab-com/gl-infra/k8s-workloads/gitlab-com!282) and allow CI to propagate the revert to production
  • If for any reason we cannot wait for the revert to complete, see the Emergency Rollback Procedure below

Procedure 2

  • Start up sidekiq on the Virtual Machines: `knife ssh 'roles:gprd-base-be-sidekiq-urgent-cpu-bound' 'sudo gitlab-ctl start'` (a verification sketch follows this list)
  • Determine whether Rollback Procedure 1 should also be executed
    • This can be determined by continuing to observe the same metrics and logs mentioned above
    • This is mildly subjective, but if error rates or apdex do not improve, proceed with the revert; doing so allows a calm, collected retrospective to determine future actions
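
A minimal check that sidekiq actually came back up on the VMs (assuming `gitlab-ctl status` is an adequate signal here):

```shell
# Every node in the shard should report sidekiq-cluster as running
knife ssh 'roles:gprd-base-be-sidekiq-urgent-cpu-bound' 'sudo gitlab-ctl status sidekiq-cluster'
```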

Emergency Rollback Procedure

  • Ensure all VMs are online as outlined in Procedure 2
  • Stop Pods immediately:
    • `ssh console-01-sv-gprd.c.gitlab-production.internal`
    • `kubectl scale deploy gitlab-sidekiq-urgent-cpu-bound-v1 -n gitlab --replicas 0`
    • This will immediately start to terminate the running Pods; validate this via `kubectl get pods -n gitlab` (see the sketch after this list)
  • Proceed with Procedure 1
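
A sketch for confirming the scale-down took effect, using the same deployment and namespace as above:

```shell
# The deployment should show 0/0 replicas and no urgent-cpu-bound Pods should remain
kubectl get deploy gitlab-sidekiq-urgent-cpu-bound-v1 -n gitlab
watch -n 5 'kubectl get pods -n gitlab | grep urgent-cpu-bound'
```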

Summary of infrastructure changes

  • Does this change introduce new compute instances? yes
  • Does this change re-size any existing compute instances? no
  • Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc? yes

We are moving the sidekiq shard from our Virtual Machines running Omnibus installations into Kubernetes, using our Cloud Native installation. This migration removes the Virtual Machines from processing jobs and also removes them from Chef when they are shut down. As Pods start to pick up work, we should observe no change in apdex or error ratios, as sidekiq should already be configured to process the same set of work as the VMs.
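
Once the VMs are shut down and removed from Chef, a knife search along these lines (a sketch; the actual cleanup is tracked in delivery#971) should eventually return no nodes:

```shell
# List nodes still registered in Chef under this shard's role
knife search node 'roles:gprd-base-be-sidekiq-urgent-cpu-bound' -i
```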

During the migration, the tooling used for troubleshooting changes. Previously we'd SSH into machines to change the behavior of this shard; after we've migrated to Kubernetes, we'll need to use kubectl for any investigative work. Outside of troubleshooting and investigation, our CI tooling should suffice.
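
For reference, the rough kubectl equivalents of the old ssh-based investigation, run from the console node (pod names below are placeholders):

```shell
kubectl get pods -n gitlab | grep urgent-cpu-bound        # list the shard's Pods
kubectl logs -f <pod-name> -n gitlab                      # tail a Pod's sidekiq logs
kubectl describe pod <pod-name> -n gitlab                 # events, restarts, resource limits
kubectl exec -it <pod-name> -n gitlab -- /bin/bash        # shell into a Pod if needed
```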

VMs will be cleaned up as part of issue delivery#971 (closed)

Changes checklist

  • Detailed steps and rollback steps have been filled prior to commencing work
  • SRE on-call has been informed prior to change being rolled out
  • There are currently no active incidents