Route selected catchall sidekiq jobs to the default queue (Group 4, take 2) - code_testing/continuous_delivery/subgroups/authentication_and_authorization/gitaly/issue_tracking/requirements_management (40 queues)

Production Change

Change Summary

This is take 2 of: #5195 (closed)


To reduce CPU pressure on the Sidekiq Redis cluster, we are moving to a one-queue-per-shard configuration (rather than one-queue-per-worker). This change is an end-game step on that path and routes "real" work from the catchall shard to the default queue.

For this iteration we'll be migrating jobs for a small number of moderately busy feature categories. These account for another 40 queues and about 10% of the job count on catchall in k8s:

  • code_testing
  • continuous_delivery
  • subgroups
  • authentication_and_authorization
  • gitaly
  • issue_tracking
  • requirements_management
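
As a rough illustration of the routing described above, the sketch below models how a job's feature category could decide whether it goes to the shared default queue or stays on its per-worker queue. It is a minimal sketch, not the actual routing configuration used for this change: the `Worker` type, the `route` and `queues_to_listen` helpers, and the example worker names are all hypothetical. Only the category names come from this issue.

```python
from typing import Iterable, List, NamedTuple

# Feature categories migrated in this iteration (from the list in this issue).
MIGRATED_CATEGORIES = frozenset({
    "code_testing",
    "continuous_delivery",
    "subgroups",
    "authentication_and_authorization",
    "gitaly",
    "issue_tracking",
    "requirements_management",
})


class Worker(NamedTuple):
    """Hypothetical description of a Sidekiq worker (illustration only)."""
    name: str
    feature_category: str
    per_worker_queue: str  # queue used under one-queue-per-worker


def route(worker: Worker) -> str:
    """Return the queue a job for `worker` would be pushed to."""
    if worker.feature_category in MIGRATED_CATEGORIES:
        # Migrated categories share the single default queue.
        return "default"
    # Everything else keeps its dedicated per-worker queue for now.
    return worker.per_worker_queue


def queues_to_listen(workers: Iterable[Worker]) -> List[str]:
    """Distinct queues a shard must listen on after routing."""
    return sorted({route(w) for w in workers})


if __name__ == "__main__":
    # Hypothetical workers, named only for this example.
    workers = [
        Worker("ExampleGitalyWorker", "gitaly", "example_gitaly"),
        Worker("ExampleSubgroupsWorker", "subgroups", "example_subgroups"),
        Worker("ExampleOtherWorker", "not_migrated", "example_other"),
    ]
    print(queues_to_listen(workers))
```

Each migrated worker collapses into the same queue, so the number of distinct queues the shard listens on shrinks as more categories are routed to default.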

Change Details

  1. Services Impacted - ServiceRedis ServiceSidekiq
  2. Change Technician - @cmiskell / @msmiley
  3. Change Reviewer - @smcgivern / @msmiley
  4. Time tracking - 1.5hr
  5. Downtime Component - None

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 2 minutes

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 120 minutes

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 90 minutes

  • Create commits that revert the MRs merged up to that point, and apply them. If urgent (system breakage), apply to Chef manually on the affected nodes (web + sidekiq) using knife ssh with high concurrency.

Monitoring

Key metrics to observe

Phase 1

Phase 2

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc.?

N/A

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • Release managers have been informed (if needed; cases include DB changes) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgement.)
  • There are currently no active incidents.
Edited by Matt Smiley