Route selected catchall Sidekiq jobs to the default queue (Group 1) - feature category not_owned

Production Change

Change Summary

To reduce CPU pressure on the Sidekiq Redis cluster, we are moving to a one-queue-per-shard configuration (rather than one-queue-per-worker). This change is an end-game step on that path and routes "real" work from the catchall shard to the default queue.

For this iteration we'll be migrating jobs in the 'not_owned' category, which are:

  • object_storage:object_storage_background_move - not actively used in staging or production
  • object_storage:object_storage_migrate_uploads - only used when migrating from legacy (disk/DB) storage to object storage
  • delete_stored_files - used, but rarely (a handful of times a day)
  • external_service_reactive_caching - common; on the order of 1-8 per second depending on time of day
  • flush_counter_increments - behind a feature flag (efficient_counter_attribute), but starting to be tested (gitlab-org/gitlab#238535 (closed)) so it may show up
  • chaos:* - technically included, but all tagged exclude_from_gitlab_com, so already routed to default in scalability#1072

NB: reactive_caching is not included: although it is in the not_owned category, its resource boundary means it executes on the low-urgency-cpu-bound shard.
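
For illustration, the kind of routing rule involved could look roughly like the snippet below. This is a minimal sketch, assuming the worker-matching query syntax (feature_category, resource_boundary) and a gitlab.rb-style sidekiq['routing_rules'] setting; the actual change on GitLab.com goes through the usual configuration MRs, so the setting name and exact rules here are assumptions, not the configuration being merged.

```ruby
# Illustrative sketch only - not the exact configuration applied by this change.
# Route not_owned catchall work to the `default` queue, excluding CPU-bound
# workers (e.g. reactive_caching), which stay on the low-urgency-cpu-bound shard.
sidekiq['routing_rules'] = [
  ['feature_category=not_owned&resource_boundary!=cpu', 'default'],
  # Fallback: all other workers keep their per-worker queues
  # (a nil/empty queue means "use the queue derived from the worker name").
  ['*', nil],
]
```

Rules are evaluated in order and the first match wins, which is why the not_owned rule precedes the wildcard fallback.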

Change Details

  1. Services Impacted - Service::Redis, Service::Sidekiq
  2. Change Technician - @cmiskell
  3. Change Reviewer - @smcgivern
  4. Time tracking - 2hr
  5. Downtime Component - None

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 2 minutes

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 2 hr

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 1.5hrs

  • Create commits reverting the MRs that have been merged up to that point, and apply them. If urgent (e.g. system breakage), apply the Chef changes manually on the affected nodes (web + Sidekiq) with knife ssh at high concurrency; a sketch of that command follows.
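
A rough sketch of the emergency Chef run, assuming knife ssh with its -C (concurrency) option; the roles:... search queries are hypothetical placeholders, not the exact role names used for the web and Sidekiq fleets.

```shell
# Emergency-only: force an immediate chef-client run at high concurrency.
# The role names below are placeholders for the real web and Sidekiq roles.
knife ssh -C 20 'roles:gprd-base-fe-web' 'sudo chef-client'
knife ssh -C 20 'roles:gprd-base-be-sidekiq' 'sudo chef-client'
```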

Monitoring

Key metrics to observe

Phase 1

Phase 2

  • Metric: Redis Sidekiq CPU saturation

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

None

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • There are currently no active incidents.