Route selected catchall sidekiq jobs to the default queue (Group 4, take 2) - code_testing/continuous_delivery/subgroups/authentication_and_authorization/gitaly/issue_tracking/requirements_management (40 queues)

Production Change

Change Summary

To reduce CPU pressure on the Sidekiq Redis cluster, we are moving to a one-queue-per-shard configuration (rather than one-queue-per-worker). This change is an end-game step on that path and routes "real" work from the catchall shard to the default queue.

For this iteration we'll be migrating jobs for a small number of moderately busy feature categories. These account for another 40 queues and about 10% of the job count on catchall in k8s:

code_testing
continuous_delivery
subgroups
authentication_and_authorization
gitaly
issue_tracking
requirements_management
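
As a rough illustration of the intended effect (not the actual chef/k8s configuration, which lives in the MRs), the routing can be modelled as an ordered list of rules where the first match wins. The names `MIGRATED_CATEGORIES`, `RULES` and `route` are hypothetical, and the first-match-wins behaviour is an assumption of this sketch:

```python
from typing import Callable, List, Optional, Tuple

# Feature categories migrated in this iteration (taken from the list above).
MIGRATED_CATEGORIES = {
    "code_testing",
    "continuous_delivery",
    "subgroups",
    "authentication_and_authorization",
    "gitaly",
    "issue_tracking",
    "requirements_management",
}

# A rule is (predicate on feature category, target queue).
# A target of None means "leave the job on its per-worker queue".
Rule = Tuple[Callable[[str], bool], Optional[str]]

# Hypothetical rule list; assumed to be evaluated top to bottom, first match wins.
RULES: List[Rule] = [
    (lambda category: category in MIGRATED_CATEGORIES, "default"),
    (lambda category: True, None),  # everything else is untouched by this change
]


def route(feature_category: str, per_worker_queue: str) -> str:
    """Return the queue a job would be scheduled on under the rules above."""
    for matches, target in RULES:
        if matches(feature_category):
            return target if target is not None else per_worker_queue
    return per_worker_queue
```

For example, `route("gitaly", "some_worker_queue")` yields `"default"`, while a job from a category not listed above stays on its own queue.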

Change Details

Services Impacted - Service::Redis, Service::Sidekiq
Change Technician - @cmiskell / @msmiley
Change Reviewer - @smcgivern / @msmiley
Time tracking - 1.5hr
Downtime Component - None

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 2 minutes

Obtain approval on chef + k8s MRs that adjust the routing:
Set label change::in-progress on this issue
Silence any alerts with label queue="default"
Check that a deployment is not in progress, since disabling chef-client could disrupt it. See slack channel announcements for deployer status.

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 120 minutes

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 90 minutes

Create commits that revert the MRs merged up to that point, and apply them. If urgent (system breakage), apply the chef change manually on the affected nodes (web + sidekiq) using knife ssh with high concurrency.

Monitoring

Key metrics to observe

Phase 1

Metric: Rate of jobs in the default queue, in logs. This should gradually increase as VMs and pods are reconfigured to use the new routing rules.
- Location: https://log.gprd.gitlab.net/goto/1a74fc278ff856bf31ffaa9a5a7ed199
- What changes to this metric should prompt a rollback: Not seeing jobs for the new feature categories being processed through default. It's not an immediate disaster, but does warrant halting while finding an explanation.
Metric: Default queue details
- Location: https://dashboards.gitlab.net/d/sidekiq-queue-detail/sidekiq-queue-detail?orgId=1&from=now-6h%2Fm&to=now%2Fm&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-queue=default
- What changes to this metric should prompt a rollback: Not seeing similar results to the logs, observing excessive error rates, or any queuing (jobs not being processed). We should see gradual growth in the absolute metrics (RPS etc.); we expect RPS to reach roughly 4-5 times its pre-change level.
Metric: Catchall shard details
- Location: https://dashboards.gitlab.net/d/sidekiq-shard-detail/sidekiq-shard-detail?orgId=1&from=now-1h&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=catchall
- What changes to this metric should prompt a rollback: All catchall jobs should still run on this shard, so any deviation from normal bounds is problematic. When judging deviation, consider activity before the change started and any periodic patterns (particularly daily cycles of activity). Also check carefully that any changes in the graphs aren't the result of unexpected metrics anomalies (i.e. the measurement is wrong, rather than there being a real effect); there may be edge cases we have missed.
Metric: Job execution locations
- Location: https://thanos-query.ops.gitlab.net/graph?g0.range_input=1h&g0.max_source_resolution=0s&g0.expr=sum(rate(sidekiq_jobs_completion_seconds_count%7Benv%3D%22gprd%22%2C%20feature_category%3D~%22code_testing%7Ccontinuous_delivery%7Csubgroups%7Cauthentication_and_authorization%7Cgitaly%7Cissue_tracking%7Crequirements_management%22%2C%20shard%3D%22catchall%22%2C%20pod!%3D%22%22%7D%5B1m%5D))%20by%20(queue%2C%20worker)&g0.tab=0
- What changes to this metric should prompt a rollback: An entire job (worker) disappearing. We expect to see jobs migrate from queue=<worker_name> to queue=default for the same worker label value.
- NB: add queue="default" to the query to see just the jobs executing on the new queue, and negate it (queue!="default") to confirm they have stopped executing on the old queues.
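
For convenience, the "Job execution locations" query can be decoded from the URL above and the two variants from the NB built from it. This is a small sketch: `ENCODED_EXPR` is the `g0.expr` value copied from that URL, and the helper `with_matcher` is a hypothetical name that simply appends a label matcher inside the selector braces.

```python
from urllib.parse import unquote

# The g0.expr parameter from the "Job execution locations" URL, verbatim.
ENCODED_EXPR = (
    "sum(rate(sidekiq_jobs_completion_seconds_count%7Benv%3D%22gprd%22%2C%20"
    "feature_category%3D~%22code_testing%7Ccontinuous_delivery%7Csubgroups%7C"
    "authentication_and_authorization%7Cgitaly%7Cissue_tracking%7C"
    "requirements_management%22%2C%20shard%3D%22catchall%22%2C%20pod!%3D%22%22"
    "%7D%5B1m%5D))%20by%20(queue%2C%20worker)"
)

# Percent-decode into the plain query text.
base_query = unquote(ENCODED_EXPR)


def with_matcher(query: str, matcher: str) -> str:
    """Append a label matcher just before the first closing brace of the selector."""
    head, brace, tail = query.partition("}")
    if not brace:
        raise ValueError("query has no label selector to extend")
    return f"{head}, {matcher}{brace}{tail}"


# Jobs already executing on the new queue.
on_default = with_matcher(base_query, 'queue="default"')
# Jobs still executing on the old per-worker queues.
on_old_queues = with_matcher(base_query, 'queue!="default"')

if __name__ == "__main__":
    print(base_query)
    print(on_default)
    print(on_old_queues)
```

As the migration progresses, the `on_old_queues` variant should trend towards returning no series for the listed feature categories.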

Phase 2

Metric: Redis Sidekiq CPU saturation
- Location: https://dashboards.gitlab.net/d/redis-sidekiq-main/redis-sidekiq-overview?viewPanel=1217942947&orgId=1 - the yellow "redis_primary_cpu_component" line
- What changes to this metric should prompt a rollback: This may drop a little, and definitely shouldn't rise. Note that the variance/range of normal values is quite high (e.g. can vary between 50-80% over 5-10 minutes), so any apparent changes would need to be sustained to declare it a problem
Metric: Catchall shard details
- Location: https://dashboards.gitlab.net/d/sidekiq-shard-detail/sidekiq-shard-detail?orgId=1&from=now-1h&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=catchall
- What changes to this metric should prompt a rollback: Primarily keep an eye on queuing, to ensure that we do not have any orphaned jobs that are being scheduled on their old work queue that nothing is listening to.
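
Because the Redis CPU saturation metric is noisy, one way to make "sustained" concrete is to require that every sample in a recent window exceeds a threshold before treating it as a real rise. This is only a sketch: the function name, the 80% threshold (the upper end of the normal range quoted above) and the 15-sample window are illustrative assumptions, not agreed rollback criteria.

```python
from typing import Sequence


def sustained_above(samples: Sequence[float], threshold: float, min_consecutive: int) -> bool:
    """Return True if the most recent `min_consecutive` samples all exceed `threshold`.

    `samples` is assumed to be ordered oldest to newest at a fixed interval
    (e.g. one sample per minute).
    """
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(samples) < min_consecutive:
        return False  # not enough data to call it sustained
    return all(value > threshold for value in samples[-min_consecutive:])


if __name__ == "__main__":
    # Hypothetical per-minute CPU saturation percentages.
    noisy = [55.0, 78.0, 62.0, 85.0, 70.0, 81.0, 60.0]
    elevated = [60.0] * 5 + [85.0] * 15

    # Brief spikes above 80% within normal variance: not sustained.
    print(sustained_above(noisy, threshold=80.0, min_consecutive=5))
    # 15 consecutive minutes above 80%: sustained.
    print(sustained_above(elevated, threshold=80.0, min_consecutive=15))
```

Choosing a window longer than the 5-10 minute swings mentioned above avoids reacting to ordinary variance.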

Summary of infrastructure changes

Does this change introduce new compute instances?
Does this change re-size any existing compute instances?
Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

N/A

Changes checklist

This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
This issue has the change technician as the assignee.
Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
Necessary approvals have been completed based on the Change Management Workflow.
Change has been tested in staging and results noted in a comment on this issue.
A dry-run has been conducted and results noted in a comment on this issue.
SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
There are currently no active incidents.

Edited Aug 05, 2021 by Matt Smiley