Route selected catchall sidekiq jobs to the default queue (Group 1) - feature category not_owned
Production Change
Change Summary
To reduce CPU pressure on the Sidekiq Redis cluster, we are moving to a one-queue-per-shard configuration (rather than one-queue-per-worker). This change is an end-game step on that path and routes "real" work from the catchall shard to the default queue.
For this iteration we'll be migrating jobs in the 'not_owned' category, which are:
- object_storage:object_storage_background_move - not actively used in staging or production
- object_storage:object_storage_migrate_uploads - only used when migrating from legacy (disk/DB) storage to object storage
- delete_stored_files - used, but rare (handful of times a day)
- external_service_reactive_caching - common; order of 1-8 per second depending on time of day
- flush_counter_increments - behind a feature flag (efficient_counter_attribute), but beginning to be tested (gitlab-org/gitlab#238535 (closed)) so may show up
- chaos:* - technically included, but these are all tagged exclude_from_gitlab_com and so were already routed to the default queue in scalability#1072
NB: reactive_caching is not included: although it is in the not_owned category, its resource boundary means it executes on the low-urgency-cpu-bound shard rather than on catchall.
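For context, the MRs express this as Sidekiq routing rules. A minimal sketch of the shape, assuming the gitlab.yml sidekiq.routing_rules format (the exact queries, and their chef/k8s encodings, are in the MRs linked under Change Steps):

```yaml
# Illustrative sketch only - the real MRs target the specific workers listed
# above and exclude reactive_caching. Each entry is
# [worker matching query, destination queue]; a null queue keeps the
# per-worker queue name.
sidekiq:
  routing_rules:
    - ["feature_category=not_owned", "default"]
    - ["*", null]
```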
Change Details
- Services Impacted - Service::Redis, Service::Sidekiq
- Change Technician - @cmiskell
- Change Reviewer - @smcgivern
- Time tracking - 2hr
- Downtime Component - None
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 2 minutes
- Obtain approval on the chef and k8s MRs that adjust the routing
- Set the change::in-progress label on this issue
- Silence any alerts with the label queue="default"
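If the silences are created with amtool, a sketch along these lines could work (the Alertmanager URL, duration, and comment are placeholders, not values from this plan):

```shell
# Silence anything labelled queue="default" for the expected change window.
amtool silence add 'queue="default"' \
  --alertmanager.url=https://alerts.example.gitlab.net \
  --comment='Routing not_owned catchall jobs to the default queue' \
  --duration=3h
```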
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 2 hr
- In parallel, merge and apply the MRs that route selected jobs to the default queue:
  - https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/202 - in production, wait 35-40 minutes for all nodes to apply the change; manual execution will not be substantially faster.
  - gitlab-com/gl-infra/k8s-workloads/gitlab-com!978 (merged) - will apply automatically, in around 40 minutes.
- While the updates are rolling out, observe the Phase 1 metrics and logs described in Monitoring below; the effects should build over the rollout as the routing changes take effect.
- Migrate scheduled and retry-set jobs for this grouping to the default queue. On the console node, run the rake task reproduced in the block after this list: sudo gitlab-rake gitlab:sidekiq:migrate_jobs:schedule gitlab:sidekiq:migrate_jobs:retry
- Verify that all the re-routed jobs are running in the default queue, not their per-worker queues. Note that before this change the default queue is not visible because no jobs are scheduled on it and therefore it has no metrics; it should pick up work as this change deploys, while the per-worker metrics fall away.
- Merge and apply the MR that stops Sidekiq explicitly listening to the per-worker queues: gitlab-com/gl-infra/k8s-workloads/gitlab-com!979 (merged)
- Observe the Phase 2 metric as described in Monitoring below.
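The rake task from the migration step, as a copy/paste block:

```shell
# On the console node: move jobs already sitting in the scheduled and retry
# sets onto their newly-routed queues (i.e. default for this grouping).
sudo gitlab-rake gitlab:sidekiq:migrate_jobs:schedule gitlab:sidekiq:migrate_jobs:retry
```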
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 1.5hrs
- Create commits to revert the MRs that have been merged up to that point and apply them. If urgent (system breakage), apply the chef change manually on the affected nodes (web + sidekiq) with knife ssh at high concurrency (see the sketch below).
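A sketch of the manual chef run, assuming hypothetical role names for the affected fleets (substitute the real gprd web and sidekiq roles):

```shell
# Force an immediate chef-client run on the sidekiq and web fleets,
# 20 nodes at a time (-C sets knife's SSH concurrency).
knife ssh 'roles:gprd-base-be-sidekiq OR roles:gprd-base-fe-web' \
  'sudo chef-client' -C 20
```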
Monitoring
Key metrics to observe
Phase 1
- Metric: Rate of jobs in the default queue, in logs. This should gradually increase as VMs and pods are reconfigured to use the new routing rules.
  - Location: https://log.gprd.gitlab.net/goto/54ca58153ec8d9aa48a0d1493fd6f5cd
  - What changes to this metric should prompt a rollback: No sign of any jobs appearing in the default queue. This doesn't require an immediate rollback, but it is a sign that the change hasn't had the desired effect.
- Metric: Default queue details
  - Location: https://dashboards.gitlab.net/d/sidekiq-queue-detail/sidekiq-queue-detail?orgId=1&from=now-6h%2Fm&to=now%2Fm&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-queue=default
  - What changes to this metric should prompt a rollback: Not seeing similar results to the logs, observing excessive error rates, or any queuing (jobs not being processed). We should see gradual growth in the absolute metrics (RPS etc.), which may translate into some wobbliness in ratios/apdex during the very early stages when there are few jobs (a few legitimate failures could cause a high apparent error rate).
- Metric: Catchall shard details
  - Location: https://dashboards.gitlab.net/d/sidekiq-shard-detail/sidekiq-shard-detail?orgId=1&from=now-1h&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=catchall
  - What changes to this metric should prompt a rollback: All catchall jobs should still run on this shard, so any deviation from normal bounds is problematic. When judging deviation, consider activity before the change started and any periodic patterns (particularly daily cycles of activity). Also check carefully that any changes to graphs aren't the result of unexpected metrics anomalies (i.e. the measurement is wrong, not the actual effect); there may be edge cases we have missed.
- Metric: Job execution locations
  - Location: https://thanos-query.ops.gitlab.net/graph?g0.range_input=12h&g0.max_source_resolution=0s&g0.expr=sum(rate(sidekiq_jobs_completion_seconds_count%7Benv%3D%22gprd%22%2C%20feature_category%3D%22not_owned%22%2C%20shard%3D%22catchall%22%7D%5B1m%5D))%20by%20(queue%2C%20worker)&g0.tab=0
  - What changes to this metric should prompt a rollback: Any job disappearing entirely. The expected behaviour is jobs migrating from queue=<worker_name> to queue=default for the same worker label value.
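For reference, the query encoded in that Thanos link is:

```promql
sum(rate(sidekiq_jobs_completion_seconds_count{env="gprd", feature_category="not_owned", shard="catchall"}[1m])) by (queue, worker)
```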
Phase 2
- Metric: Redis Sidekiq CPU saturation
  - Location: https://dashboards.gitlab.net/d/redis-sidekiq-main/redis-sidekiq-overview?viewPanel=1217942947&orgId=1 - the yellow "redis_primary_cpu_component" line
  - What changes to this metric should prompt a rollback: We expect this to drop, although the magnitude of the change may be small; it definitely shouldn't rise. Note that the variance of normal values is quite high (e.g. it can swing between 50-80% over 5-10 minutes), so any apparent change would need to be sustained before declaring it a problem.
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

None
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.