Route selected catchall sidekiq jobs to the default queue (Group 1) - feature category not_owned
Production Change
Change Summary
To reduce CPU pressure on the Sidekiq Redis cluster, we are moving to a one-queue-per-shard configuration (rather than one-queue-per-worker). This change is an end-game step on that path and routes "real" work from the catchall shard to the default queue.
For this iteration we'll be migrating jobs in the 'not_owned' category, which are:
- object_storage:object_storage_background_move - not actively used in staging or production
- object_storage:object_storage_migrate_uploads - only used when migrating from legacy (disk/DB) storage to object storage
- delete_stored_files - used, but rare (handful of times a day)
- external_service_reactive_caching - common; order of 1-8 per second depending on time of day
- flush_counter_increments - behind a feature flag (efficient_counter_attribute), but beginning to be tested (gitlab-org/gitlab#238535 (closed)) so may show up
- chaos:* - technically included, but these are all tagged exclude_from_gitlab_com and so were already routed to the default queue in scalability#1072
NB: reactive_caching is not included: although it is in the not_owned category, its resource boundary means it executes on the low-urgency-cpu-bound shard rather than on catchall.
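For context, the MRs express this as Sidekiq routing rules. A minimal sketch of the shape, assuming the gitlab.yml sidekiq.routing_rules format (the exact queries, and their chef/k8s encodings, are in the MRs linked under Change Steps):

```yaml
# Illustrative sketch only - the real MRs target the specific workers listed
# above and exclude reactive_caching. Each entry is
# [worker matching query, destination queue]; a null queue keeps the
# per-worker queue name.
sidekiq:
  routing_rules:
    - ["feature_category=not_owned", "default"]
    - ["*", null]
```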
Change Details
- Services Impacted - Service::Redis, Service::Sidekiq
- Change Technician - @cmiskell
- Change Reviewer - @smcgivern
- Time tracking - 2hr
- Downtime Component - None
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 2 minutes
- Obtain approval on the chef and k8s MRs that adjust the routing
- Set the change::in-progress label on this issue
- Silence any alerts with the label queue="default"
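If the silences are created with amtool, a sketch along these lines could work (the Alertmanager URL, duration, and comment are placeholders, not values from this plan):

```shell
# Silence anything labelled queue="default" for the expected change window.
amtool silence add 'queue="default"' \
  --alertmanager.url=https://alerts.example.gitlab.net \
  --comment='Routing not_owned catchall jobs to the default queue' \
  --duration=3h
```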
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 2 hr
- In parallel, merge and apply the MRs that route selected jobs to the default queue:
  - https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/202 - in production, wait 35-40 minutes for all nodes to apply the change; manual execution will not be substantially faster.
  - gitlab-com/gl-infra/k8s-workloads/gitlab-com!978 (merged) - will apply automatically, in around 40 minutes.
- While the updates are rolling out, observe the Phase 1 metrics and logs described in Monitoring below; the effects should build over the rollout as the routing changes take effect.
- Migrate scheduled and retry-set jobs for this grouping to the default queue. On the console node, run the rake task reproduced in the block after this list: sudo gitlab-rake gitlab:sidekiq:migrate_jobs:schedule gitlab:sidekiq:migrate_jobs:retry
- Verify that all the re-routed jobs are running in the default queue, not their per-worker queues. Note that before this change the default queue is not visible because no jobs are scheduled on it and therefore it has no metrics; it should pick up work as this change deploys, while the per-worker metrics fall away.
- Merge and apply the MR that stops Sidekiq explicitly listening to the per-worker queues: gitlab-com/gl-infra/k8s-workloads/gitlab-com!979 (merged)
- Observe the Phase 2 metric as described in Monitoring below.
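The rake task from the migration step, as a copy/paste block:

```shell
# On the console node: move jobs already sitting in the scheduled and retry
# sets onto their newly-routed queues (i.e. default for this grouping).
sudo gitlab-rake gitlab:sidekiq:migrate_jobs:schedule gitlab:sidekiq:migrate_jobs:retry
```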
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 1.5hrs
- Create commits to revert the MRs that have been merged up to that point and apply them. If urgent (system breakage), apply the chef change manually on the affected nodes (web + sidekiq) with knife ssh at high concurrency (see the sketch below).
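A sketch of the manual chef run, assuming hypothetical role names for the affected fleets (substitute the real gprd web and sidekiq roles):

```shell
# Force an immediate chef-client run on the sidekiq and web fleets,
# 20 nodes at a time (-C sets knife's SSH concurrency).
knife ssh 'roles:gprd-base-be-sidekiq OR roles:gprd-base-fe-web' \
  'sudo chef-client' -C 20
```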
Monitoring
Key metrics to observe
Phase 1
- Metric: Rate of jobs in the default queue, in logs. This should gradually increase as VMs and pods are reconfigured to use the new routing rules.
  - Location: https://log.gprd.gitlab.net/goto/54ca58153ec8d9aa48a0d1493fd6f5cd
  - What changes to this metric should prompt a rollback: No sign of any jobs appearing in the default queue. This doesn't require an immediate rollback, but it is a sign that the change hasn't had the desired effect.
- Metric: Default queue details
  - Location: https://dashboards.gitlab.net/d/sidekiq-queue-detail/sidekiq-queue-detail?orgId=1&from=now-6h%2Fm&to=now%2Fm&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-queue=default
  - What changes to this metric should prompt a rollback: Not seeing similar results to the logs, observing excessive error rates, or any queuing (jobs not being processed). We should see gradual growth in the absolute metrics (RPS etc.), which may translate into some wobbliness in ratios/apdex during the very early stages when there are few jobs (a few legitimate failures could cause a high apparent error rate).
- Metric: Catchall shard details
  - Location: https://dashboards.gitlab.net/d/sidekiq-shard-detail/sidekiq-shard-detail?orgId=1&from=now-1h&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=catchall
  - What changes to this metric should prompt a rollback: All catchall jobs should still run on this shard, so any deviation from normal bounds is problematic. When judging deviation, consider activity before the change started and any periodic patterns (particularly daily cycles of activity). Also check carefully that any changes to graphs aren't the result of unexpected metrics anomalies (i.e. the measurement is wrong, not the actual effect); there may be edge cases we have missed.
- Metric: Job execution locations
  - Location: https://thanos-query.ops.gitlab.net/graph?g0.range_input=12h&g0.max_source_resolution=0s&g0.expr=sum(rate(sidekiq_jobs_completion_seconds_count%7Benv%3D%22gprd%22%2C%20feature_category%3D%22not_owned%22%2C%20shard%3D%22catchall%22%7D%5B1m%5D))%20by%20(queue%2C%20worker)&g0.tab=0
  - What changes to this metric should prompt a rollback: Any job disappearing entirely. The expected behaviour is jobs migrating from queue=<worker_name> to queue=default for the same worker label value.
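For reference, the query encoded in that Thanos link is:

```promql
sum(rate(sidekiq_jobs_completion_seconds_count{env="gprd", feature_category="not_owned", shard="catchall"}[1m])) by (queue, worker)
```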
Phase 2
- Metric: Redis Sidekiq CPU saturation
  - Location: https://dashboards.gitlab.net/d/redis-sidekiq-main/redis-sidekiq-overview?viewPanel=1217942947&orgId=1 - the yellow "redis_primary_cpu_component" line
  - What changes to this metric should prompt a rollback: We expect this to drop, although the magnitude of the change may be small; it definitely shouldn't rise. Note that the variance of normal values is quite high (e.g. it can swing between 50-80% over 5-10 minutes), so any apparent change would need to be sustained before declaring it a problem.
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

None
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.