Route all remaining Sidekiq jobs to the default queue (Group 8)
Production Change
Change Summary
To reduce CPU pressure on the Sidekiq Redis cluster, we are moving to a one-queue-per-shard configuration (rather than one-queue-per-worker). This change is an end-game step on that path and routes "real" work from the catchall shard to the default queue.
For this iteration we'll be migrating the final jobs on catchall in k8s.
This accounts for 4 queues in 4 remaining feature categories that were either added or had their categories changed after we started this work (a quick spot-check sketch follows the list):
container_network_security
dependency_proxy
error_tracking
git_lfs
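For context only (not part of the formal steps): Sidekiq stores each queue as a Redis list under a `queue:<name>` key, so the effect of the routing change can be spot-checked directly on the Sidekiq Redis instance. The host variable below is a placeholder and auth flags are omitted; treat this as an illustrative sketch.

```bash
# Illustrative sketch: list every Sidekiq queue key and its length on the
# Sidekiq Redis primary (REDIS_SIDEKIQ_HOST is a placeholder; add auth as needed).
# After the change, the per-worker queues for the four feature categories above
# should trend to zero while queue:default absorbs their jobs.
redis-cli -h "$REDIS_SIDEKIQ_HOST" --scan --pattern 'queue:*' | while read -r key; do
  printf '%-60s %s\n' "$key" "$(redis-cli -h "$REDIS_SIDEKIQ_HOST" llen "$key")"
done | sort -k2 -nr
```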
Change Details
- Services Impacted - Service::Redis, Service::Sidekiq
- Change Technician - @cmiskell
- Change Reviewer - @msmiley
- Time tracking - 1.5hr
- Downtime Component - None
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 2 minutes
- Obtain approval on the chef + k8s MRs that adjust the routing
- Set the change::in-progress label on this issue
- Silence any alerts with the label queue="default" (see the sketch after this list)
- Check that a deployment is not in progress, since disabling chef-client could disrupt it. See Slack channel announcements for deployer status.
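If the silence is created from the CLI rather than the Alertmanager UI, it might look roughly like the following. This is a sketch under assumptions: the amtool flags are standard, but the Alertmanager URL, author, and duration are placeholders; adjust to however silences are normally created for gprd.

```bash
# Sketch only: silence alerts carrying the label queue="default" for the change window.
# ALERTMANAGER_URL is a placeholder; duration and comment are examples.
amtool silence add queue=default \
  --alertmanager.url="$ALERTMANAGER_URL" \
  --author="cmiskell" \
  --comment="Production change 5327: route remaining Sidekiq jobs to default queue" \
  --duration="2h"
```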
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 2 hr
- Chef validation:
  - Stop chef on all rails VMs: knife ssh -C10 'roles:gprd-base-fe-web OR roles:gprd-base-be-sidekiq' "sudo chef-client-disable 'Production 5327'"
  - Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/435 and wait for the apply_to_production job to complete on ops
  - Verify one web node (web-cny-01-sv-gprd.c.gitlab-production.internal):
    - Run chef: sudo chef-client-enable; sudo chef-client. This should run gitlab-ctl reconfigure and restart puma.
    - Verify puma starts correctly and is not in a crash loop: check the puma processes' start times over several minutes (confirm no restarts), and tail /var/log/gitlab/puma/puma_stdout.log.
    - Verify that traffic is being served successfully by tailing /var/log/gitlab/gitlab-rails/production_json.log; if puma is failing to start, this will only contain failing healthcheck requests, so any sign of application requests being processed is positive. For extra care, pipe to | jq .status and verify the requests are generally successful (2xx, 3xx). See the spot-check sketch after this list.
  - Run chef on all rails VMs at a controlled concurrency: knife ssh -C3 'roles:gprd-base-fe-web OR roles:gprd-base-be-sidekiq' "sudo chef-client-enable; sudo chef-client"
- While chef is still running on the remaining nodes, continue in parallel with kubernetes:
  - Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!1114 (merged) - it will apply automatically, in around 40 minutes or so.
  - Observe the Phase 1 metrics + logs as described in Monitoring below while the updates are rolling out; the impact should build over the rollout as the routing changes take effect.
- Migrate scheduled/retry-set jobs to the default queue. On the console node, after ensuring chef has run to pick up the new rules, run this rake task: sudo gitlab-rake gitlab:sidekiq:migrate_jobs:schedule gitlab:sidekiq:migrate_jobs:retry
- Verify that all the re-routed jobs are running in default, not their per-worker queues.
- Merge the MR that stops sidekiq explicitly listening to the per-worker queues: gitlab-com/gl-infra/k8s-workloads/gitlab-com!1115 (merged)
- Observe the Phase 2 metrics as described in Monitoring below.
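A minimal spot-check sketch for the per-node verification steps above, run on the node being verified (the log paths and the jq .status check are the ones referenced in the steps; the exact commands are illustrative):

```bash
# Confirm puma workers are stable: process start times should not keep resetting.
ps -eo pid,lstart,args | grep '[p]uma'

# Confirm real application traffic with healthy statuses; a stream consisting
# only of failing health checks suggests puma is not serving requests properly.
sudo tail -n 500 /var/log/gitlab/gitlab-rails/production_json.log \
  | jq -r '.status' | sort | uniq -c | sort -nr
```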
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 1.5hrs
- Create commits to revert the MRs that have been merged up to that point and apply them. If urgent (system breakage), apply the chef changes manually on the affected nodes (web + sidekiq) with knife ssh at high concurrency (sketch below).
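For the urgent path, the manual chef application could look like the following. The roles and commands mirror the change steps above; the concurrency of 20 is only an example of "high".

```bash
# Sketch: once the reverts are merged and applied on ops, force a chef run
# across both affected fleets at high concurrency.
knife ssh -C20 'roles:gprd-base-fe-web OR roles:gprd-base-be-sidekiq' \
  "sudo chef-client-enable; sudo chef-client"
```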
Monitoring
Key metrics to observe
Phase 1
- Metric: Rate of jobs in the default queue, in logs. This should gradually increase as VMs and pods are reconfigured to use the new routing rules.
- Location: https://log.gprd.gitlab.net/goto/9d88051b13805465f4a13fbb3b63ecfb
- What changes to this metric should prompt a rollback: Not seeing jobs for the new feature categories being processed through default. It's not an immediate disaster, but it does warrant halting while finding an explanation.
- Metric: Default queue details
- Location: https://dashboards.gitlab.net/d/sidekiq-queue-detail/sidekiq-queue-detail?orgId=1&from=now-6h%2Fm&to=now%2Fm&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-queue=default
- What changes to this metric should prompt a rollback: Not seeing results similar to the logs, observing excessive error rates, or any queuing (jobs not being processed). We should see a gradual growth in the absolute metrics (RPS etc.), expecting roughly 4-5 times the current RPS.
- Metric: Catchall shard details
- Location: https://dashboards.gitlab.net/d/sidekiq-shard-detail/sidekiq-shard-detail?orgId=1&from=now-1h&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=catchall
- What changes to this metric should prompt a rollback: All catchall jobs should still run on this shard, so any deviation from normal bounds is problematic. For deviation, consider activity before the change started and any periodic patterns (particularly daily cycles of activity). Also check carefully that any changes to graphs aren't a result of unexpected metrics anomalies (i.e. the measuring is wrong, not the actual effect); there may be edge cases we have missed.
- Metric: Job execution locations
- Location: https://thanos-query.ops.gitlab.net/graph?g0.expr=sum(rate(sidekiq_jobs_completion_seconds_count%7Benv%3D%22gprd%22%2C%20shard%3D%22catchall%22%2C%20pod!%3D%22%22%2C%20queue!~%22mailers%7Cproject_import_schedule%7Cservice_desk_email_receiver%22%7D%5B1m%5D))%20by%20(queue%2C%20worker)&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- What changes to this metric should prompt a rollback: An entire job type disappearing. The expectation is to see jobs migrate from queue=<worker_name> to queue=default for the same worker label value.
- NB: add queue="default" to the query to see just the jobs executing on the new queue, and negate it to confirm they have stopped executing in the old queues (see the sketch below).
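For convenience, the query behind the dashboard link above, with the suggested queue="default" matcher added, can also be run against the standard Prometheus HTTP API exposed by Thanos. This assumes query access to thanos-query.ops.gitlab.net from wherever it is run; flip the matcher to queue!="default" to confirm the old per-worker queues have drained.

```bash
# Decoded from the dashboard URL above, plus the extra queue="default" matcher.
curl -sG 'https://thanos-query.ops.gitlab.net/api/v1/query' \
  --data-urlencode 'query=sum(rate(sidekiq_jobs_completion_seconds_count{env="gprd", shard="catchall", pod!="", queue="default", queue!~"mailers|project_import_schedule|service_desk_email_receiver"}[1m])) by (queue, worker)' \
  | jq '.data.result[] | {worker: .metric.worker, rate: .value[1]}'
```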
Phase 2
- Metric: Redis Sidekiq CPU saturation
- Location: https://dashboards.gitlab.net/d/redis-sidekiq-main/redis-sidekiq-overview?viewPanel=1217942947&orgId=1 - the yellow "redis_primary_cpu_component" line
- What changes to this metric should prompt a rollback: This may drop a little, and it definitely shouldn't rise. Note that the variance/range of normal values is quite high (e.g. it can vary between 50-80% over 5-10 minutes), so any apparent change would need to be sustained before declaring it a problem (a host-level cross-check sketch follows this list).
- Metric: Catchall shard details
- Location: https://dashboards.gitlab.net/d/sidekiq-shard-detail/sidekiq-shard-detail?orgId=1&from=now-1h&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=catchall
- What changes to this metric should prompt a rollback: Primarily keep an eye on queuing, to ensure that we do not have any orphaned jobs that are being scheduled on their old work queue that nothing is listening to.
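If the dashboard readings are in doubt, CPU use of the Redis primary can be cross-checked on the host itself. This is a rough sketch that assumes SSH access to the current Redis Sidekiq primary; the sampling interval and count are arbitrary.

```bash
# Sketch: watch CPU usage of the redis-server process directly. Redis command
# processing is effectively single-threaded, so sustained values near 100% on
# the main process indicate saturation.
top -b -d 5 -n 3 -p "$(pgrep -o redis-server)" | grep -E 'redis-server|%CPU'
```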
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

N/A
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed, if needed (cases include DB changes), prior to the change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents.