Increase Sidekiq BRPOP timeout from 2 to 5 seconds
Production Change
Change Summary
Set SIDEKIQ_SEMI_RELIABLE_FETCH_TIMEOUT to 5 (seconds) on Sidekiq nodes.
This should relieve some of the pressure from the CPU saturation on redis-sidekiq tracked in #4049 (closed), by reducing the overhead of setting up and tearing down connections.
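As a rough illustration of why a longer timeout can help, the sketch below estimates how often idle fetchers re-issue BRPOP. It assumes each idle fetcher sends a new BRPOP as soon as the previous one times out on an empty queue. The function name `idle_brpop_rate` and the fetcher count are hypothetical and not taken from production. Busy fetchers return before the timeout and are not covered by this estimate.

```python
def idle_brpop_rate(idle_fetchers: int, timeout_s: float) -> float:
    """Approximate BRPOP commands per second from fetchers idling on empty queues."""
    if timeout_s <= 0:
        raise ValueError("timeout_s must be positive")
    # Each idle fetcher completes one timed-out BRPOP every timeout_s seconds.
    return idle_fetchers / timeout_s


if __name__ == "__main__":
    fetchers = 1000  # hypothetical count, for illustration only
    before = idle_brpop_rate(fetchers, 2)
    after = idle_brpop_rate(fetchers, 5)
    reduction = 1 - after / before
    print(f"before={before:.0f}/s after={after:.0f}/s reduction={reduction:.0%}")
```

Under this assumption the relative reduction does not depend on the number of fetchers: moving from 2 to 5 seconds cuts idle BRPOP traffic by 60%.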
Change Details
- Services Impacted - ServiceSidekiq
- Change Technician - @igorwwwwwwwwwwwwwwwwwwww @reprazent
- Change Criticality - C3
- Change Type - changeunscheduled, changescheduled
- Change Reviewer - @reprazent
- Due Date - Depends on gitlab-org/gitlab!57351 (merged) getting to production
- Time tracking - Time, in minutes, needed to execute all change steps, including rollback
- Downtime Component - none
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Make sure gitlab-org/gitlab!57351 (merged) is available on production
- Get flamegraphs of the redis-sidekiq before the change
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Merge https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5238 that increases the timeout in staging
- Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!763 (merged) that increases the timeout for k8s workers in staging
- Report back here
- Merge https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5240 that increases the timeout for catchall
- Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!763 (merged) that increases the timeout for k8s workers in gprd
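The setting rolled out by these merge requests is an environment variable. A minimal sketch of reading such a value with a fallback follows. The actual fetcher is Ruby code and its parsing is not shown in this issue, so the function name `fetch_timeout` and the validation are assumptions. The default of 2 is the previous value from the issue title.

```python
import os

ENV_VAR = "SIDEKIQ_SEMI_RELIABLE_FETCH_TIMEOUT"
DEFAULT_TIMEOUT_S = 2  # previous value, per the issue title


def fetch_timeout(env=None) -> int:
    """Return the BRPOP timeout in seconds from the environment, or the default."""
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR)
    if raw is None or raw.strip() == "":
        return DEFAULT_TIMEOUT_S
    value = int(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        # A BRPOP timeout of 0 blocks indefinitely; this sketch rejects it.
        raise ValueError(f"{ENV_VAR} must be a positive integer, got {raw!r}")
    return value


if __name__ == "__main__":
    print(fetch_timeout({}))              # falls back to the default
    print(fetch_timeout({ENV_VAR: "5"}))  # value set by this change
```

Passing a plain dict instead of `os.environ` keeps the check easy to exercise without touching the real process environment.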
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Verify CPU saturation on https://dashboards.gitlab.net/d/redis-sidekiq-main/redis-sidekiq-overview?orgId=1&viewPanel=68
- Get flamegraph of primary after the change
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Revert the aforementioned MRs
Monitoring
Key metrics to observe
- Metric: redis-sidekiq-cpu saturation
- Metric: Sidekiq apdex, error ratio, and RPS
Summary of infrastructure changes
- Does this change introduce new compute instances? No
- Does this change re-size any existing compute instances? No
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.
Edited by Matt Smiley