Improved Sidekiq troubleshooting strategies
Summary
Raised to document some of the issues we're seeing and to ask for help from group::memory.
We seem to be getting an increasing number of Sidekiq-related performance tickets.
For example (ticket links are accessible to GitLab team members only):
- [1] Merge requests get stuck "checking pipeline status"; cannot press 'merge'
- significant underlying issue with WebHookWorker jobs
- Increased generic Sidekiq workers from one to five, and added a dedicated `WebHookWorker` process alongside the existing `repository_update_mirror` and `project_import_schedule` processes. This seems to have resolved it.
- [2] Long delays before pipelines start
- [3] Pipelines and mirroring getting stuck
- [4] Ongoing issues; multiple previous tickets. Sidekiq-related performance degrades about a week after the last Sidekiq restart.
- It's possible they're running into: gitaly#4113 (closed)
- They are running into: #362914 - 14 RPS / 70% of Sidekiq jobs by volume
- [5] Pipelines stuck in created or running states.
- They had concurrency set to 175; reducing this to 10 resolved it, but I don't think we had a way to measure what was going wrong.
- [6] Widgets spinning in MRs
- Possible issue with reactive caching; we didn't get to the bottom of this.
Sometimes these tickets are resolved by adding more `queue_groups`.
Other times, customers already have additional queue groups, but perhaps they're quite specific. We don't have a way to measure whether the defined queue groups are a good use of resources - a Sidekiq worker process will typically use 1-1.5 GB of RAM.
The reference architecture guidance is to define generic workers:

```ruby
sidekiq['queue_groups'] = ['*'] * 2
```
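By contrast, ticket [1] ended up with generic workers plus dedicated processes for specific queues. As a rough illustration of what that looks like in `/etc/gitlab/gitlab.rb` (a sketch, not the customer's actual configuration; the `web_hook` queue name is an assumption and should be checked against the worker's generated queue name):

```ruby
# Sketch of a queue_groups layout similar to the one described in ticket [1].
# Each entry creates one Sidekiq process, typically using 1-1.5 GB of RAM.
sidekiq['queue_groups'] = [
  '*', '*', '*', '*', '*',      # five generic processes listening on all queues
  'web_hook',                   # dedicated process for WebHookWorker jobs (assumed queue name)
  'repository_update_mirror',   # existing dedicated process for mirror updates
  'project_import_schedule'     # existing dedicated process for scheduled project imports
]
```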
Two questions which arose on ticket [1]:
- ❓ Do threads in a Sidekiq process share anything with each other, aside from CPU, that could potentially create internal bottlenecks? Some kind of networking bottleneck owing to TCP timeouts, for example.
- The customer had a very large volume of failing webhooks - some failing with TLS errors, some timing out at the TCP level because the target was offline. 20 RPS; 77% of their Sidekiq jobs by volume.
- In one set of SOS dumps (12th July) that we analyzed, there was no notable queuing occurring. So either there wasn't a problem when those SOS reports were captured, or we're missing a way to measure why jobs are running slowly. We need to analyze their 8th July SOS reports further, as significant queuing is occurring there.
- The duration of all WebHookWorker jobs was ~78.9% of the total duration of all jobs in the logs, which is what one might expect or hope: the environment was "broken" and generating vast quantities of WebHookWorker jobs, but Sidekiq (as configured) was processing all jobs equally slowly/quickly.
- I noticed that a Sidekiq worker with only one queue assigned doesn't show `max_concurrency` as the limit of the busy count (thread count?), whereas a generic worker does:

  ```
  # max_concurrency 25
  sidekiq 5.2.9 gitlab-rails [9 of 25 busy]

  # one thread (plus a spare)
  sidekiq 5.2.9 queues:project_import_schedule [0 of 2 busy]
  ```
- ❓ In any given Sidekiq worker, will a given queue only be able to use a single thread of the configured `max_concurrency`? If so, this would suggest that, with `max_concurrency` set to 10, the maximum throughput one queue could achieve is 10% of the work that process is able to do (with all other queues fighting over the remaining nine threads). See the `Sidekiq::Workers` sketch after this list.
- On ticket [1] they report that if an MR is stuck on "Checking if merge request can be merged..." then closing and re-opening the MR would immediately fix it.
- I guess this has the effect of queuing up fresh Sidekiq jobs for the async mergeability checks.
- ❓ What could have happened to the original Sidekiq job(s)?
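One way to investigate the per-queue threading question above - a sketch only, using the documented `Sidekiq::Workers` API (Sidekiq 5/6 era) and not verified against any customer environment - is to count how many busy threads each process currently has per queue:

```ruby
# Count busy threads per (process, queue) pair, e.g. from a Rails console.
# If a queue could only ever use one thread per process, we would never see
# a count above 1 here; repeatedly seeing higher counts would answer the
# question above with "no".
require 'sidekiq/api'

busy_by_queue = Hash.new(0)
Sidekiq::Workers.new.each do |process_id, _thread_id, work|
  busy_by_queue[[process_id, work['queue']]] += 1
end

busy_by_queue.sort.each do |(process_id, queue), count|
  puts format('%-60s %-35s %d busy thread(s)', process_id, queue, count)
end
```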
Tooling
I think we're missing the ability to capture the state of a Sidekiq environment - in particular the state of its queues. Being able to capture this information when customers see issues would be useful.
I suspect we can get this from the Sidekiq API, but at present the Rails code we have seems very queue-specific, rather than querying and summarizing all queues.
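As a starting point, something along these lines could be run from a Rails console (or via `gitlab-rails runner`) to snapshot overall and per-queue state. This is a sketch based on the documented Sidekiq API, not an existing GitLab tool:

```ruby
# Snapshot overall Sidekiq state, per-queue backlog/latency, and per-process
# busy threads vs. configured concurrency.
require 'sidekiq/api'

stats = Sidekiq::Stats.new
puts "enqueued=#{stats.enqueued} retries=#{stats.retry_size} " \
     "scheduled=#{stats.scheduled_size} dead=#{stats.dead_size}"

# Per-queue size and latency (seconds the oldest job has been waiting)
Sidekiq::Queue.all.sort_by(&:size).reverse_each do |queue|
  puts format('%-45s size=%-6d latency=%.1fs', queue.name, queue.size, queue.latency)
end

# Per-process busy threads vs. configured concurrency, and assigned queues
Sidekiq::ProcessSet.new.each do |process|
  puts "#{process['identity']} busy=#{process['busy']}/#{process['concurrency']} " \
       "queues=#{process['queues'].join(',')}"
end
```

Capturing this alongside an SOS report when a customer sees queuing would make it easier to tell whether jobs are backing up in Redis or running slowly once picked up.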