Sidekiq ScheduledSet problems
In recent Sidekiq incidents 1 and 2, part of the problem was a significant backlog in the ScheduledSet (jobs were delayed by an hour or more), in one case reaching 1.8M jobs and still growing until we took remedial action.
The mitigation was to lower the number of catchall workers, which reduced the CPU time Redis spent in brpop and freed up CPU time for the Scheduled Set processing (zrangebyscore and zrem, mostly). The work in &447 (closed) and &469 (closed) should help mitigate that more generally in the near future, but there is still some concern about the scalability of the scheduled set processing. In particular, during these incidents the scheduled set backlog processing seemed to cause not only CPU saturation on the core thread, but also high CPU use by the IO threads. This relationship is complicated at the best of times given the way IO threads are implemented in Redis, and I'm not convinced we fully understood the nature of the behavior (we were focused on restoring normal operations at the time).
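To make the zrangebyscore/zrem cost concrete, here is a minimal sketch of the polling loop as I understand it from older Sidekiq versions: fetch one due job, try to remove it, and enqueue it only if the removal succeeded. It is an in-memory Python simulation, not Sidekiq code; `SortedSet`, `enqueue_due_jobs`, and the round-trip counter are hypothetical names for illustration, and newer Sidekiq releases may do the pop differently (for example in a single server-side script), which is one thing the code review should confirm.

```python
import bisect


class SortedSet:
    """Tiny in-memory stand-in for a Redis sorted set (hypothetical helper)."""

    def __init__(self):
        self._items = []  # list of (score, member), kept sorted by score

    def zadd(self, member, score):
        bisect.insort(self._items, (score, member))

    def zrangebyscore_first(self, max_score):
        """Lowest-scored member with score <= max_score, or None."""
        if self._items and self._items[0][0] <= max_score:
            return self._items[0][1]
        return None

    def zrem(self, member):
        """Remove member; return True only if it was still present."""
        for i, (_, m) in enumerate(self._items):
            if m == member:
                del self._items[i]
                return True
        return False

    def __len__(self):
        return len(self._items)


def enqueue_due_jobs(schedule, queue, now):
    """Move every job due at or before `now` onto `queue`.

    Returns the number of simulated Redis commands issued, to show that
    each due job costs one range query plus one removal in this model.
    """
    commands = 0
    while True:
        commands += 1  # zrangebyscore ... LIMIT 0 1
        job = schedule.zrangebyscore_first(now)
        if job is None:
            break
        commands += 1  # zrem
        # Only the poller whose zrem succeeds enqueues the job, so several
        # pollers racing on the same entry do not enqueue it twice.
        if schedule.zrem(job):
            queue.append(job)
    return commands


schedule = SortedSet()
for name, due_at in [("a", 10), ("b", 20), ("c", 30)]:
    schedule.zadd(name, due_at)

queue = []
commands = enqueue_due_jobs(schedule, queue, now=20)
```

Under this model a backlog of N due jobs costs roughly 2N commands on the Redis core thread per drain, on top of whatever brpop traffic the workers generate, which is consistent with the backlog falling behind once Redis CPU is saturated.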
This pattern of behavior suggests we may be on the edge of being able to handle this set: OK normally, but with no visibility that we are on that edge until we hit a tipping point and simply can't keep up. #1171 (closed) will add visibility, but it may be valuable to spend some time investigating what the limits are (perhaps in an artificial environment), and whether there are ways we can make Sidekiq handle these better. The first step would be to review the Sidekiq code, write it up for human review, and reason about possible limits.
Separately but related, we currently have almost 700 jobs in the Scheduled Set due to be executed anywhere from 2025 to 3021 (all CreateEvidenceWorker in the samples taken). This seems sub-optimal, so while doing this we should also look into where these came from, whether we should just delete them (or otherwise handle them), and whether we should alert or take more action when these show up. To be clear, they're not causing any measurable trouble that I can see, so it's more a matter of data hygiene while we're looking.
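As a starting point for that hygiene check, here is a rough sketch of how the far-future entries could be identified and grouped by worker class. It assumes each scheduled-set entry is a JSON job payload with a `class` field and a score that is the due time in epoch seconds; the function names and the one-year horizon are hypothetical choices for illustration, not an agreed threshold.

```python
import json
from collections import Counter
from datetime import datetime, timedelta, timezone


def far_future_jobs(entries, now, horizon=timedelta(days=365)):
    """Return (job, score) pairs due later than `now + horizon`.

    `entries` is an iterable of (payload_json, score) pairs, where score
    is assumed to be the scheduled time in epoch seconds.
    """
    cutoff = (now + horizon).timestamp()
    return [
        (json.loads(payload), score)
        for payload, score in entries
        if score > cutoff
    ]


def count_by_class(jobs):
    """Count far-future jobs per worker class."""
    return Counter(job.get("class", "<unknown>") for job, _ in jobs)


# Made-up sample data, only to exercise the functions.
now = datetime(2021, 1, 1, tzinfo=timezone.utc)
entries = [
    (json.dumps({"class": "SomeWorker", "jid": "1"}),
     datetime(2021, 1, 2, tzinfo=timezone.utc).timestamp()),
    (json.dumps({"class": "CreateEvidenceWorker", "jid": "2"}),
     datetime(3021, 1, 1, tzinfo=timezone.utc).timestamp()),
]

suspicious = far_future_jobs(entries, now)
by_class = count_by_class(suspicious)
```

Against a real instance, the `(payload, score)` pairs could come from something like redis-py's `zrangebyscore("schedule", cutoff, "+inf", withscores=True)`; the key name (and any namespace prefix) is an assumption that should be checked against our configuration before running anything. The same filter could also back an alert on the count of entries beyond the horizon.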