Improve performance and accuracy of tracking concurrency limits
Context
The following discussion from !165079 (merged) should be addressed:
- @DylanGriffith started a discussion (+3 comments): I'm curious how expensive this `tally` operation is. I believe we get all of this data from Sidekiq Redis. How many queries is it?

Specifically this comment: !165079 (comment 2114165789)

One concern I have with the existing approach is that it seems to me we're looping over all Sidekiq processes, and by the time we get to the last process we've probably got out-of-date information on the first process. Maybe that's fine if this is a statistical process, but I also have to imagine there is a more efficient way of keeping these counts across the full fleet without doing all these expensive Redis checks.
Ideas
- Keep a tally ourselves by writing into a hash of `sidekiq-process-id -> worker class`. Getting a snapshot is then just a single `hgetall`. But the trouble lies in failure handling, e.g. if a Sidekiq process is removed during a deployment, how do we clean up its entry? We could `sscan` the `processes` set and then read out fields in the hash, with periodic clean-ups 🤔 (see the first sketch after this list)
- In the server middleware, write to a Redis key: `redis.set("gitlab:#{Thread.current[:sidekiq_capsule].identity}:#{sidekiq_tid}:job_class", job_class, ex: ttl)`. To retrieve (see the second sketch after this list):
  - Run `sscan("processes")` to get a list of `pid`s (Sidekiq's internal process IDs).
  - Run a pipeline of `hkeys("#{pid}:work")` to get an array of arrays of `tid`s.
  - Group each `pid` with its array of `tid`s.
  - Run `mget(...)` across all `pid` + `tid` combinations.
  - Tally the resulting list of worker classes.

  This avoids running `Sidekiq.load_json` + `Gitlab::Json.parse` on up to thousands of job hashes. The TTL handles the clean-up on each deployment, when process and thread IDs are rotated (or we could `unlink` the key when the job is done).
Note that we use `sidekiq_tid` in the structured logger (https://gitlab.com/gitlab-org/gitlab/-/blob/fb34e2c3539d91178ff55e50c2363c7611023af3/lib/gitlab/sidekiq_logging/structured_logger.rb#L115).
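A minimal sketch of the first idea, in Ruby with redis-rb. The hash key name here is made up for illustration; the clean-up cross-checks hash fields against Sidekiq's `processes` set:

```ruby
require 'redis'
require 'set'

# Hypothetical hash key, for illustration only.
TALLY_KEY = 'sidekiq:executing_by_process'

redis = Redis.new

# Snapshot: a single HGETALL, then tally the worker classes.
snapshot = redis.hgetall(TALLY_KEY) # { process_id => worker_class }
tally = snapshot.values.tally

# Periodic clean-up: SSCAN the "processes" set maintained by Sidekiq and
# drop hash fields for processes that no longer exist (e.g. after a deploy).
live = Set.new
cursor = '0'
loop do
  cursor, batch = redis.sscan('processes', cursor)
  live.merge(batch)
  break if cursor == '0'
end

dangling = snapshot.keys.reject { |process_id| live.include?(process_id) }
redis.hdel(TALLY_KEY, *dangling) unless dangling.empty?
```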
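And a sketch of the retrieval side of the second idea, assuming the capsule identity written by the middleware matches the process identity stored in Sidekiq's `processes` set (the real key layout may differ):

```ruby
require 'redis'

redis = Redis.new

# 1. Collect Sidekiq process identities from the "processes" set.
pids = []
cursor = '0'
loop do
  cursor, batch = redis.sscan('processes', cursor)
  pids.concat(batch)
  break if cursor == '0'
end

# 2. Fetch the thread ids (tids) of in-flight jobs for every process
#    from its "#{pid}:work" hash, in a single pipeline.
tids_per_pid = redis.pipelined do |pipeline|
  pids.each { |pid| pipeline.hkeys("#{pid}:work") }
end

# 3. Rebuild the per-(pid, tid) keys written by the server middleware.
keys = pids.zip(tids_per_pid).flat_map do |pid, tids|
  tids.map { |tid| "gitlab:#{pid}:#{tid}:job_class" }
end

# 4. One MGET for all keys, then tally the worker classes. Entries that
#    expired (TTL) or were unlinked come back as nil and are dropped.
tally = keys.empty? ? {} : redis.mget(*keys).compact.tally
```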
Approach
We landed on the use of Redis hashes, as discussed in #490936 (comment 2124532871).
Closing summary
The approach we took was to use a Redis hash per worker class to track the number of concurrent jobs being executed (a sketch of this scheme follows below). The `ConcurrencyLimit::ResumeWorker` performs a periodic clean-up of dangling jobs.
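A minimal sketch of that scheme, assuming a hypothetical key name and a field layout of `jid => started_at`; the actual implementation lives behind the MR and comments linked above and may differ:

```ruby
require 'redis'

class ConcurrencyTracker
  # Assumed maximum job duration before an entry counts as dangling.
  STALE_AFTER = 3 * 60 * 60

  def initialize(redis, worker_class)
    @redis = redis
    # Hypothetical key name: one hash per worker class.
    @key = "sidekiq:concurrency_limit:executing:#{worker_class}"
  end

  # Record the job for the duration of its execution.
  def track(jid)
    @redis.hset(@key, jid, Time.now.to_i)
    yield
  ensure
    @redis.hdel(@key, jid)
  end

  # Current concurrency is a single HLEN, regardless of fleet size.
  def current
    @redis.hlen(@key)
  end

  # Periodic clean-up of dangling entries, in the spirit of
  # ConcurrencyLimit::ResumeWorker.
  def cleanup!
    cutoff = Time.now.to_i - STALE_AFTER
    stale = @redis.hgetall(@key).select { |_jid, ts| ts.to_i < cutoff }.keys
    @redis.hdel(@key, *stale) unless stale.empty?
  end
end
```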
There is a dashboard at https://dashboards.gitlab.net/d/sidekiq-concurrency/sidekiq3a-worker-concurency-detail?orgId=1 which shows worker concurrency and the job queue. The middleware now keeps to the concurrency limits more closely and responds faster when a limit is exceeded.

See the outcome in #490936 (comment 2170423984).
Future work
Discussed briefly in #490936 (comment 2170440452): we could track concurrency for all workers to help better estimate concurrency values.

Tracking the concurrency of all workers would also be beneficial if we were to throttle Sidekiq workers using current concurrency as a reference. E.g. instead of setting concurrency limits to a fixed level, we could set a worker's limit to a percentage of its recent concurrency levels (see the sketch below).
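A toy illustration of that idea; the ratio, floor, and function name are all made up here:

```ruby
# Hypothetical: derive a worker's concurrency limit from a percentage of
# its recently observed concurrency instead of a fixed value.
THROTTLE_RATIO = 0.6 # assumed: allow 60% of the recent concurrency level

def dynamic_limit(recent_concurrency, floor: 5)
  # Never throttle below a small floor, so quiet workers can still ramp up.
  [(recent_concurrency * THROTTLE_RATIO).ceil, floor].max
end

dynamic_limit(200) # => 120
dynamic_limit(3)   # => 5
```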