Improve performance and accuracy of tracking concurrency limits
Context
The following discussion from !165079 (merged) should be addressed:
- @DylanGriffith started a discussion (+3 comments): I'm curious how expensive this `tally` operation is. I believe we get all of this data from Sidekiq Redis. How many queries is it?

Specifically this comment: !165079 (comment 2114165789)

One concern I have with the existing approach is that it seems to me we're looping over all Sidekiq processes, and by the time we get to the last process we've probably got out-of-date information on the first process. Maybe that's fine if this is a statistical process, but I also have to imagine there is a more efficient way of keeping these counts across the full fleet without doing all these expensive Redis checks.
Ideas
- Keep a tally ourselves by writing into a hash of `sidekiq-process-id -> worker class`. Getting a snapshot is then just a single `hgetall`. But the trouble lies in failure handling, e.g. if a Sidekiq process is removed during a deployment, how do we clean up its entry? We could `sscan` the `processes` set and then read out fields in the hash, with periodic clean-ups 🤔 (see the first sketch after this list)
- In the server middleware, write to a Redis key: `redis.set("gitlab:#{Thread.current[:sidekiq_capsule].identity}:#{sidekiq_tid}:job_class", job_class, ex: ttl)`. To retrieve (see the second sketch after this list):
  - Run `sscan("processes")` to get a list of `pid`s (Sidekiq's internal process IDs).
  - Run a pipeline of `hkeys("#{pid}:work")` to get an array of arrays of `tid`s.
  - Group each `pid` with its array of `tid`s.
  - Run `mget(...)` across all `pid` + `tid` combinations.
  - Tally the resulting list of worker classes.

  This avoids running `Sidekiq.load_json` + `Gitlab::Json.parse` on up to thousands of job hashes. The TTL handles the clean-up on each deployment, when process and thread IDs are rotated (or we could `unlink` the key when the job is done).
Note that we use `sidekiq_tid` in the structured logger (https://gitlab.com/gitlab-org/gitlab/-/blob/fb34e2c3539d91178ff55e50c2363c7611023af3/lib/gitlab/sidekiq_logging/structured_logger.rb#L115).
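A minimal sketch of the first idea, in Ruby with redis-rb. The hash key name here is made up for illustration; the clean-up cross-checks hash fields against Sidekiq's `processes` set:

```ruby
require 'redis'
require 'set'

# Hypothetical hash key, for illustration only.
TALLY_KEY = 'sidekiq:executing_by_process'

redis = Redis.new

# Snapshot: a single HGETALL, then tally the worker classes.
snapshot = redis.hgetall(TALLY_KEY) # { process_id => worker_class }
tally = snapshot.values.tally

# Periodic clean-up: SSCAN the "processes" set maintained by Sidekiq and
# drop hash fields for processes that no longer exist (e.g. after a deploy).
live = Set.new
cursor = '0'
loop do
  cursor, batch = redis.sscan('processes', cursor)
  live.merge(batch)
  break if cursor == '0'
end

dangling = snapshot.keys.reject { |process_id| live.include?(process_id) }
redis.hdel(TALLY_KEY, *dangling) unless dangling.empty?
```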
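And a sketch of the retrieval side of the second idea, assuming the capsule identity written by the middleware matches the process identity stored in Sidekiq's `processes` set (the real key layout may differ):

```ruby
require 'redis'

redis = Redis.new

# 1. Collect Sidekiq process identities from the "processes" set.
pids = []
cursor = '0'
loop do
  cursor, batch = redis.sscan('processes', cursor)
  pids.concat(batch)
  break if cursor == '0'
end

# 2. Fetch the thread ids (tids) of in-flight jobs for every process
#    from its "#{pid}:work" hash, in a single pipeline.
tids_per_pid = redis.pipelined do |pipeline|
  pids.each { |pid| pipeline.hkeys("#{pid}:work") }
end

# 3. Rebuild the per-(pid, tid) keys written by the server middleware.
keys = pids.zip(tids_per_pid).flat_map do |pid, tids|
  tids.map { |tid| "gitlab:#{pid}:#{tid}:job_class" }
end

# 4. One MGET for all keys, then tally the worker classes. Entries that
#    expired (TTL) or were unlinked come back as nil and are dropped.
tally = keys.empty? ? {} : redis.mget(*keys).compact.tally
```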
Approach
We landed on the use of Redis hashes, as discussed in #490936 (comment 2124532871).
Closing summary
The approach we took was to use a Redis hash per worker class to track the number of concurrent jobs being executed (a sketch of this scheme follows below). The `ConcurrencyLimit::ResumeWorker` performs a periodic clean-up of dangling jobs.
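A minimal sketch of that scheme, assuming a hypothetical key name and a field layout of `jid => started_at`; the actual implementation lives behind the MR and comments linked above and may differ:

```ruby
require 'redis'

class ConcurrencyTracker
  # Assumed maximum job duration before an entry counts as dangling.
  STALE_AFTER = 3 * 60 * 60

  def initialize(redis, worker_class)
    @redis = redis
    # Hypothetical key name: one hash per worker class.
    @key = "sidekiq:concurrency_limit:executing:#{worker_class}"
  end

  # Record the job for the duration of its execution.
  def track(jid)
    @redis.hset(@key, jid, Time.now.to_i)
    yield
  ensure
    @redis.hdel(@key, jid)
  end

  # Current concurrency is a single HLEN, regardless of fleet size.
  def current
    @redis.hlen(@key)
  end

  # Periodic clean-up of dangling entries, in the spirit of
  # ConcurrencyLimit::ResumeWorker.
  def cleanup!
    cutoff = Time.now.to_i - STALE_AFTER
    stale = @redis.hgetall(@key).select { |_jid, ts| ts.to_i < cutoff }.keys
    @redis.hdel(@key, *stale) unless stale.empty?
  end
end
```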
There is a dashboard at https://dashboards.gitlab.net/d/sidekiq-concurrency/sidekiq3a-worker-concurency-detail?orgId=1 which shows worker concurrency and the job queue. The middleware now keeps to the concurrency limits more closely and responds faster when a limit is exceeded.

See the outcome in #490936 (comment 2170423984).
Future work
Discussed briefly in #490936 (comment 2170440452): we could track concurrency for all workers to help better estimate concurrency values.

Tracking the concurrency of all workers would also be beneficial if we were to throttle Sidekiq workers using current concurrency as a reference. E.g. instead of setting concurrency limits to a fixed level, we could set a worker's limit to a percentage of its recent concurrency levels (see the sketch below).
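A toy illustration of that idea; the ratio, floor, and function name are all made up here:

```ruby
# Hypothetical: derive a worker's concurrency limit from a percentage of
# its recently observed concurrency instead of a fixed value.
THROTTLE_RATIO = 0.6 # assumed: allow 60% of the recent concurrency level

def dynamic_limit(recent_concurrency, floor: 5)
  # Never throttle below a small floor, so quiet workers can still ramp up.
  [(recent_concurrency * THROTTLE_RATIO).ceil, floor].max
end

dynamic_limit(200) # => 120
dynamic_limit(3)   # => 5
```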