Sidekiq middleware for dropping duplicate idempotent jobs from a queue
The middleware from #165 (closed) should be extended to drop jobs if the worker is marked as idempotent and a job with the same identity is already enqueued.
It should be possible to disable dropping the jobs using a feature flag.
Retries should run, even if there was a duplicate already enqueued.
Original proposal:
During the "October 13 / Sunday night Crypto Miner Limit Takedown" incident (RCA: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8153), a very poorly performing SQL query led to two queues, `pipeline_processing:stage_update` and `pipeline_cache:expire_job_cache`, becoming congested with over one hundred thousand jobs each.
Since the SQL query in these jobs was saturating CPU on the Postgres primary, and each query was taking several seconds to complete, the backlog quickly built up.
After a while, it became clear that the arguments for the jobs in these queues were in fact identical, and since these jobs are idempotent (i.e., calling the job more than once has the same effect as calling it a single time), most of the processing happening was completely wasteful.
Proposal
- Step 1: Explicitly mark jobs as being idempotent. This could follow the pattern established in the Sidekiq attribution work being carried out in gitlab-org/gitlab!18462 (merged). As @reprazent has pointed out, Sidekiq workers should all be idempotent, but in reality, GitLab's are not.
```ruby
class MyWorker
  include ApplicationWorker

  idempotent

  def perform
    # ....
  end
end
```
- We should start with `pipeline_cache:expire_job_cache`, as it's already idempotent and was mentioned in the incident.
- Step 2: Add a client-side middleware that checks whether a job is idempotent and has an identical class and arguments to an already-enqueued job; if such a job is enqueued but not yet running, drop the new job.
- Having this in place would have reduced the backlog of 100k+ jobs down to a handful.
- This approach has a built-in negative feedback loop, which helps make our queueing infrastructure less fragile: when an idempotent job starts taking longer and backing up a queue, there is a higher likelihood of it being dropped as a duplicate. This has an effect of dampening down poorly performing idempotent jobs.
Details
Duplicate check mechanism
Sidekiq implements its queues as Redis lists (unsurprisingly!). Checking the queues for duplicates would be an O(n) operation and too expensive for production usage. Instead, I propose a client-side and server-side middleware combination, with the following roles. This approach is O(1), so it should not have a big impact on production performance:
- On client-side job creation (client middleware):
- Middleware checks whether a job is marked as idempotent, if not it passes through as normal
- The middleware calculates the "idempotency string" for the job. This contains the worker class and the arguments for the job.
- The middleware calculates the "idempotency hash" for the job. This is simply a hash of the "idempotency string", using SHA256 or a similar digest.
- The middleware will use this hash as the address of a Redis hash. Keying on the hash keeps the keys short, while storing the full idempotency string as the field avoids the (incredibly unlikely, but possible) chance of hash collisions.
- The middleware queries the Redis hash:
HEXISTS gitlab:sidekiq:duplicate:<queue_name>:<idempotency hash> <idempotency string>
- If the field exists, the middleware silently drops the job and returns to the client
- If the field does not exist, the middleware continues with the following steps:
- Adds the key to the hash:
HSET gitlab:sidekiq:duplicate:<queue_name>:<idempotency hash> <idempotency string> 1
- Sets a TTL of one day on the hash (this is a safety clean-up mechanism which would only be required during incidents):
EXPIRE gitlab:sidekiq:duplicate:<queue_name>:<idempotency hash> 86400
- The middleware now continues as normal
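The client-side steps above can be sketched as a Sidekiq client middleware. This is a minimal illustration of the proposal, not an actual implementation: the `DropDuplicatesClientMiddleware` name, the `idempotent?` class predicate, and the in-memory `FakeRedis` stand-in (which replaces a real Redis client and ignores TTLs) are all assumptions made for the sketch.

```ruby
require 'digest'
require 'json'

# In-memory stand-in for the Redis hash commands used below
# (HEXISTS / HSET / EXPIRE). A real implementation would use a
# Redis client; this fake ignores TTLs.
class FakeRedis
  def initialize
    @hashes = {}
  end

  def hexists(key, field)
    @hashes.fetch(key, {}).key?(field)
  end

  def hset(key, field, value)
    (@hashes[key] ||= {})[field] = value
  end

  def expire(key, _seconds)
    # No-op in this sketch; Redis would delete the key after the TTL.
  end
end

# Hypothetical client middleware: drops a job when an identical
# idempotent job is already enqueued.
class DropDuplicatesClientMiddleware
  DUPLICATE_KEY_TTL = 86_400 # one day, the safety clean-up mechanism

  def initialize(redis)
    @redis = redis
  end

  def call(worker_class, job, queue, _redis_pool)
    return yield unless idempotent?(worker_class)

    # "Idempotency string": worker class plus the job's arguments.
    idempotency_string = "#{worker_class}:#{JSON.dump(job['args'])}"
    # "Idempotency hash": keeps the Redis key short.
    idempotency_hash = Digest::SHA256.hexdigest(idempotency_string)
    key = "gitlab:sidekiq:duplicate:#{queue}:#{idempotency_hash}"

    # An identical job is already enqueued: silently drop this one.
    return false if @redis.hexists(key, idempotency_string)

    @redis.hset(key, idempotency_string, 1)
    @redis.expire(key, DUPLICATE_KEY_TTL)
    yield
  end

  private

  def idempotent?(worker_class)
    worker_class.respond_to?(:idempotent?) && worker_class.idempotent?
  end
end
```

Enqueuing the same idempotent worker twice with the same arguments would then only yield to Sidekiq once; the second enqueue is dropped before it reaches the queue.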
- On the server side (server middleware):
- First three steps are the same as the client side:
- Middleware checks whether a job is marked as idempotent, if not it passes through as normal
- The middleware calculates the "idempotency string" for the job.
- The middleware calculates the "idempotency hash" for the job.
- The middleware removes the field from the Redis hash:
HDEL gitlab:sidekiq:duplicate:<queue_name>:<idempotency hash> <idempotency string>
- The middleware now continues as normal
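The server side is the simpler half: once the job is picked up for execution, the duplicate marker is cleared so later enqueues are no longer dropped. Again a hedged sketch, with hypothetical names mirroring the client-side example:

```ruby
require 'digest'
require 'json'

# Hypothetical server middleware: clears the duplicate marker when
# the job starts running, so duplicates enqueued from now on are
# accepted (the running job may not observe their effects).
class DropDuplicatesServerMiddleware
  def initialize(redis)
    @redis = redis
  end

  def call(worker, job, queue)
    klass = worker.class
    return yield unless klass.respond_to?(:idempotent?) && klass.idempotent?

    # Same idempotency string/hash derivation as the client middleware.
    idempotency_string = "#{klass}:#{JSON.dump(job['args'])}"
    idempotency_hash = Digest::SHA256.hexdigest(idempotency_string)
    key = "gitlab:sidekiq:duplicate:#{queue}:#{idempotency_hash}"

    # HDEL removes only this job's field from the Redis hash.
    @redis.hdel(key, idempotency_string)

    yield
  end
end
```

Because the marker is deleted before `yield`, a job that arrives while an identical job is *running* is kept, which matches the stated guarantee: a job is only ever dropped when a matching job is queued but not yet running.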
This approach will drop duplicates but maintain performance. For some workers, such as `pipeline_processing:stage_update`, and especially during times of bad performance (due to Postgres, Redis, or Gitaly issues), this could lead to a big reduction in Sidekiq activity, helping reduce the impact of an incident.
Guarantees
It's important that this middleware is not treated by developers as a guarantee that the job will be dropped. There are race conditions that may lead to jobs being added twice (as it would be before the introduction of this change), but these should be rare. Additionally, we would need to consider how to handle retries. The two options would be:
- Always retry: simple to implement, but could lead to some duplicates
- Check whether a job was added while another instance was running and drop the retry in favour of the more recently queued job. This has some observability issues (such as retries effectively having different `jid`s and `correlation_id`s, which is not ideal).
Personally, for retries I think we should just allow duplicates.
The only guarantee we should make is that the change will never drop a job unless another job with matching identity is queued up (but not running).