Sidekiq middleware for dropping duplicate idempotent jobs from a queue

The middleware from #165 (closed) should be extended to drop jobs if the worker is marked as idempotent and a job with the same identity is already enqueued.

It should be possible to disable dropping the jobs using a feature flag.

Retries should run, even if a duplicate was already enqueued.


Original proposal:

# Introduction

During the "October 13 / Sunday night Crypto Miner Limit Takedown" incident (RCA: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8153), a very poorly performing SQL query caused two queues, pipeline_processing:stage_update and pipeline_cache:expire_job_cache, to become congested with over one hundred thousand jobs each.

Since the SQL query in these jobs was leading to CPU saturation on the Postgres primary, and each query was taking several seconds to complete, the backlog quickly built up.

After a while, it became clear that the arguments for the jobs in these queues were in fact identical, and since these jobs are idempotent (i.e., calling the job more than once has the same effect as calling it a single time), most of the processing happening was completely wasteful.

# Proposal

  • Step 1: Explicitly mark jobs as being idempotent. This could follow the pattern established in the Sidekiq attribution work being carried out in gitlab-org/gitlab!18462 (merged). As @reprazent has pointed out, Sidekiq workers should all be idempotent, but in reality, GitLab's are not.

    class MyWorker
      include ApplicationWorker
    
      idempotent
    
      def perform()
        # ....
      end
    end
    1. We should start with pipeline_cache:expire_job_cache as it's already idempotent and was mentioned in the incident (a sketch of how the marker could be implemented follows this list).
  • Step 2: add a client-side middleware that checks whether a job is marked as idempotent and has the same class and arguments as a job that is already enqueued (but not yet running); if so, drop the new job.

    • Having this in place would have reduced the backlog of 100k+ jobs down to a handful.
    • This approach has a built-in negative feedback loop, which helps make our queueing infrastructure less fragile: when an idempotent job starts taking longer and backing up a queue, there is a higher likelihood of it being dropped as a duplicate. This has the effect of damping down poorly performing idempotent jobs.
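
Following on from Step 1, here is a minimal sketch of how the class-level `idempotent` marker could be wired up. The `WorkerAttributes` module name and the plain-Ruby structure are assumptions for illustration; in GitLab this would more likely be an ActiveSupport::Concern mixed into ApplicationWorker.

    # Hypothetical module providing the class-level `idempotent` marker.
    module WorkerAttributes
      def idempotent
        @idempotent = true
      end

      def idempotent?
        !!@idempotent
      end
    end

    class MyWorker
      extend WorkerAttributes

      idempotent

      def perform
        # ...
      end
    end

    MyWorker.idempotent? # => true

The middleware described below only needs the boolean `idempotent?` reader; nothing else about the worker changes.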

# Details

## Duplicate check mechanism

Sidekiq implements its queues as Redis lists (unsurprisingly!). Checking the queues for duplicates would be an O(n) operation and too expensive for production usage. Instead, I propose a client-side and server-side middleware combination with the following roles (a sketch of both middlewares follows the numbered steps below). This approach is O(1), so it should not have a big impact on production performance:

  1. On client-side job creation: (client-middleware)

    1. Middleware checks whether a job is marked as idempotent; if not, it passes through as normal
    2. The middleware calculates the "idempotency string" for the job. This contains the worker class and the arguments for the job.
    3. The middleware calculates the "idempotency hash" for the job. This is simply a hash of the "idempotency string" using SHA256 or a similar algorithm.
    4. The middleware uses this hash in the key of a Redis hash. Using the hash in the key keeps the keys short, while storing the full idempotency string as the field avoids the (incredibly unlikely, but possible) chance of hash collisions.
    5. The middleware queries the Redis hash: HEXISTS gitlab:sidekiq:duplicate:<queue_name>:<idempotency hash> <idempotency string>
    6. If the result exists, the middleware silently drops the job and returns to the client
    7. If the result does not exist, the middleware continues with the following steps:
    8. Adds the key to the hash: HSET gitlab:sidekiq:duplicate:<queue_name>:<idempotency hash> <idempotency string> 1
    9. Sets a TTL of one day on the hash (this is a safety clean-up mechanism which would only be required during incidents): EXPIRE gitlab:sidekiq:duplicate:<queue_name>:<idempotency hash> 86400
    10. The middleware now continues as normal
  2. On the server-side: (server-middleware)

    1. The first three steps are the same as on the client side:
    2. Middleware checks whether a job is marked as idempotent; if not, it passes through as normal
    3. The middleware calculates the "idempotency string" for the job.
    4. The middleware calculates the "idempotency hash" for the job.
    5. The middleware removes the field from the Redis hash: HDEL gitlab:sidekiq:duplicate:<queue_name>:<idempotency hash> <idempotency string>
    6. The middleware now continues as normal
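
Putting the steps above together, here is a sketch of what the two middlewares could look like. The `DuplicateJobs` module name, the `idempotent_worker?` check and the use of the raw job arguments as the identity are assumptions for illustration; feature-flag handling, error handling and metrics are left out.

    require 'digest'
    require 'json'
    require 'sidekiq'

    # Sketch only: names and structure are illustrative, not a final implementation.
    module DuplicateJobs
      KEY_PREFIX = 'gitlab:sidekiq:duplicate'
      TTL = 86_400 # one day; safety clean-up only

      # "Idempotency string": the worker class plus the job's arguments.
      def self.idempotency_string(worker_class, args)
        "#{worker_class}:#{args.to_json}"
      end

      # "Idempotency hash": SHA256 digest of the idempotency string, used in the key.
      def self.key(queue, idempotency_string)
        "#{KEY_PREFIX}:#{queue}:#{Digest::SHA256.hexdigest(idempotency_string)}"
      end

      def self.idempotent_worker?(worker_class)
        klass = worker_class.is_a?(Class) ? worker_class : Object.const_get(worker_class.to_s)
        klass.respond_to?(:idempotent?) && klass.idempotent?
      rescue NameError
        false
      end

      # Client middleware: drop the job if an identical one is already enqueued.
      class Client
        def call(worker_class, job, queue, _redis_pool)
          return yield unless DuplicateJobs.idempotent_worker?(worker_class)

          field = DuplicateJobs.idempotency_string(worker_class, job['args'])
          key = DuplicateJobs.key(queue, field)

          duplicate = Sidekiq.redis do |redis|
            next true if redis.hexists(key, field)

            redis.hset(key, field, 1)
            redis.expire(key, TTL)
            false
          end

          # Returning without yielding stops Sidekiq from pushing the job.
          return false if duplicate

          yield
        end
      end

      # Server middleware: clear the marker so the next identical job can be enqueued again.
      class Server
        def call(worker, job, queue)
          if DuplicateJobs.idempotent_worker?(worker.class)
            field = DuplicateJobs.idempotency_string(worker.class, job['args'])
            Sidekiq.redis { |redis| redis.hdel(DuplicateJobs.key(queue, field), field) }
          end

          yield
        end
      end
    end

The client middleware would be registered via config.client_middleware in Sidekiq.configure_client (and in Sidekiq.configure_server, since server processes also enqueue jobs), and the server middleware via config.server_middleware in Sidekiq.configure_server.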

This approach will drop duplicates but maintain performance. For some workers, such as pipeline_processing:stage_update, and especially during times of bad performance (due to a Postgres, Redis or Gitaly issue), this could lead to a big reduction in Sidekiq activity, helping reduce the impact of the incident.

## Guarantees

It's important that developers do not treat this middleware as a guarantee that duplicate jobs will always be dropped. There are race conditions that may lead to jobs being added twice (as could happen before the introduction of this change), but these should be rare. Additionally, we would need to consider how to handle retries. The two options would be:

  1. Always retry: simple to implement, but could lead to some duplicates
  2. Check whether a job was added while another instance was running and drop the retry in favour of the more recently queued job. This has some observability issues (such as retries effectively having different jids and correlation_ids, which is not ideal).

Personally, for retries I think we should just allow duplicates.

The only guarantee we should make is that the change will never drop a job unless another job with matching identity is queued up (but not running).
