The middleware from #165 (closed) should be extended to drop jobs if the worker is marked as idempotent and a job with the same identity is already enqueued.
It should be possible to disable dropping the jobs using a feature flag.
Retries should run, even if there was a duplicate already enqueued.
Original proposal:
# Introduction
During the "October 13 / Sunday night Crypto Miner Limit Takedown" incident, (RCA https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8153) a very poorly performing SQL query led to two queues pipeline_processing:stage_update and pipeline_cache:expire_job_cache congest with over one hundred thousand jobs in each queue.
Since the SQL query in these jobs was leading to CPU saturation on the Postgres primary, and each query was taking several seconds to complete, the backlog quickly built up.
After a while, it became clear that the arguments for the jobs in these queues were in fact identical, and since these jobs are idempotent (i.e., calling the job more than once has the same effect as calling it a single time), most of the processing happening was completely wasteful.
# Proposal
Step 1: Explicitly mark jobs as being idempotent. This could follow the pattern established in the Sidekiq attribution work being carried out in gitlab-org/gitlab!18462 (merged). As @reprazent has pointed out, Sidekiq workers should all be idempotent, but in reality, GitLab's are not.
We should start with pipeline_cache:expire_job_cache as it's already idempotent and was mentioned in the incident.
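As a rough illustration of Step 1, a class-level marker along these lines could work; the module name, method names and `perform` signature below are assumptions, not the final API:

```ruby
require 'sidekiq'

# Illustrative sketch only: a class-level `idempotent!` marker, loosely modelled
# on the worker attribute pattern mentioned above. Names are assumptions.
module WorkerIdempotency
  def idempotent!
    @idempotent = true
  end

  def idempotent?
    !!@idempotent
  end
end

class ExpireJobCacheWorker
  include Sidekiq::Worker
  extend WorkerIdempotency

  idempotent! # running this job twice with the same args has the same effect as running it once

  def perform(job_id)
    # ...expire the cached JSON for the job...
  end
end
```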
Step 2: Add a client-side middleware that checks whether a job is idempotent and has identical arguments/class to an already enqueued job; if such a job is enqueued but not running, drop the new job.
Having this in place would have reduced the backlog of 100k+ jobs down to a handful.
This approach has a built-in negative feedback loop, which helps make our queueing infrastructure less fragile: when an idempotent job starts taking longer and backing up a queue, there is a higher likelihood of it being dropped as a duplicate. This has the effect of damping down poorly performing idempotent jobs.
# Details

## Duplicate check mechanism
Sidekiq implements its queues as Redis lists (unsurprisingly!). Checking the queues for duplicates would be an O(n) operation and too expensive for production usage. Instead, I propose a client-side and server-side middleware combination, with the following roles. This approach is O(1), so it should not have a big impact on production performance:
### On client-side job creation (client middleware)

1. The middleware checks whether the job is marked as idempotent; if not, it passes through as normal.
2. The middleware calculates the "idempotency string" for the job. This contains the worker class and the arguments for the job.
3. The middleware calculates the "idempotency hash" for the job. This is simply a hash of the "idempotency string", using SHA256 or similar.
4. The middleware will use this hash as the key of a Redis hash. Using the hash keeps the keys short, while storing the full idempotency string as the hash field guards against the (incredibly unlikely, but possible) chance of hash collisions.
5. The middleware queries the Redis hash: `HEXISTS gitlab:sidekiq:duplicate:<queue_name>:<idempotency hash> <idempotency string>`
6. If the result exists, the middleware silently drops the job and returns to the client.
7. If the result does not exist, the middleware continues with the following steps:
8. Adds the key to the hash: `HSET gitlab:sidekiq:duplicate:<queue_name>:<idempotency hash> <idempotency string> 1`
9. Sets a TTL of one day on the hash (this is a safety clean-up mechanism which would only be required during incidents): `EXPIRE gitlab:sidekiq:duplicate:<queue_name>:<idempotency hash> 86400`
10. The middleware now continues as normal.
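A minimal Ruby sketch of this client-side flow, assuming the `idempotent?` marker from the Step 1 sketch and the key layout above (the class name is illustrative, not the final implementation):

```ruby
require 'sidekiq'
require 'digest'
require 'json'

# Sketch of the proposed client middleware; not the final implementation.
class DuplicateJobClientMiddleware
  DUPLICATE_KEY_TTL = 86_400 # one day, the safety clean-up from step 9

  def call(worker_class, job, queue, _redis_pool)
    return yield unless idempotent?(worker_class)

    idempotency_string = "#{job['class']}:#{job['args'].to_json}"
    key = "gitlab:sidekiq:duplicate:#{queue}:#{Digest::SHA256.hexdigest(idempotency_string)}"

    duplicate = Sidekiq.redis do |redis|
      if redis.hexists(key, idempotency_string)
        true # an identical job is already enqueued
      else
        redis.hset(key, idempotency_string, 1)
        redis.expire(key, DUPLICATE_KEY_TTL)
        false
      end
    end

    duplicate ? false : yield # not yielding silently drops the job
  end

  private

  # Assumed hook from Step 1; `worker_class` can arrive as a String.
  def idempotent?(worker_class)
    klass = worker_class.is_a?(Class) ? worker_class : Object.const_get(worker_class.to_s)
    klass.respond_to?(:idempotent?) && klass.idempotent?
  rescue NameError
    false
  end
end
```

It would be registered through the usual Sidekiq middleware chain, e.g. `Sidekiq.configure_client { |config| config.client_middleware { |chain| chain.add DuplicateJobClientMiddleware } }`.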
### On the server-side (server middleware)

The first three steps are the same as the client side:

1. The middleware checks whether the job is marked as idempotent; if not, it passes through as normal.
2. The middleware calculates the "idempotency string" for the job.
3. The middleware calculates the "idempotency hash" for the job.
4. The middleware removes the field from the Redis hash: `HDEL gitlab:sidekiq:duplicate:<queue_name>:<idempotency hash> <idempotency string>`
5. The middleware now continues as normal.
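And a matching server-side sketch, clearing the marker when the job starts running (names are again illustrative; the key calculation mirrors the client sketch):

```ruby
require 'sidekiq'
require 'digest'
require 'json'

# Sketch of the proposed server middleware; it only clears the duplicate marker.
# The idempotency check is omitted for brevity: HDEL on a missing key is harmless.
class DuplicateJobServerMiddleware
  def call(_worker, job, queue)
    idempotency_string = "#{job['class']}:#{job['args'].to_json}"
    key = "gitlab:sidekiq:duplicate:#{queue}:#{Digest::SHA256.hexdigest(idempotency_string)}"

    # Remove the field so new duplicates can be scheduled once this job is running.
    Sidekiq.redis { |redis| redis.hdel(key, idempotency_string) }

    yield
  end
end
```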
This approach will drop duplicates but maintain performance. For some workers, such as pipeline_processing:stage_update, and especially during times of bad performance (due to a Postgres, Redis or Gitaly issue), this could lead to a big reduction in Sidekiq activity, helping reduce the impact of the incident.
# Guarantees
It's important that this middleware is not treated by developers as a guarantee that the job will be dropped. There are race conditions that may lead to jobs being added twice (as could happen before the introduction of this change), but these should be rare. Additionally, we would need to consider how to handle retries. The two options would be:
1. Always retry: simple to implement, but could lead to some duplicates.
2. Check whether a job was added while another instance was running and drop the retry in favour of the more recently queued job. This has some observability issues (such as retries effectively having different jids and correlation_ids, which is not ideal).
Personally, for retries I think we should just allow duplicates.
The only guarantee we should make is that the change will never drop a job unless another job with matching identity is queued up (but not running).
Andrew Newdigate changed title from Sidekiq Client Middleware/Mixin for dropping duplicate jobs from a queue to Sidekiq Client Middleware/Mixin for dropping duplicate idempotent jobs from a queue
Andrew Newdigate changed title from Sidekiq Client Middleware/Mixin for dropping duplicate idempotent jobs from a queue to Sidekiq middleware for dropping duplicate idempotent jobs from a queue
For Pipelines specifically, we likely need a more targeted approach: https://gitlab.com/gitlab-org/gitlab-ce/issues/65538. As described there, this is needed to perform efficient processing of pipelines, especially with DAG.
Currently, we use a single approach to drop duplicates: removing concurrent duplicates, which is achieved with ExclusiveLease.
We will not be able to use sidekiq-unique-jobs, as this will clash with reliable-fetch.
Dropping duplicates should be based on the desired outcome: https://github.com/mhenrixon/sidekiq-unique-jobs#locks. This describes the different locking mechanisms very well. I wish we could adopt that gem; it would likely be the simplest option. The idempotent case seems to map to :until_expired.
As far as I understand, the gem maintains its own locks on queuing, starting and stopping, similar to what Andrew is describing in his deduplication proposal. But I'm not familiar with how reliable fetch hooks into Sidekiq.
> Dropping duplicates should be based on the desired outcome: https://github.com/mhenrixon/sidekiq-unique-jobs#locks. This describes the different locking mechanisms very well. I wish we could adopt that gem; it would likely be the simplest option. The idempotent case seems to map to :until_expired.
Wouldn't we want :until_and_while_executing so we cannot schedule new jobs when the same job (same class + arguments) is already scheduled? Though as soon as the job starts, there could be a state change somewhere that warrants a re-run after the first completes.
As an example: Push mirrors get scheduled on each push, we don't need duplicates, but every time a new push comes we need a new job as the currently running one might not have included all changes.
> As an example: Push mirrors get scheduled on each push, we don't need duplicates, but every time a new push comes we need a new job as the currently running one might not have included all changes.
Oh this is a neat example! So as you said, it's semantically safe to discard duplicate idempotent jobs only until the first job (the one we didn't discard) starts running. At that point we need to keep the next identical job request, even if it arrives while the 1st one is running, because the scope of the running job is ambiguous.
Do we also need to ensure the 2nd of the identical job requests doesn't start running until the 1st one completes? If we're not certain that the jobs are both idempotent and concurrency-safe, then it seems like they'd need some serialization mechanism, either in the job scheduler or internal to the job itself. (I know we're talking about a specific example here, but I'm asking the concurrency safety question in general, not just in the context of the steps and stateful data of push-mirror jobs.)
I think @reprazent is right about sidekiq-unique-jobs: it can be used with the reliable fetcher because it creates a client and server middleware as this issue proposes. The reliable fetcher hooks into the Sidekiq processor.
> Oh this is a neat example! So as you said, it's semantically safe to discard duplicate idempotent jobs only until the first job (the one we didn't discard) starts running. At that point we need to keep the next identical job request, even if it arrives while the 1st one is running, because the scope of the running job is ambiguous.
Yep, that's right, we don't know where in the process the running job is. Something might have changed since its start, causing a new job to be scheduled.
> Do we also need to ensure the 2nd of the identical job requests doesn't start running until the 1st one completes? If we're not certain that the jobs are both idempotent and concurrency-safe, then it seems like they'd need some serialization mechanism, either in the job scheduler or internal to the job itself. (I know we're talking about a specific example here, but I'm asking the concurrency safety question in general, not just in the context of the steps and stateful data of push-mirror jobs.)
I think we should be certain that the jobs are concurrency-safe and idempotent. But as a job with identical params should be doing the same thing, we might spend our compute time better on things that aren't running already.
To use the push-mirror example: suppose the first job is pushing a lot of data and takes a long time to complete, and the second job was scheduled for a single updated branch. If the second job starts before the first one completes, it's going to do part of the work that the first job is already doing.
@andrewn the approach sounds great to me. Using an O(1) hash lookup to avoid the O(N) linear search makes great sense.
This is a small thing, but figured it might be worth asking about:
For the following steps in the client-side middleware:
> 5. The middleware queries the Redis hash: `HEXISTS gitlab:sidekiq:duplicate:<queue_name>:<idempotency hash> <idempotency string>`
> 6. If the result exists, the middleware silently drops the job and returns to the client.
> 7. If the result does not exist, the middleware continues with the following steps:
> 8. Adds the key to the hash: `HSET gitlab:sidekiq:duplicate:<queue_name>:<idempotency hash> <idempotency string> 1`
It may be a bit more efficient to combine the above HEXISTS and HSET (steps 5 and 8) into an atomic test-and-set HSETNX. Then use the return value to determine if the hash key existed prior to the call, so your logic in steps 6 and 7 remain the same. This closes one of the windows for a race condition in adding duplicates, and eliminates one Redis call per job.
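A sketch of that variant, assuming the same key and field layout as the middleware sketches above (the helper name is hypothetical):

```ruby
# Hypothetical helper combining the existence check and the write into one
# atomic HSETNX call, as suggested above.
def first_to_enqueue?(key, idempotency_string, ttl: 86_400)
  Sidekiq.redis do |redis|
    if redis.hsetnx(key, idempotency_string, 1)
      redis.expire(key, ttl) # safety TTL, as in step 9 of the proposal
      true                   # first writer: enqueue as normal
    else
      false                  # field already existed: a duplicate is queued, drop the job
    end
  end
end
```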
The job arguments may not match because of some jobs which add a second argument for the JobWaiter class. #86 (closed) has a bit more detail here.
The spikes may be more down to many distinct small jobs rather than duplicates. However, gitlab-org/gitlab#20126 (closed) describes a case where a GitLab instance had more jobs in the queue than users, so we definitely can get duplicates
To decide if this is a good queue to start with, we can try to look at the recent spikes in job size in that queue and validate both of those concerns: how many duplicate jobs did we have? How many duplicate jobs did we have if we only consider the first argument?
I think that because of this, authorized_projects isn't the best queue to start with. Many of the jobs are waited on, which means that we will block another action - typically an HTTP request - to wait for them to happen, or for 10 seconds to pass, whichever is shorter.
We could still handle this. We could only consider the first argument and make other calls wait 10 seconds. Or we could store a list of waiter keys in the only job that's going to be executed and notify all waiters when that job completes. Or something else.
We could also look at whether we need to wait in all cases, and how often we hit the timeout instead of waiting successfully.
However, both of those would add complexity to what's already a pretty involved project. This is a shame, because @andrewn helped me with some Elasticsearch magic to find that over that period, 319,761 authorized_projects jobs used a job waiter, and 106,358 did not. Deduplicating the jobs using a job waiter would be a big win. It's just not the best candidate for the first iteration of this project.
To effectively validate this issue, I think we need a job that:
1. Is idempotent.
2. Is called with identical arguments.
3. Frequently has multiple jobs with the same arguments enqueued.
authorized_projects fails on 2. project_daily_statistics seems to fail on 3, because it has lots of duplicate jobs that aren't in the queue at the same time.
I'm open to ideas for a good candidate job, because this sounds like a fun project otherwise. Maybe we could revisit the jobs mentioned in the description?
- mailers, pipeline_hooks:build_hooks, and web_hook - these all fire some external action, making them not idempotent (and also we can't drop duplicates).
- post_receive - this isn't idempotent but is also very unlikely to have duplicates, as one of the arguments is a list of changes received.

Eyeballing this suggests pipeline_cache:expire_job_cache might be the best candidate, which was also mentioned in the description.
It looks like they are: the message_id field in those logs is the same. @andrewn do you know what would cause that?
I'm not sure how to check if it often is scheduled multiple times, did you have a query handy for that?
Looking at those logs, I think we won't be able to de-duplicate: The jobs run within milliseconds, and from eyeballing that log, they're not overlapping.
If we implement this and it does work well, our task then is to make other Sidekiq queues that back up take the same arguments, and have idempotent jobs, so that we can reuse this elsewhere.
> If we implement this and it does work well, our task then is to make other Sidekiq queues that back up take the same arguments, and have idempotent jobs, so that we can reuse this elsewhere.
Should we start implementing from the other side though: Mark all jobs as not_idempotent and default to considering them idempotent if nothing is specified? As in theory all jobs should be idempotent... We could delegate removing the non_idempotent declaration to the feature category the worker belongs to.
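A sketch of that inversion, reusing the hypothetical marker module from the Step 1 sketch above (again, names are assumptions):

```ruby
# Hypothetical opt-out marker: workers count as idempotent unless they declare otherwise.
module WorkerIdempotency
  def not_idempotent!
    @idempotent = false
  end

  def idempotent?
    @idempotent.nil? ? true : @idempotent
  end
end
```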
@DylanGriffith this could also be interesting to consider from the point of view of existing Elasticsearch indexing jobs. If someone updates the same issue 5 times in a row, we'd be creating Sidekiq jobs like:
Yes, this would help us indeed, but it's not quite as good as your example since, sadly, the updated fields are also part of the Sidekiq arguments, so we will still have waste if someone updates a title and a description field straight after. Here is an example of args for an updated issue:
Since there is an array of changed fields as well it has additional issues with matching for checksums.
As far as I know, the only reason changed_fields is there is that when permissions change for issues and merge requests, the permission change needs to cascade down to all the notes. So there may be other options, like just moving that logic up to the web worker to queue more jobs.
Ahh, I'd forgotten we had changed_fields in there too. And since that's a serialised json string, trying to filter on it generically is going to be very painful.
Given what we use changed_fields for, I wonder if we could make it less busy / remove it entirely? We can do the "have relevant fields changed?" check before enqueuing the sidekiq job. Handling issue notes remains awkward - but could we reduce that to a boolean value, to make these jobs more amenable to being dropped as duplicates?
Looking at the list of changed fields, we might be lucky with some objects already, just because the same fields tend to change each time, but it's definitely not as easy a win as I thought.
Is it possible to get the de-duplication logic for free if Sidekiq queues were implemented using Redis sorted sets instead of lists? If the payload was identical, Redis would just throw away the new job. I also think that Redis sorted sets support all the relevant operations needed by the Sidekiq queues, though this may be trickier to implement than using a middleware, as it may require changes to Sidekiq itself.
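For illustration, sorted sets deduplicate identical members by design: adding the same payload twice only updates its score. The key name and payload below are made up, and this is not how Sidekiq stores its queues today:

```ruby
require 'sidekiq'
require 'json'

# Illustration only: identical members collapse into a single entry in a sorted set.
Sidekiq.redis do |redis|
  payload = { 'class' => 'ExpireJobCacheWorker', 'args' => [42] }.to_json

  redis.zadd('example:sorted-queue', Time.now.to_f, payload)
  redis.zadd('example:sorted-queue', Time.now.to_f, payload) # score updated, no new member

  redis.zcard('example:sorted-queue') # => 1, the duplicate was absorbed
end
```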
> If the result exists, the middleware silently drops the job and returns to the client.
Adding structured logs to this piece of code (when bypassing the enqueuing) might be an interesting way to validate the dropped jobs, before possibly enabling it using a feature flag (which should add little overhead).
> Should we start implementing from the other side though: Mark all jobs as not_idempotent and default to considering them if nothing is provided? As in theory all jobs should be idempotent... We could delegate removing the non_idempotent declaration to the feature category the worker belongs to.
That sounds like part of the effort for automating the check for idempotent workers, but might be OK to do as part of this issue too, considering we pick a queue/worker we're confident about.
Not sure if I follow the "and default to considering them if nothing is provided? As in theory all jobs should be idempotent." part. Defaulting every job as not idempotent and applying the duplicate filter just to the ones that we mark as idempotent (in the initial case ExpireJobCacheWorker) should be enough. WDYT @reprazent?
@smcgivern @reprazent Do you have further points for this one? It's a good candidate for ~"workflow::Ready".
> Not sure if I follow the "and default to considering them if nothing is provided? As in theory all jobs should be idempotent." part. Defaulting every job as not idempotent and applying the duplicate filter just to the ones that we mark as idempotent
@oswaldo I've updated my comment; some words were missing. I meant that I'd prefer to have a not_idempotent marker, and that by default we consider every job to be idempotent.
> Adding structured logs to this piece of code (when bypassing the enqueuing) might be an interesting way to validate the dropped jobs, before possibly enabling it using a feature flag (which should add little overhead).
I really like this idea! So if I understand correctly, we'll start by logging every time a job would be enqueued that already lives in the queue with the same arguments.
In pseudo code, to see if I understood correctly:
```ruby
# So we can track that this job was a duplicate, perhaps figure out the results, look at failures
job['duplicate'] = already_exists_for_args?

if Feature.enabled?(:drop_duplicates) && already_exists_for_args?
  # log a message that we've swallowed the job
else
  yield
end
```
I think this is something we can start on, I'll move this to ~"workflow::Ready". Please move back if we missed something.
@oswaldo During deduplication of the queue, I don't think we need to check if a worker is idempotent. If we include the arguments in the decision to schedule in the client middleware, any job scheduled (but not started) with the same arguments would be the same work, correct?
The client skips pushing a job on the queue, if there was already an unperformed job scheduled but not started.
Remind me again why we need to check for idempotency at the time of scheduling?
It might not come out as a requirement here. Though non-idempotent jobs, even with the same arguments, might have different side-effects depending on when they actually run, which might make them unsafe to drop from the queue?
Maybe @andrewn can clarify if that's what he had in mind.
This would be the first manipulation of a job in client middleware that wouldn't show up in the STDOUT logs of the Sidekiq process, if it is run from a client that isn't Sidekiq itself, for example Puma or Unicorn (so most schedulings).
I'll write this out to what we currently call Gitlab::SidekiqLogger, which writes to `Rails.root.join('log', 'sidekiq.log')` and is currently unstructured.
This isn't necessarily blocked by those issues yet; development can continue, but we can't properly test this (enable the feature flag) without visibility.
I've marked the AuthorizedProjectsWorker as idempotent in gitlab-org/gitlab!26794 (merged) so when that is deployed, we can try flipping the switch and see if we're creating less duplicate jobs.
This shows the limited list of idempotent workers being deduplicated. Most notably: AuthorizedProjectsWorker gets a handful of duplicates; we'll need to see what this looks like during one of those peaks.
Marking this as ~"workflow::Done", but leaving it open for a while until we remove the feature flags; I'd keep them in place until the end of the week.
@rnienaber The feature flags are being removed in gitlab-org/gitlab!29116 (merged) but that hasn't hit production yet, which it needs to, so we can remove the flags from the database as well. That's why this is still open.