
Geo: RegistryConsistencyWorker may not be reenqueuing itself

Follow up from !48031 (comment 450226259):

I think what I'm seeing is perform_async in Reenqueuer never schedules a job? If that's true, then initial backfill in Geo will take a very long time for GitLab versions that deduplicate jobs by default.

I think Reenqueuer is idempotent, but it is not necessarily droppable. Should we remove the idempotent! from Reenqueuer?

Problem

Automatic deduplication was added at some point for jobs marked idempotent!. It appears to interact with the Reenqueuer logic such that a perform_async call made from Reenqueuer doesn't actually schedule a job.

Regarding severity:

By chance, the RegistryConsistencyWorker batch size was set to 10k for unrelated reasons a while back, so for many customers the backfill rate with this bug would probably be acceptable. For example, 6M records would be created in 6M rows / 10k rows per min / 60 min per hour = 10 hours, which seems to be within the same order of magnitude as the time sync job schedulers would need to run sync jobs (e.g. syncing a project repo) for each record anyway.
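The estimate above can be checked with a few lines of Ruby. This is only a back-of-the-envelope sketch: backfill_hours is a hypothetical helper, not GitLab code, and it assumes exactly one batch is processed per minute, as in the figures quoted above.

```ruby
# Hypothetical helper for illustration only, not part of GitLab.
# Assumes one batch of `batch_size` rows is processed per minute.
def backfill_hours(total_rows:, batch_size:, batches_per_minute: 1)
  minutes = total_rows.to_f / (batch_size * batches_per_minute)
  minutes / 60
end

puts backfill_hours(total_rows: 6_000_000, batch_size: 10_000) # prints 10.0
```

If Reenqueuer does reenqueue itself, the effective batches_per_minute would be higher and the backfill correspondingly shorter.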

So the problem, if it exists, may not be overly severe.

Note that I was unable to reproduce this problem locally: RegistryConsistencyWorker reenqueued itself when perform returned true. Still, I think it's worth disabling deduplication for Reenqueuer so that reenqueuing works by design rather than by coincidence.

Proposal

Add deduplicate :none to Reenqueuer and add tests to the reenqueuer_shared_examples to ensure this for all usages.
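To illustrate why disabling deduplication could matter, here is a toy model of the suspected interaction. It is not GitLab's or Sidekiq's implementation: ToyQueue, ToyWorker, run, and the :held_while_running strategy are all hypothetical names, and the model assumes a deduplication key stays held while the job is running. Whether the real middleware behaves this way is exactly what the smoke test should establish.

```ruby
require "set"

# Toy in-memory queue. NOT GitLab's or Sidekiq's implementation.
class ToyQueue
  attr_reader :scheduled

  def initialize
    @scheduled = []   # every job that was actually accepted
    @pending = []     # accepted jobs that have not run yet
    @locks = Set.new  # deduplication keys currently held
  end

  # Returns true if the job was accepted, false if dropped as a duplicate.
  def perform_async(worker, *args)
    key = [worker.name, args]
    unless worker.strategy == :none
      return false if @locks.include?(key)
      @locks << key
    end
    @pending << [worker, args, key]
    @scheduled << key
    true
  end

  def drain
    until @pending.empty?
      worker, args, key = @pending.shift
      worker.new.perform(self, *args)
      @locks.delete(key) # assumption: key is held until the job finishes
    end
  end
end

# Hypothetical worker that reenqueues itself while batches remain.
class ToyWorker
  class << self
    attr_accessor :strategy, :batches_left
  end

  def perform(queue)
    self.class.batches_left -= 1
    queue.perform_async(self.class) if self.class.batches_left > 0
  end
end

# Runs the toy worker and returns how many jobs were accepted in total.
def run(strategy, batches)
  ToyWorker.strategy = strategy
  ToyWorker.batches_left = batches
  queue = ToyQueue.new
  queue.perform_async(ToyWorker)
  queue.drain
  queue.scheduled.size
end

puts run(:held_while_running, 3) # prints 1
puts run(:none, 3)               # prints 3
```

In this model, the self-reenqueue is dropped whenever the running job still holds its key, so only the first job ever runs; with :none every batch gets its own job. A shared example for Reenqueuer could assert the analogous property against the real worker classes.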

To do

  • Smoke test to verify this problem and then the fix
  • Check if this affects LimitedCapacity::Worker in some way as well (maybe not since it uses bulk_perform_async)
Edited by Michael Kozono