Skip to content

Fix registry race condition

Steve Abrams requested to merge 345598-registry-race-condition into master

🐬 Context

There are times when users face a race condition when pushing a new image to the registry. When an image is pushed, multiple requests are made to the registry, and with each request, a ContainerRepository record is created if one does not exist. We previously tried to solve this by introducing safe_find_or_create_by! when a new repository record is created, however, that did not fix the problem. 😞

The problem is we have a uniqueness model constraint on [:project_id, :name]. safe_find_or_create_by! runs an additional record.validate!, which will throw a ActiveRecord::RecordInvalid error. If we switch to safe_find_or_create_by, we skip the validation, and an attempt to create a duplicate record will hit the UNIQUE constraint at the database level. This will throw a ActiveRecord::RecordNotUnique which safe_find_or_create_by does rescue.

🎱 What does this MR do and why?

To fix the problem, we replace .safe_find_or_create_by! with .safe_find_or_create_by which then avoids the model validation errors created by the race condition.

I've also renamed the method being updated since it's behavior also includes finding.

To test that the race condition is gone, I've added a test based on the script in the below section How to set up and validate locally. We don't have an expectation that tests an exception does not occur, however, if an exception is raised does occur, the test will fail.

📽 Screenshots or screen recordings

See section below for the script being used to test this update.

Before the change:

→ be rake gitlab:packages:container_repository_race_condition
Done
Done
Done
Done
Done
Done
Done
Kaboom! #<ActiveRecord::RecordInvalid: Validation failed: Name has already been taken>
Kaboom! #<ActiveRecord::RecordInvalid: Validation failed: Name has already been taken>
Kaboom! #<ActiveRecord::RecordInvalid: Validation failed: Name has already been taken>

After the change:

→ be rake gitlab:packages:container_repository_race_condition
Done
Done
Done
Done
Done
Done
Done
Done
Done
Done

🖍 How to set up and validate locally

Unfortunately, while the race condition is somewhat easily reproduced on GitLab.com, it is not easily reproduced locally since the environment is simpler (no load balancing, distributed instances, etc...).

Using the script laid out in https://blog.arkency.com/2015/09/testing-race-conditions/ (thanks to @10io for finding this and coming up with the script below!), we are able to recreate the race condition calling the method in question directly.

  1. Set up a script in lib/tasks/gitlab/packages/race_condition.rake (or wherever you'd like to run the script from):

    # frozen_string_literal: true
    
    desc "GitLab | Packages | Container Repository race condition"
    namespace :gitlab do
      namespace :packages do
        task container_repository_race_condition: :environment do
          project = Project.find(1)
          path = project.full_path + '/some/image'
          path = ContainerRegistry::Path.new(path)
    
          wait_for_it = true
    
          threads = Array.new(10) do |i|
            Thread.new do
              true while wait_for_it
    
              begin
                ::ContainerRepository.create_from_path!(path)
                puts 'Done'
              rescue => e
                puts "Kaboom! #{e.inspect}"
              end
            end
          end
          wait_for_it = false
          threads.each(&:join)
        end
      end
    end
  2. Ensure you have a Project with an ID of 1.

  3. Run the script: bundle exec rake gitlab:packages:container_repository_race_condition

  4. You should only see Done in the output, Kaboom! should not appear.

  5. If you want to re-run the script, you will need to either delete the container repository that was created (ContainerRepository.last.destroy), or change the path in the script before re-running.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #345598 (closed)

Edited by Steve Abrams

Merge request reports