Only create join record once the build is assigned to a runner
What does this MR do and why?
Describe in detail what your merge request does and why.
This MR fixes the race condition that caused an S2 incident recently. The logic is behind a FF. The problem was caused by the fact that build.runner_machine = runner_machine
creates the join record immediately and not when the build
model is saved. This can cause a race condition if there are multiple runners competing for the same job. The first one will create the p_ci_runner_machine_builds
join record, and the others will cause duplicate key constraint violation errors. By creating the join record only when the runner has been positively assigned the job, we sidestep the issue.
Note: the specs diff should be viewed in a tool that can ignore whitespace differences, as most of the changes are indentation-related.
Closes #396852 (closed)
Screenshots or screen recordings
Screenshots are required for UI changes, and strongly recommended for all other merge requests.
How to set up and validate locally
Numbered steps to set up and validate the change are strongly suggested.
Running the spec against the old code fails tests
......................................................F.............F...................
Failures:
1) Ci::RegisterJobService#execute when using pending builds table allow shared runners when build owner has been blocked with runner machine specified does not pick the build and does not create join record
Failure/Error: expect(Ci::RunnerMachineBuild.all).to be_empty
expected `#<ActiveRecord::Relation [#<Ci::RunnerMachineBuild partition_id: 100, build_id: 105, runner_machine_id: 1>]>.empty?` to be truthy, got false
Shared Example Group: "handles runner assignment" called from ./spec/services/ci/register_job_service_spec.rb:855
# ./spec/services/ci/register_job_service_spec.rb:181:in `block (7 levels) in <module:Ci>'
# ./spec/spec_helper.rb:502:in `block (3 levels) in <top (required)>'
# ./spec/support/sidekiq_middleware.rb:18:in `with_sidekiq_server_middleware'
# ./spec/spec_helper.rb:494:in `block (2 levels) in <top (required)>'
# ./spec/spec_helper.rb:490:in `block (3 levels) in <top (required)>'
# ./lib/gitlab/application_context.rb:59:in `with_raw_context'
# ./spec/spec_helper.rb:490:in `block (2 levels) in <top (required)>'
# ./spec/spec_helper.rb:242:in `block (2 levels) in <top (required)>'
# ./spec/support/system_exit_detected.rb:7:in `block (2 levels) in <main>'
# ./spec/support/database/prevent_cross_joins.rb:108:in `block (3 levels) in <main>'
# ./spec/support/database/prevent_cross_joins.rb:62:in `with_cross_joins_prevented'
# ./spec/support/database/prevent_cross_joins.rb:108:in `block (2 levels) in <main>'
2) Ci::RegisterJobService#execute when using pending builds table deleted projects for project runner with runner machine specified does not pick a build
Failure/Error: expect(Ci::RunnerMachineBuild.all).to be_empty
expected `#<ActiveRecord::Relation [#<Ci::RunnerMachineBuild partition_id: 100, build_id: 119, runner_machine_id: 3>]>.empty?` to be truthy, got false
Shared Example Group: "handles runner assignment" called from ./spec/services/ci/register_job_service_spec.rb:855
# ./spec/services/ci/register_job_service_spec.rb:146:in `block (7 levels) in <module:Ci>'
# ./spec/spec_helper.rb:502:in `block (3 levels) in <top (required)>'
# ./spec/support/sidekiq_middleware.rb:18:in `with_sidekiq_server_middleware'
# ./spec/spec_helper.rb:494:in `block (2 levels) in <top (required)>'
# ./spec/spec_helper.rb:490:in `block (3 levels) in <top (required)>'
# ./lib/gitlab/application_context.rb:59:in `with_raw_context'
# ./spec/spec_helper.rb:490:in `block (2 levels) in <top (required)>'
# ./spec/spec_helper.rb:242:in `block (2 levels) in <top (required)>'
# ./spec/support/system_exit_detected.rb:7:in `block (2 levels) in <main>'
# ./spec/support/database/prevent_cross_joins.rb:108:in `block (3 levels) in <main>'
# ./spec/support/database/prevent_cross_joins.rb:62:in `with_cross_joins_prevented'
# ./spec/support/database/prevent_cross_joins.rb:108:in `block (2 levels) in <main>'
Finished in 1 minute 1.65 seconds (files took 38.18 seconds to load)
88 examples, 2 failures
Failed examples:
rspec ./spec/services/ci/register_job_service_spec.rb:178 # Ci::RegisterJobService#execute when using pending builds table allow shared runners when build owner has been blocked with runner machine specified does not pick the build and does not create join record
rspec ./spec/services/ci/register_job_service_spec.rb:142 # Ci::RegisterJobService#execute when using pending builds table deleted projects for project runner with runner machine specified does not pick a build
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.