Race Condition or Network Issues in CreateDownstreamPipelineService Cause Bridge to Run Indefinitely
Description:
In the CreateDownstreamPipelineService class, a race condition, network congestion, or a Sidekiq failure can leave the bridge job stuck indefinitely. The service currently moves the bridge to "running" with @bridge.run before it creates the downstream pipeline and updates the bridge status. If the downstream pipeline is never created, for example because of a network issue or a Sidekiq failure, nothing moves the bridge out of the "running" state. The bridge also cannot be retried, because we can't change the state of a job that is already running. The current code:
return ServiceResponse.error(message: 'Can not run the bridge') unless @bridge.run

service = ::Ci::CreatePipelineService.new(
  pipeline_params.fetch(:project),
  current_user,
  pipeline_params.fetch(:target_revision)
)

downstream_pipeline = service
  .execute(pipeline_params.fetch(:source), **pipeline_params[:execute_params])
  .payload

log_downstream_pipeline_creation(downstream_pipeline)
update_bridge_status!(@bridge, downstream_pipeline)
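To make the failure mode concrete, here is a minimal, self-contained sketch of the same ordering. FakeBridge, FAILING_CREATOR, and run_then_create are hypothetical stand-ins invented for illustration, not GitLab classes, and the simulated IOError is only an assumption about how a network or Sidekiq failure might surface.

```ruby
# Hypothetical stand-in for a bridge job; it only tracks a status string.
class FakeBridge
  attr_reader :status

  def initialize
    @status = 'pending'
  end

  # Mimics a state-machine transition: only a pending bridge can start running.
  def run
    return false unless @status == 'pending'

    @status = 'running'
    true
  end
end

# Hypothetical creator that always fails, simulating a network or Sidekiq error.
FAILING_CREATOR = -> { raise IOError, 'simulated network failure' }

# Same order as the current service: transition first, then create.
def run_then_create(bridge, creator)
  return :cannot_run unless bridge.run

  creator.call
  :ok
end

bridge = FakeBridge.new
begin
  run_then_create(bridge, FAILING_CREATOR)
rescue IOError
  # Nothing resets the bridge after the failure.
end

puts bridge.status                             # running
puts run_then_create(bridge, FAILING_CREATOR)  # cannot_run
```

In this sketch the bridge stays in "running" after the failed creation, and the second attempt is rejected before the creator is even called.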
Potential Solution:
The order of operations should be rearranged: first confirm that the downstream pipeline was created successfully, and only then change the bridge status. If creation fails, the service can return an explicit error while the bridge has not yet been moved to "running", so it does not run indefinitely.
service = ::Ci::CreatePipelineService.new(
  pipeline_params.fetch(:project),
  current_user,
  pipeline_params.fetch(:target_revision)
)

# Create the downstream pipeline before touching the bridge state.
downstream_pipeline = service
  .execute(pipeline_params.fetch(:source), **pipeline_params[:execute_params])
  .payload

if downstream_pipeline.created_successfully?
  # Only now transition the bridge to "running".
  unless @bridge.run
    return ServiceResponse.error(message: 'Can not run the bridge')
  end

  log_downstream_pipeline_creation(downstream_pipeline)
  update_bridge_status!(@bridge, downstream_pipeline)
else
  # The bridge has not been transitioned, so it is not left running.
  return ServiceResponse.error(message: 'Downstream pipeline creation failed')
end
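The effect of the reordering can be checked with a small standalone sketch. FakePipeline, FakeBridge, and create_then_run are hypothetical names used only for illustration; the real Ci::Pipeline and bridge objects carry far more state, and this sketch assumes a failed creation is reported through created_successfully? rather than through an exception.

```ruby
# Hypothetical stand-in for a pipeline; persisted is true when creation worked.
FakePipeline = Struct.new(:persisted) do
  def created_successfully?
    persisted
  end
end

# Hypothetical stand-in for a bridge job; it only tracks a status string.
class FakeBridge
  attr_reader :status

  def initialize
    @status = 'pending'
  end

  # Only a pending bridge can start running.
  def run
    return false unless @status == 'pending'

    @status = 'running'
    true
  end
end

# Create first, transition second, mirroring the reordered flow.
def create_then_run(bridge, creator)
  pipeline = creator.call
  return :creation_failed unless pipeline.created_successfully?
  return :cannot_run unless bridge.run

  :ok
end

bridge = FakeBridge.new
first_attempt = create_then_run(bridge, -> { FakePipeline.new(false) })
status_after_failure = bridge.status
second_attempt = create_then_run(bridge, -> { FakePipeline.new(true) })

puts first_attempt         # creation_failed
puts status_after_failure  # pending
puts second_attempt        # ok
puts bridge.status         # running
```

Because the bridge is still "pending" after the failed attempt, a later attempt can succeed and only then move it to "running".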
cc:@mbobin