Auto Job Retry is performed in a DB transaction and subject to lock contention
Problem
Originally this problem was discovered in #341100 (closed).
Currently, when auto-retry happens in a pipeline, it executes two operations in the same database transaction:
- Update the current job to the failed status.
- Create a new job by copying the attributes from the previous job, i.e. Ci::RetryBuildService.
Reference: https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/models/ci/build.rb#L360-370
    after_transition any => [:failed] do |build|
      next unless build.project

      if build.auto_retry_allowed?
        begin
          Ci::Build.retry(build, build.user)
        rescue Gitlab::Access::AccessDeniedError => ex
          Gitlab::AppLogger.error "Unable to auto-retry job #{build.id}: #{ex}"
        end
      end
    end
This is intended to prevent PipelineProcessWorker from accidentally marking the pipeline status as failed. However, RetryBuildService is getting complicated and issues a large number of queries to PostgreSQL and Redis. According to Kibana, we see around 100 PG queries and 35 Redis calls, all happening in one transaction. This is subject to lock contention, which can slow down reads and writes on PostgreSQL or, in the worst case, cause a deadlock. For example, we currently perform DB transaction -> Exclusive Lock -> DB transaction.
Proposal
Use AfterCommitQueue#run_after_commit_or_now to execute the expensive operations from retrying a job. With run_after_commit_or_now, if the retry happens from a transition within a transaction, the operations will be delayed until the transaction is committed. If it's triggered by the retry endpoints, it will be executed synchronously. PoC: #390638 (comment 1302384208)
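The "defer until commit, or run now" behaviour can be sketched with a toy model. This is only an illustration under stated assumptions: ToyTransaction and its methods are hypothetical stand-ins written for this example, not GitLab's AfterCommitQueue or ActiveRecord, and no real database is involved.

```ruby
# Toy model only: NOT GitLab's AfterCommitQueue. It just mimics the
# "run after commit if inside a transaction, otherwise run now" idea.
class ToyTransaction
  @depth = 0        # current transaction nesting level
  @callbacks = []   # work deferred until the outermost commit

  class << self
    def open?
      @depth > 0
    end

    # Runs the block "inside a transaction". Deferred callbacks fire only
    # after the outermost block finishes without raising (the "commit").
    def transaction
      @depth += 1
      begin
        yield
      ensure
        @depth -= 1
      end
      flush if @depth.zero?
    rescue StandardError
      @callbacks.clear if @depth.zero? # "rollback": drop deferred work
      raise
    end

    # Defers the block while a transaction is open; otherwise runs it now.
    def run_after_commit_or_now(&block)
      if open?
        @callbacks << block
      else
        block.call
      end
    end

    private

    def flush
      pending = @callbacks
      @callbacks = []
      pending.each(&:call)
    end
  end
end

events = []

# Retry triggered by a status transition inside a transaction: deferred.
ToyTransaction.transaction do
  events << :job_marked_failed
  ToyTransaction.run_after_commit_or_now { events << :retry_executed }
  events << :still_in_transaction
end

# Retry triggered outside any transaction (e.g. an endpoint): runs at once.
ToyTransaction.run_after_commit_or_now { events << :retry_from_endpoint }

p events
```

In this sketch the expensive retry work is recorded after everything that happened inside the transaction block, so it no longer extends the time the transaction stays open.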
Previous proposal
NOTE: This approach should be examined by domain experts. There might be a better way.
- Add a column ci_pipelines.being_retried (boolean) and update it to true when build.drop happens and auto_retry_allowed? == true.
- In PipelineProcessing::AtomicProcessingService, update_stages! and update_pipeline! exclude the failed job status from the calculation if being_retried == true.
- Execute Ci::RetryJobService.new(build.project, build.user).execute(build) in run_after_commit. This allows the retry service to run outside of the transaction of build.drop.
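The second step above could look roughly like the following. This is a simplified, hypothetical sketch: Job and pipeline_status are invented for illustration, the real calculation in PipelineProcessing::AtomicProcessingService is more involved, and reporting :running while a retry is pending is an assumption of this example rather than something stated in the proposal.

```ruby
# Hypothetical stand-in for a CI job; only the status matters here.
Job = Struct.new(:name, :status)

# Simplified status calculation. When being_retried is true, failed jobs
# are excluded so the pipeline is not marked as failed prematurely.
def pipeline_status(jobs, being_retried:)
  statuses = jobs.map(&:status)

  # Assumption: a failed job awaiting retry keeps the pipeline "running".
  retry_pending = being_retried && statuses.include?(:failed)
  statuses -= [:failed] if being_retried

  return :failed if statuses.include?(:failed)
  return :running if retry_pending || statuses.any? { |s| %i[running pending].include?(s) }

  :success
end

jobs = [Job.new("build", :success), Job.new("test", :failed)]

p pipeline_status(jobs, being_retried: false)
p pipeline_status(jobs, being_retried: true)
```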
Please see !108659 (closed) for more information.
At any rate, we should use a feature flag to roll out the change gradually.