Make Cross Project Pipeline Status transition Resilient
Summary
We recently had a production incident in which duplicate child pipelines were created. This issue is about how we can improve our CI architecture to prevent it.
Problem Analysis/Reproduce the problem
Preparation
- You have the reverted code in your local GitLab instance.
- Project has the following pipeline configuration.
# .gitlab-ci.yml
parent_bridge:
  trigger:
    include: child.yml

# child.yml
job:
  script: echo
- We simulate a slow step in pipeline processing by adding sleep calls around the status update.
diff --git a/app/services/ci/pipeline_processing/legacy_processing_service.rb b/app/services/ci/pipeline_processing/legacy_processing_service.rb
index 278fba20283..dfb4717d3b7 100644
--- a/app/services/ci/pipeline_processing/legacy_processing_service.rb
+++ b/app/services/ci/pipeline_processing/legacy_processing_service.rb
@@ -19,7 +19,9 @@ module Ci
         success = process_dag_builds_without_needs || success if initial_process
         success = process_dag_builds_with_needs(trigger_build_ids) || success
+        sleep 10
         @pipeline.update_legacy_status
+        sleep 10
         success
       end
Consequence
This produces an enormous number of errors in the Sidekiq log.
Process analysis
When a new pipeline is created, the following steps happen:
- In CreateCrossProjectPipelineWorker, an upstream bridge creates a new pipeline.
- In Ci::CreatePipelineService, Pipeline::Chains build and create a child pipeline and a job.
- In Ci::CreatePipelineService, it also executes Ci::ProcessPipelineService.
- In LegacyProcessingService, it processes the builds first. It triggers PipelineUpdateWorker asynchronously via CommitStatus#schedule_stage_and_pipeline_update.
- Suppose that this PipelineUpdateWorker finishes at this point. The pipeline's status moves from created to pending. This also updates the bridge's status to success via pipeline.update_bridge_status!.
- In LegacyProcessingService, it runs @pipeline.update_legacy_status, which fails to update the pipeline status because the status is already pending. Most importantly, the state_machine of the pipeline adds an error to pipeline.errors.
- In CreateCrossProjectPipelineService, it executes @bridge.drop!(:downstream_pipeline_creation_failed), which fails with ActiveRecord::StaleObjectError because the bridge has already been updated above.
- CreateCrossProjectPipelineWorker is retried.
We can observe this error in https://sentry.gitlab.net/gitlab/gitlabcom/issues/1302952/?query=ActiveRecord::StaleObjectError.
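The stale write in the last steps can be illustrated with a minimal plain-Ruby sketch of optimistic locking. This is not GitLab or ActiveRecord code: BridgeStore, its lock_version counter, and the local StaleObjectError class are hypothetical stand-ins that only mimic the behaviour described above.

```ruby
# Plain-Ruby stand-in for optimistic locking; not GitLab or ActiveRecord code.
class StaleObjectError < StandardError; end

# Hypothetical in-memory "table" holding a single bridge row.
class BridgeStore
  attr_reader :row

  def initialize
    @row = { status: "pending", lock_version: 0 }
  end

  # Returns an independent copy, as each process would load its own object.
  def load
    @row.dup
  end

  # The write succeeds only if the copy's lock_version still matches the row.
  def save(copy, new_status)
    if copy[:lock_version] != @row[:lock_version]
      raise StaleObjectError, "bridge was updated by another process"
    end

    @row = { status: new_status, lock_version: @row[:lock_version] + 1 }
  end
end

store = BridgeStore.new
creator_copy = store.load # held by the service that creates the pipeline
updater_copy = store.load # held by the asynchronous status update

store.save(updater_copy, "success") # the asynchronous update wins the race

begin
  store.save(creator_copy, "failed") # analogous to the later @bridge.drop!
  outcome = :saved
rescue StaleObjectError
  outcome = :stale
end
```

In this sketch the second write is rejected because its copy was loaded before the first write bumped lock_version, which is the same shape as the failure that causes the worker to be retried.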
After Retry of CreateCrossProjectPipelineWorker
- In CreateCrossProjectPipelineWorker, an upstream bridge creates a new pipeline.
- In Ci::CreatePipelineService, Pipeline::Chains build and create a child pipeline and a job.
- In Ci::CreatePipelineService, it also executes Ci::ProcessPipelineService.
- In LegacyProcessingService, it processes the builds first. It triggers PipelineUpdateWorker asynchronously via CommitStatus#schedule_stage_and_pipeline_update.
- Suppose that this PipelineUpdateWorker finishes at this point. The pipeline's status moves from created to pending. This also tries to update the bridge's status to success via pipeline.update_bridge_status!. However, since the bridge is already success, this step raises an error. We can observe this error as Ci::Pipeline::BridgeStatusError.
- PipelineUpdateWorker is retried, although the retried worker could instead be CreateCrossProjectPipelineWorker, because the process is asynchronous.
We can observe this error in https://sentry.gitlab.net/gitlab/gitlabcom/?query=Ci%3A%3APipeline%3A%3ABridgeStatusError&statsPeriod=14d.
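The retry behaviour above can be sketched in plain Ruby as follows. This is a simplified model under stated assumptions, not the GitLab implementation: the Bridge struct, its success! method, and the local BridgeStatusError class are hypothetical, and each loop iteration stands for one run of the worker.

```ruby
# Simplified model of a retried worker; names here are hypothetical.
class BridgeStatusError < StandardError; end

Bridge = Struct.new(:status) do
  # Raises when the bridge is already in the target state,
  # mimicking a raising status transition.
  def success!
    if status == "success"
      raise BridgeStatusError, "cannot transition from #{status} to success"
    end

    self.status = "success"
  end
end

pipelines = []
errors = []
bridge = Bridge.new("pending")

# Two attempts: the original worker run and its retry.
2.times do |attempt|
  # Each attempt unconditionally creates a downstream pipeline.
  pipelines << "downstream-#{attempt + 1}"

  begin
    bridge.success!
  rescue BridgeStatusError => e
    errors << e.message
  end
end
```

After both attempts the model holds two downstream pipelines for a single bridge, and the second status update has raised, which matches the duplicate pipelines and the second error class seen in the incident.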
Proposal
- Make sure that CreateCrossProjectPipelineWorker doesn't create a duplicate pipeline when the Sidekiq worker is retried.
- Make sure that we don't raise an error in a state_machine transition, since that could disrupt the status transition of the entire pipeline and its builds. See https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/models/ci/build.rb#L333-337 as an example.
- Make sure that we have a centralized place to update the bridge status. In this context, we should also fix #198354 (closed) and #202239 (closed). Gitlab::OptimisticLocking might be needed as well to combat race conditions.
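One possible shape for these three proposals, as a plain-Ruby sketch only. Everything here is hypothetical and for illustration: DownstreamRegistry, transition_to_success, and retry_on_stale are not GitLab APIs, and retry_on_stale merely imitates the idea of retrying a stale write rather than reproducing Gitlab::OptimisticLocking.

```ruby
class StaleObjectError < StandardError; end

# Hypothetical registry: at most one downstream pipeline per bridge,
# so a retried worker gets the existing pipeline back.
class DownstreamRegistry
  def initialize
    @by_bridge = {}
  end

  def find_or_create(bridge_id)
    @by_bridge[bridge_id] ||= "pipeline-for-bridge-#{bridge_id}"
  end

  def count
    @by_bridge.size
  end
end

# Hypothetical non-raising transition: returns false instead of raising
# when the bridge is already in the target state.
def transition_to_success(bridge)
  return false if bridge[:status] == "success"

  bridge[:status] = "success"
  true
end

# Hypothetical helper: re-run the block a bounded number of times
# when a write turns out to be stale.
def retry_on_stale(max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StaleObjectError
    retry if attempts < max_attempts
    raise
  end
end

registry = DownstreamRegistry.new
first = registry.find_or_create(42)
second = registry.find_or_create(42) # simulated worker retry

bridge = { status: "pending" }
results = [transition_to_success(bridge), transition_to_success(bridge)]

attempts_used = retry_on_stale do |attempt|
  raise StaleObjectError if attempt == 1 # simulate one stale write

  attempt
end
```

The design choice in this sketch is to make every step safe to repeat: creation is keyed by the bridge, the transition reports "nothing to do" instead of raising, and stale writes are retried a bounded number of times in one place.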
Edited by Fabio Pitino