
Make Cross Project Pipeline Status transition Resilient

Summary

We recently had a production incident in which duplicate child pipelines were created. This issue is about how we can improve our CI architecture.

Problem Analysis/Reproduce the problem

Preparation

  • You have the reverted code in your local GitLab instance.
  • Project has the following pipeline configuration.
# .gitlab-ci.yml

parent_bridge:
  trigger:
    include: child.yml

# child.yml

job:
  script: echo
  • We simulate a slow process by adding sleep calls around the pipeline status update.
diff --git a/app/services/ci/pipeline_processing/legacy_processing_service.rb b/app/services/ci/pipeline_processing/legacy_processing_service.rb
index 278fba20283..dfb4717d3b7 100644
--- a/app/services/ci/pipeline_processing/legacy_processing_service.rb
+++ b/app/services/ci/pipeline_processing/legacy_processing_service.rb
@@ -19,7 +19,9 @@ module Ci
         success = process_dag_builds_without_needs || success if initial_process
         success = process_dag_builds_with_needs(trigger_build_ids) || success
 
+        sleep 10
         @pipeline.update_legacy_status
+        sleep 10
 
         success
       end

Consequence

This produces an enormous number of errors in the Sidekiq log.

Process analysis

When a new pipeline is created, the following steps happen.

  • In CreateCrossProjectPipelineWorker, an upstream bridge creates a new pipeline.
  • In Ci::CreatePipelineService, Pipeline::Chains build and create a child pipeline and a job.
  • In Ci::CreatePipelineService, it also executes Ci::ProcessPipelineService.
  • In LegacyProcessingService, the builds are processed first. This triggers PipelineUpdateWorker asynchronously via CommitStatus#schedule_stage_and_pipeline_update.
  • Suppose that this PipelineUpdateWorker finishes at this point. The pipeline's status moves from created to pending. This also updates the bridge's status to success via pipeline.update_bridge_status!.
  • In LegacyProcessingService, @pipeline.update_legacy_status then runs. It fails to update the pipeline status because the status is already pending and, most importantly, the pipeline's state_machine adds an error to pipeline.errors.
  • In CreateCrossProjectPipelineService, it executes @bridge.drop!(:downstream_pipeline_creation_failed), which fails due to ActiveRecord::StaleObjectError because the bridge has already been updated above.
  • CreateCrossProjectPipelineWorker is retried.

We can observe this error in https://sentry.gitlab.net/gitlab/gitlabcom/issues/1302952/?query=ActiveRecord::StaleObjectError.
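
The stale-object failure in the steps above can be imitated in isolation. The following is a minimal sketch, assuming a simple version counter in place of real optimistic locking. FakeStore, FakeBridge, and the StaleObjectError defined here are hypothetical stand-ins, not GitLab or ActiveRecord classes, and the initial :pending status is chosen only for illustration.

```ruby
# Illustrative stand-ins only; none of these classes exist in GitLab.
class StaleObjectError < StandardError; end

# A shared "database row" with a version counter, as optimistic locking uses.
class FakeStore
  attr_reader :status, :lock_version

  def initialize
    @status = :pending # assumed starting status, for illustration
    @lock_version = 0
  end

  # A write succeeds only if the caller saw the latest version of the row.
  def update(status, expected_version)
    unless expected_version == @lock_version
      raise StaleObjectError, "expected v#{expected_version}, row is v#{@lock_version}"
    end

    @status = status
    @lock_version += 1
  end
end

# An in-memory copy of the row, loaded at some point in time.
class FakeBridge
  def initialize(store)
    @store = store
    @lock_version = store.lock_version
  end

  def transition_to(status)
    @store.update(status, @lock_version)
    @lock_version = @store.lock_version
  end
end

store = FakeStore.new
bridge_in_create_worker = FakeBridge.new(store) # copy held by the creating worker
bridge_in_update_worker = FakeBridge.new(store) # copy held by the status-update worker

# The update worker finishes first and marks the bridge as successful.
bridge_in_update_worker.transition_to(:success)

# The creating worker then tries to drop the bridge using its stale copy.
begin
  bridge_in_create_worker.transition_to(:failed)
  drop_result = :dropped
rescue StaleObjectError
  drop_result = :stale
end
```

In this sketch the second write is rejected because its copy was loaded before the first write, which mirrors why the drop! call fails and the worker is retried.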

After Retry of CreateCrossProjectPipelineWorker

  • In CreateCrossProjectPipelineWorker, an upstream bridge creates a new pipeline.
  • In Ci::CreatePipelineService, Pipeline::Chains build and create a child pipeline and a job.
  • In Ci::CreatePipelineService, it also executes Ci::ProcessPipelineService.
  • In LegacyProcessingService, the builds are processed first. This triggers PipelineUpdateWorker asynchronously via CommitStatus#schedule_stage_and_pipeline_update.
  • Suppose that this PipelineUpdateWorker finishes at this point. The pipeline's status moves from created to pending. This also tries to update the bridge's status to success via pipeline.update_bridge_status!. However, since the bridge is already success, this raises an error. We can observe this error as Ci::Pipeline::BridgeStatusError.
  • PipelineUpdateWorker is retried, although the retried worker could instead be CreateCrossProjectPipelineWorker, depending on the timing of the asynchronous processes.

We can observe this error in https://sentry.gitlab.net/gitlab/gitlabcom/?query=Ci%3A%3APipeline%3A%3ABridgeStatusError&statsPeriod=14d.

Proposal

  • Make sure that CreateCrossProjectPipelineWorker doesn't create a duplicate pipeline when the Sidekiq worker is retried.
  • Make sure that we don't raise an error in a state_machine transition, since that could disrupt the entire pipeline/build status transition. See https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/models/ci/build.rb#L333-337 as an example.
  • Make sure that we have a centralized place to update the bridge status. In this context, we should also fix #198354 (closed) and #202239 (closed). Gitlab::OptimisticLocking might also be needed to combat race conditions.
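
As a rough illustration of the first and third points, the sketch below shows one way a retried worker could reuse an existing downstream pipeline, and how a single status-update method could treat a repeated transition as a no-op rather than an error. Bridge, DownstreamPipelineCreator, update_status, and downstream_pipeline_id are hypothetical names for this example and are not the GitLab implementation. A real solution would likely also need a database-level guard against races between separate processes.

```ruby
# Hypothetical sketch; class and method names are illustrative, not GitLab APIs.
class Bridge
  attr_reader :id, :status
  attr_accessor :downstream_pipeline_id

  def initialize(id)
    @id = id
    @status = :pending
    @downstream_pipeline_id = nil
  end

  # Single place to change the status. Repeating the current status
  # returns false instead of raising, so a retry cannot fail here.
  def update_status(new_status)
    return false if @status == new_status

    @status = new_status
    true
  end
end

class DownstreamPipelineCreator
  attr_reader :created_count

  def initialize
    @created_count = 0
    @mutex = Mutex.new
  end

  # Safe to call again on a retry: an existing downstream pipeline is reused.
  def execute(bridge)
    @mutex.synchronize do
      return bridge.downstream_pipeline_id if bridge.downstream_pipeline_id

      @created_count += 1
      bridge.downstream_pipeline_id = @created_count
    end
  end
end

creator = DownstreamPipelineCreator.new
bridge = Bridge.new(1)

first_pipeline_id = creator.execute(bridge)   # creates the downstream pipeline
retried_pipeline_id = creator.execute(bridge) # simulated retry, reuses it

first_update = bridge.update_status(:success)  # changes the status
second_update = bridge.update_status(:success) # repeated update is a no-op
```

The Mutex only protects threads inside one process, so it stands in for whatever locking or uniqueness constraint the real fix would use.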
Edited by Fabio Pitino