Fix handling of DAGs within a stage in AtomicProcessingService

Stan Hu requested to merge sh-fix-dags-within-stage into master

What does this MR do and why?

Suppose you had a .gitlab-ci.yml with a manual job listed after the job that needs it:

test1:
  needs: [manual1]
  script: exit 0

manual1:
  stage: test
  when: manual
  script: exit 0

Previously the test1 job would be stuck in the created state instead of being skipped, because AtomicProcessingService did not consider a DAG within a stage. test1 only gets updated to skipped if Ci::ProcessBuildService is called on manual1 first, but since test1 is listed first, AtomicProcessingService would usually call Ci::ProcessBuildService on it first. The update order previously depended on the order returned by the database in StatusCollection#all_jobs (https://gitlab.com/gitlab-org/gitlab/-/blob/3d7a5af7ce7b18c32c0a1e893f05f9c7bc6b4b92/app/services/ci/pipeline_processing/atomic_processing_service/status_collection.rb#L109-125).

This commit introduces a ci_atomic_processing_ordered_update_stage feature flag that performs a topological sort on all jobs in the stage and updates the jobs in that order. This is similar to the approach taken in ResetSkippedJobsService.
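For illustration, here is a minimal sketch of that ordering using Ruby's stdlib TSort. The class and data below are hypothetical stand-ins, not the actual service code:

require 'tsort'

# Sketch: order the jobs in a stage so that each job's needs are
# processed before the job itself.
class StageJobGraph
  include TSort

  def initialize(needs_by_job)
    # Hash of job name => array of job names it needs
    @needs_by_job = needs_by_job
  end

  def tsort_each_node(&block)
    @needs_by_job.each_key(&block)
  end

  def tsort_each_child(job, &block)
    @needs_by_job.fetch(job, []).each(&block)
  end
end

graph = StageJobGraph.new('test1' => ['manual1'], 'manual1' => [])
graph.tsort # => ["manual1", "test1"]: manual1 is processed first, so
            # test1 can be marked as skipped in the same pass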

Relates to #450395 (closed)

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

How to set up and validate locally

  1. Enable the ci_atomic_processing_ordered_update_stage feature flag: in bin/rails console, run Feature.enable(:ci_atomic_processing_ordered_update_stage).
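     To confirm the flag took effect, Feature.enabled?(:ci_atomic_processing_ordered_update_stage) should return true in the same console.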
  2. On this branch, create a .gitlab-ci.yml:
manual1:
  stage: test
  when: manual
  script: exit 0

test1:
  stage: test
  needs: [manual1]
  script: exit 0
  3. Run the pipeline and see that it is skipped.

  4. Now switch the order:

test1:
  stage: test
  needs: [manual1]
  script: exit 0

manual1:
  stage: test
  when: manual
  script: exit 0
  5. The pipeline should also be skipped.

  6. For extra credit, create a really large DAG with this script:

# Build a chain of 201 jobs: test0 through test199 each need the next
# job, and the final job, test200, is manual. Each job is listed before
# the job it needs, reproducing the ordering bug at scale.
jobs = []

200.times do |index|
  jobs << { name: "test#{index}", needs: "test#{index + 1}" }
end

jobs << { name: "test200", when: "manual" }

# Print the generated .gitlab-ci.yml to stdout.
jobs.each do |job|
  puts "#{job[:name]}:"
  puts "  needs: [#{job[:needs]}]" if job[:needs]
  puts "  when: #{job[:when]}" if job[:when]
  puts "  script: exit 0"
end
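To generate the file that the diff in step 8 reads, save the script (as, say, dag.rb — a name chosen here for illustration) and run ruby dag.rb > /tmp/dag.yml.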
  7. Use the generated configuration in an actual pipeline.

  8. In addition, you can apply this diff, which reads /tmp/dag.yml, to check that the SQL queries in log/test.log look similar and to profile the memory usage written to /tmp/profile.txt:

diff --git a/spec/services/ci/pipeline_processing/atomic_processing_service_spec.rb b/spec/services/ci/pipeline_processing/atomic_processing_service_spec.rb
index 81f21ed70531..c2ca5d62d0d8 100644
--- a/spec/services/ci/pipeline_processing/atomic_processing_service_spec.rb
+++ b/spec/services/ci/pipeline_processing/atomic_processing_service_spec.rb
@@ -928,6 +928,28 @@ def event_on_pipeline(event)
       end
     end
 
+    context 'with large DAG' do
+      let(:config) { File.read('/tmp/dag.yml') }
+
+      let(:pipeline) do
+        Ci::CreatePipelineService.new(project, user, { ref: 'master' }).execute(:push).payload
+      end
+
+      before do
+        stub_feature_flags(ci_atomic_processing_ordered_update_stage: true)
+        stub_ci_pipeline_yaml_file(config)
+        Rails.logger.info "=== before process"
+        require 'memory_profiler'
+        report = MemoryProfiler.report { process_pipeline }
+        Rails.logger.info "=== after process"
+        output = File.open('/tmp/profile.txt','w')
+        report.pretty_print(output)
+      end
+
+      it 'does something' do
+        expect(all_builds_names_and_statuses).to eq({})
+      end
+    end
+
     context 'when dependent jobs are listed after job needs' do
       let(:feature_flag_state) { true }
       let(:config) do

The memory profile shows we allocate a lot of RAM logging queries and backtraces! 😄
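For reference, the profile was produced by running the spec file with the diff applied (for example with bin/rspec spec/services/ci/pipeline_processing/atomic_processing_service_spec.rb); the before block processes the pipeline under MemoryProfiler and writes the report to /tmp/profile.txt.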

profile.txt
