Backend: Stage play manual jobs leave some jobs in skipped state

Problem

When manual jobs are used, the stage-level play button can cause improper transitions and leave some subsequent jobs in a skipped state.

Technical details, as described by @furkanayhan:

  • The "Play All" button triggers Ci::PlayBuildService for each manual job in the stage.
  • They both call Ci::EnqueueJobService.
  • In ResetSkippedJobsService, we lock jobs one by one but I think there is a race condition because of simultaneous workers. Maybe we should lock all jobs instead of one by one.
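The difference between locking jobs one by one and locking them all at once can be sketched with a toy model. This is only an illustration under stated assumptions: `Job`, `build_jobs`, `reset_one_by_one`, and `reset_all_at_once` are hypothetical names, and in-memory mutexes stand in for whatever locking the real service uses. It is not the ResetSkippedJobsService implementation.

```ruby
# Toy model only: Job and the helpers below are hypothetical stand-ins,
# not GitLab classes. An in-memory Mutex plays the role of a per-job lock.
Job = Struct.new(:id, :status, :lock)

def build_jobs(count)
  (1..count).map { |id| Job.new(id, :skipped, Mutex.new) }
end

# One by one: each job is locked, updated, and unlocked separately, so a
# second worker can observe (and act on) the stage between two updates.
def reset_one_by_one(jobs)
  jobs.each do |job|
    job.lock.synchronize { job.status = :created if job.status == :skipped }
  end
end

# All at once: take every lock first (in a stable id order, so two workers
# cannot deadlock each other), update the whole batch, then release.
def reset_all_at_once(jobs)
  ordered = jobs.sort_by(&:id)
  begin
    ordered.each { |job| job.lock.lock }
    ordered.each { |job| job.status = :created if job.status == :skipped }
  ensure
    # Release only the locks this thread actually holds.
    ordered.reverse_each { |job| job.lock.unlock if job.lock.owned? }
  end
end

# Two simulated workers resetting the same stage concurrently.
jobs = build_jobs(5)
workers = 2.times.map { Thread.new { reset_all_at_once(jobs) } }
workers.each(&:join)
```

With the batch variant, a second worker either sees the stage before any job was reset or after all of them were, never a partially reset stage.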

More information is available in the following thread.

Investigation

Proposal

We are proceeding with the latest proposed solution. As part of this solution, it was also deemed necessary to update ResetSkippedJobsService to support multiple jobs as input by default, for performance reasons. This effort is being tracked in a separate issue: Backend: Update ResetSkippedJobsService to work... (#410223 - closed).

Additionally, adjacent to the current problem, we determined that it would be best to update the PipelineProcessWorker deduplication strategy from :until_executing to :until_executed, if_deduplicated: :reschedule_once. The purpose of this change is to:

  • Provide more clarity on pipeline processing.
  • Improve performance by reducing the number of jobs that start running and are then immediately dropped because they fail to obtain the lease in AtomicProcessService.execute.
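The effect of the two strategies can be sketched with a small single-threaded model. It assumes that :until_executing releases the deduplication lock when a job starts, that :until_executed holds it until the job finishes, and that :reschedule_once re-enqueues the job a single time if a duplicate was dropped in the meantime. `ToyQueue` and its methods are hypothetical and do not reproduce the Sidekiq or GitLab implementation.

```ruby
# Toy model only: ToyQueue is hypothetical, not GitLab's deduplication code.
class ToyQueue
  attr_reader :runs, :pending

  def initialize(strategy)
    @strategy = strategy # :until_executing or :until_executed
    @locked = {}         # dedup key => true while the lock is held
    @reschedule = {}     # dedup key => true if a duplicate was dropped
    @pending = []        # keys waiting to run
    @runs = []           # keys that have run, in order
  end

  # Returns true if the job was enqueued, false if it was deduplicated.
  def enqueue(key)
    if @locked[key]
      @reschedule[key] = true if @strategy == :until_executed
      return false
    end
    @locked[key] = true
    @pending << key
    true
  end

  # Runs the next pending job. The optional block simulates events that
  # happen while the job is executing (for example, duplicate enqueues).
  def run_next
    key = @pending.shift
    return unless key

    @locked.delete(key) if @strategy == :until_executing
    @runs << key
    yield if block_given?
    if @strategy == :until_executed
      @locked.delete(key)
      enqueue(key) if @reschedule.delete(key) # reschedule once
    end
    key
  end
end

# Three duplicates arrive while the first job for a pipeline is running.
def simulate(strategy)
  queue = ToyQueue.new(strategy)
  queue.enqueue(:pipeline)
  during_run = nil
  queue.run_next { during_run = Array.new(3) { queue.enqueue(:pipeline) } }
  [during_run, queue.pending.size]
end

executing_result = simulate(:until_executing)
executed_result = simulate(:until_executed)
```

In this model, :until_executing lets one duplicate into the queue while the first job is still running, which is the case where a second job could start and fail to obtain the lease. With :until_executed, every duplicate is dropped during the run and a single follow-up job is enqueued only after the first one finishes.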

Implementation

| Description | MR / Issue |
|---|---|
| Update PipelineProcessWorker deduplication strategy to until_executed | !115261 (merged) |
| [Feature flag] Roll out ci_pipeline_process_worker_dedup_until_executed | #397829 (closed) |
| Remove ci_pipeline_process_worker_dedup_until_executed feature flag | !120174 (merged) |
| (Prerequisite) Backend: Update ResetSkippedJobsService to work with multiple jobs as input | #410223 (closed) |
| Reset skipped jobs on new alive jobs during pipeline processing | !118269 (merged) |
| [Feature flag] Roll out ci_reset_skipped_jobs_in_atomic_processing | #410203 (closed) |

We can use the following logs to monitor the frequency of occurrence:

Kibana Sidekiq logs: https://log.gprd.gitlab.net/goto/fd383b80-0bae-11ee-a017-0d32180b1390 (Filtered by json.message: "Running ResetSkippedJobsService on new alive jobs")

Edited by Leaminn Ma