Batch database updates for cancel_build to improve performance at scale

Problem

Currently, Ci::CancelPipelineService#cancel_jobs processes job cancellations individually using the state machine event. For pipelines with many jobs (e.g., 1000+ cancelable jobs), this approach could be optimized by batching database updates instead of processing each job separately.

Background

Related to !199937 (merged) and #382065 (closed)

In the current implementation, we iterate through jobs and call cancel_job on each one individually, which triggers the state machine event. While this ensures proper state transitions, every job is loaded, transitioned, and saved separately, so the work grows linearly with the number of cancelable jobs.
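
To make the cost concrete, here is a minimal plain-Ruby sketch of the per-job pattern. It is not the actual service code: `FakeJobStore` is a hypothetical stand-in for the jobs table that only counts write statements, and the real state machine event does more than a single write.

```ruby
# Hypothetical stand-in for the jobs table. It only counts how many
# write statements are issued; it is not GitLab code.
class FakeJobStore
  attr_reader :write_count, :rows

  def initialize(ids)
    @rows = ids.to_h { |id| [id, 'pending'] }
    @write_count = 0
  end

  # Models one UPDATE statement that may touch any number of rows.
  def update_all(ids, status)
    @write_count += 1
    ids.each { |id| @rows[id] = status }
  end
end

store = FakeJobStore.new(1..1000)

# Per-job cancellation: each job gets its own write, mirroring a
# state machine event fired on every record individually.
store.rows.keys.each { |id| store.update_all([id], 'canceled') }

puts store.write_count
```

With 1000 cancelable jobs the loop issues one write per job, which is the behavior this issue proposes to batch.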

Related Discussion

From !199937 (merged):

I think this is possible, it's not doing that much so we probably could do a batch database update, but that would mean we would need to not use the state machine event.

@fabiopitino raised: "Can we make cancel_jobs more resilient at scale? Imagine that a pipeline has 1000 cancelable jobs."

Proposal

Investigate and implement batch database updates for job cancellation to improve performance when canceling pipelines with many jobs. This would likely require:

  1. Not using the state machine event for individual jobs
  2. Batch updating job statuses directly in the database
  3. Ensuring all necessary side effects of the state machine are still handled correctly
  4. Maintaining data consistency and proper state transitions

Rough idea:

def target_cancel_status_for(job)
  # TODO: Add preloads so these won't trigger an N+1
  if job.running? && job.supports_canceling?
    'canceling'
  else
    'canceled'
  end
end

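def batch_cancel_jobs(jobs)
  # Use one timestamp so every row in this cancellation shares the same value
  now = Time.current
  jobs_by_status = jobs.group_by { |job| target_cancel_status_for(job) }

  jobs_by_status.each do |target_status, job_group|
    job_ids = job_group.map(&:id)

    attributes = { status: target_status, updated_at: now }
    # A job in 'canceling' is still winding down, so only stamp finished_at
    # once it is actually 'canceled'
    attributes[:finished_at] = now if target_status == 'canceled'

    # update_all issues a single UPDATE and skips callbacks and validations
    CommitStatus.where(id: job_ids).update_all(attributes)

    # Handle necessary side effects that the state machine would normally trigger
    # (e.g., cleanup, notifications, etc.)
  end
end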
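
One possible refinement, sketched here in plain Ruby so it runs without Rails, is to cap how many IDs go into each UPDATE so that a very large pipeline does not produce one enormous statement. The `Job` struct, `BATCH_SIZE`, and `plan_batches` are hypothetical names for illustration only. In the real service each planned batch would map to one `update_all` call.

```ruby
# Hypothetical minimal job record; the real model has many more attributes.
Job = Struct.new(:id, :status, :supports_canceling, keyword_init: true)

BATCH_SIZE = 100 # assumed value, would need tuning against real workloads

def target_cancel_status_for(job)
  job.status == 'running' && job.supports_canceling ? 'canceling' : 'canceled'
end

# Returns [target_status, ids] pairs. Each pair represents one UPDATE.
def plan_batches(jobs, batch_size: BATCH_SIZE)
  jobs.group_by { |job| target_cancel_status_for(job) }.flat_map do |target_status, group|
    group.map(&:id).each_slice(batch_size).map { |ids| [target_status, ids] }
  end
end

# 250 running jobs that support canceling, 750 pending jobs.
jobs = (1..1000).map do |id|
  Job.new(id: id, status: id <= 250 ? 'running' : 'pending', supports_canceling: true)
end

batches = plan_batches(jobs)
puts batches.size
```

Under these assumptions, 1000 jobs are covered by 11 planned statements (3 for 'canceling', 8 for 'canceled') instead of 1000 individual transitions. The side effects skipped by bypassing the state machine would still need to be handled separately.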

Benefits

  • Improved performance for canceling large pipelines
  • Reduced database load when processing many job cancellations
  • More resilient cancellation process at scale
Edited by Allison Browne