Fix zombie entries in running builds table

Summary

The ci_running_builds table contains "zombie" entries - records that persist for builds that are no longer in running status. These orphaned records caused Ci::TimedOutBuilds::DropRunningWorker to enter an infinite loop, processing the same non-running builds repeatedly since they never get dropped from the running builds table.

Scope of Work

  • Understand the root cause of zombie entries
  • Prevent zombie entries from appearing or periodically clean them up
  • Fix existing zombie entries (data remediation)

Steps to reproduce

We believe the issue happens roughly as follows. We thought drop! was supposed to include the running builds cleanup in its transaction, so some steps are still unknowns (❓).

  1. A CI build transitions to running state, creating a Ci::RunningBuild entry
  2. ❓ The build terminates abnormally (e.g., runner disconnect, infrastructure failure) without the after_transition (running: any) callback executing
  3. ❓ The ci_running_builds record persists despite the build no longer being in running status
  4. Ci::TimedOutBuilds::DropRunningWorker queries Ci::RunningBuild to find timed-out builds
  5. The worker loops indefinitely over zombie records since the .drop operation succeeds but doesn't remove the running build entry for non-running builds
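The loop in steps 4 and 5 can be illustrated with a small plain-Ruby simulation. This is only a sketch under stated assumptions: Build, simulate_drop_loop, the batch size of 2, and the max_iterations cap are hypothetical stand-ins and not GitLab's actual classes or worker code. It shows that a batch loop never drains the table when dropping a non-running build leaves its tracking entry in place.

```ruby
# Hypothetical in-memory stand-in for a CI build (not Ci::Build).
Build = Struct.new(:id, :status)

# Simulates a worker that repeatedly fetches a batch of tracking entries
# and drops the matching builds. Entries are removed only when a build
# really transitions out of :running, so zombie entries stay forever.
# max_iterations exists only so this sketch terminates.
def simulate_drop_loop(running_entries, builds, max_iterations: 5)
  iterations = 0

  loop do
    batch = running_entries.first(2)        # "find timed-out builds"
    break if batch.empty?                   # normal exit: table drained
    break if iterations >= max_iterations   # sketch-only guard

    iterations += 1

    batch.each do |build_id|
      build = builds.fetch(build_id)
      next unless build.status == :running  # zombie: the drop changes nothing

      build.status = :failed
      running_entries.delete(build_id)      # cleanup happens only on a real transition
    end
  end

  iterations
end
```

With one zombie entry in the table the simulation only stops at the cap, while a table that holds only running builds drains after a single pass.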

Example Project

This was observed on GitLab.com production - see INC-7385.

What is the current bug behavior?

  1. ci_running_builds contains records for builds that are not in running status (failed, canceled, success, etc.)
  2. Ci::TimedOutBuilds::DropRunningWorker processes these zombie records repeatedly in an infinite loop
  3. The worker never completes, causing timeout detection to fail for legitimately timed-out builds

What is the expected correct behavior?

  1. ci_running_builds should only contain entries for builds that are actually in running status
  2. When a build transitions out of running, its ci_running_builds entry should always be deleted
  3. The timeout worker should complete successfully without getting stuck on stale entries
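The invariant in points 1 and 2 can be expressed as a check over the tracking entries. The sketch below uses plain Ruby collections. zombie_entries and remove_zombies! are hypothetical helper names, and treating an entry whose build has no known status as a zombie is an assumption of this example, not confirmed behavior.

```ruby
# Returns the build ids that have a tracking entry but are not running.
# entry_build_ids: Array of build ids present in the tracking table.
# statuses: Hash of build id => status symbol.
# Assumption: an entry whose build has no known status counts as a zombie.
def zombie_entries(entry_build_ids, statuses)
  entry_build_ids.reject { |id| statuses[id] == :running }
end

# Removes zombie entries in place and returns the removed ids,
# as a periodic cleanup or one-off remediation pass might.
def remove_zombies!(entry_build_ids, statuses)
  zombies = zombie_entries(entry_build_ids, statuses)
  entry_build_ids.reject! { |id| zombies.include?(id) }
  zombies
end
```

After remove_zombies! runs, zombie_entries returns an empty array for the same inputs, which is the state the table is expected to be in at all times.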

Relevant logs and/or screenshots

  • Production logs showing 4+ minute execution times: https://log.gprd.gitlab.net/app/r/s/MgBTK
  • Related incident: INC-7385 Ci::TimedOutBuilds::DropRunningWorker stopped completing

Output of checks

This bug happens on GitLab.com

Technical Context

Affected components:

  • app/models/ci/running_build.rb - Model for tracking running builds
  • app/services/ci/timed_out_builds/drop_running_service.rb - Service that drops timed-out builds
  • app/workers/ci/timed_out_builds/drop_running_worker.rb - Worker that executes the service
  • app/services/ci/update_build_queue_service.rb - Service that creates/deletes running build entries

Related MRs:

  • !223283 (merged) - Added .running filter as a workaround
  • !223223 - Added iteration timeout to prevent infinite loops
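The two mitigations above can be sketched together in plain Ruby. This is an illustration under assumptions, not the code from either MR: drop_with_budget, the batch size of 2, and the injectable clock are hypothetical. Filtering on running status keeps zombie entries out of each batch, and the time budget acts as a backstop if the loop still fails to drain.

```ruby
# Default clock: monotonic time, so wall-clock adjustments do not matter.
MONOTONIC_CLOCK = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

# entry_build_ids: Array of build ids present in the tracking table.
# statuses: Hash of build id => status symbol (mutated when a build is dropped).
def drop_with_budget(entry_build_ids, statuses, budget_seconds:, clock: MONOTONIC_CLOCK)
  deadline = clock.call + budget_seconds
  processed = []

  loop do
    # Backstop: stop once the time budget is used up.
    return { processed: processed, timed_out: true } if clock.call >= deadline

    # Workaround: consider only entries whose build is still running.
    batch = entry_build_ids.select { |id| statuses[id] == :running }.first(2)
    break if batch.empty?

    batch.each do |id|
      statuses[id] = :failed
      processed << id
    end
  end

  { processed: processed, timed_out: false }
end
```

Neither mitigation removes the zombie entries themselves, so the root-cause fix and data remediation listed under Scope of Work are still needed.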

Possible fixes

Edited Feb 18, 2026 by Hordur Freyr Yngvason