Fix zombie entries in running builds table
Summary
The ci_running_builds table contains "zombie" entries - records that persist for builds that are no longer in running status. These orphaned records caused Ci::TimedOutBuilds::DropRunningWorker to enter an infinite loop, processing the same non-running builds repeatedly since they never get dropped from the running builds table.
Scope of Work
- Understand the root cause of zombie entries
- Prevent zombie entries from appearing or periodically clean them up
- Fix existing zombie entries (data remediation)
Steps to reproduce
We believe the issue happens roughly as follows. We thought drop! was supposed to include the running builds cleanup in its transaction, so there are some unknowns (marked ❓):
- A CI build transitions to running state, creating a Ci::RunningBuild entry
- ❓ The build terminates abnormally (e.g., runner disconnect, infrastructure failure) without the after_transition running: any callback executing
- ❓ The ci_running_builds record persists despite the build no longer being in running status
- Ci::TimedOutBuilds::DropRunningWorker queries Ci::RunningBuild to find timed-out builds
- The worker loops indefinitely over zombie records since the .drop operation succeeds but doesn't remove the running build entry for non-running builds
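The suspected loop can be illustrated with a small, self-contained simulation. This is only a sketch under the assumptions stated above: Build, drop, and drain_timed_out are hypothetical in-memory stand-ins, not the actual GitLab classes or methods, and the iteration cap exists only so the example terminates.

```ruby
# Hypothetical in-memory stand-ins; none of these are the real GitLab classes.
Build = Struct.new(:id, :status)

# Assumed behavior: the tracking row is removed only as part of the
# transition out of :running. For a build that already left :running,
# the call still reports success but the row stays behind.
def drop(build, running_table)
  if build.status == :running
    build.status = :failed
    running_table.delete(build.id)
  end
  true
end

# Simulates a worker that keeps processing the tracking table until it is
# empty. max_iterations is a safety cap for this sketch only.
def drain_timed_out(running_table, builds, max_iterations: 5)
  iterations = 0
  while iterations < max_iterations
    batch = running_table.dup
    break if batch.empty?

    batch.each { |build_id| drop(builds.fetch(build_id), running_table) }
    iterations += 1
  end
  iterations
end

builds = { 1 => Build.new(1, :running), 2 => Build.new(2, :failed) }
running_table = [1, 2] # build 2 is a zombie entry: tracked, but not running

iterations = drain_timed_out(running_table, builds)
puts "iterations=#{iterations} leftover=#{running_table.inspect}"
```

In this model the legitimately running build is dropped on the first pass, while the zombie entry for build 2 is returned on every pass, so the loop only stops because of the cap.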
Example Project
This was observed on GitLab.com production - see INC-7385.
What is the current bug behavior?
- ci_running_builds contains records for builds that are not in running status (failed, canceled, success, etc.)
- Ci::TimedOutBuilds::DropRunningWorker processes these zombie records repeatedly in an infinite loop
- The worker never completes, causing timeout detection to fail for legitimately timed-out builds
What is the expected correct behavior?
- ci_running_builds should only contain entries for builds that are actually in running status
- When a build transitions out of running, its ci_running_builds entry should always be deleted
- The timeout worker should complete successfully without getting stuck on stale entries
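The invariant above (every tracked entry points at a build that is actually running) can be expressed as a simple check plus cleanup. The sketch below uses plain Ruby collections rather than the real models; zombie_entries and remove_zombies! are hypothetical helper names, and treating an entry whose build is missing as a zombie is an assumption of this example.

```ruby
# Hypothetical helpers operating on in-memory data, not on the real tables.

# Returns tracked build ids whose build is not in "running" status.
# An id with no known status (nil) is also treated as a zombie here.
def zombie_entries(tracked_build_ids, status_by_build_id)
  tracked_build_ids.reject { |id| status_by_build_id[id] == "running" }
end

# Removes zombie ids from the tracked list in place and returns them.
def remove_zombies!(tracked_build_ids, status_by_build_id)
  zombies = zombie_entries(tracked_build_ids, status_by_build_id)
  tracked_build_ids.reject! { |id| zombies.include?(id) }
  zombies
end

statuses = { 10 => "running", 11 => "failed", 12 => "success" }
tracked = [10, 11, 12]

removed = remove_zombies!(tracked, statuses)
puts "removed=#{removed.inspect} tracked=#{tracked.inspect}"
```

The same predicate could serve both for one-off remediation of existing entries and for a periodic cleanup, which are two of the options listed under Scope of Work.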
Relevant logs and/or screenshots
- Production logs showing 4+ minute execution times: https://log.gprd.gitlab.net/app/r/s/MgBTK
- Related incident: INC-7385 Ci::TimedOutBuilds::DropRunningWorker stopped completing
Output of checks
This bug happens on GitLab.com
Technical Context
Affected components:
- app/models/ci/running_build.rb - Model for tracking running builds
- app/services/ci/timed_out_builds/drop_running_service.rb - Service that drops timed-out builds
- app/workers/ci/timed_out_builds/drop_running_worker.rb - Worker that executes the service
- app/services/ci/update_build_queue_service.rb - Service that creates/deletes running build entries
Related MRs:
- !223283 (merged) - Added .running filter as a workaround
- !223223 - Added iteration timeout to prevent infinite loops
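As a rough illustration of how the two mitigations fit together, the sketch below combines a status filter with a time budget. It is not the code from either MR: process_with_safeguards, the hash-based entries, and the injectable clock are all hypothetical and exist only to make the idea runnable.

```ruby
# Hypothetical sketch: entries are plain hashes, not ActiveRecord objects.
# The clock is injectable so the time budget can be exercised deterministically.
MONOTONIC_CLOCK = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

def process_with_safeguards(entries, budget_seconds:, clock: MONOTONIC_CLOCK, &handler)
  deadline = clock.call + budget_seconds
  processed = []

  loop do
    # Safeguard 1: only consider entries whose build is still running,
    # so zombie entries never enter a batch.
    batch = entries.select { |entry| entry[:status] == :running }
    break if batch.empty?
    # Safeguard 2: stop once the time budget is used up, even if
    # some entries keep reappearing.
    break if clock.call >= deadline

    batch.each do |entry|
      handler.call(entry)
      processed << entry[:id]
    end
  end

  processed
end

entries = [
  { id: 1, status: :running },
  { id: 2, status: :failed } # zombie entry, skipped by the filter
]

processed = process_with_safeguards(entries, budget_seconds: 60) do |entry|
  entry[:status] = :failed # stand-in for dropping the build
end
puts "processed=#{processed.inspect}"
```

Both safeguards bound the symptom rather than remove the cause: the filter keeps zombie entries out of the batch and the budget bounds the run time, but the stale rows remain in the table until they are cleaned up.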
Possible fixes
Edited by Hordur Freyr Yngvason