2026-02-12: Ci::TimedOutBuilds::DropRunningWorker stopped completing

Ci::TimedOutBuilds::DropRunningWorker stopped completing (Severity 4)

Problem: Since February 5th, Ci::TimedOutBuilds::DropRunningWorker has not completed its runs. The worker stopped emitting 'done' job status logs, and its logs show millions of records processed but only hundreds of unique records, which indicates that the same records are being processed repeatedly.
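
The gap between total and unique records is the signal that the same builds are being revisited. A minimal sketch of that comparison follows, assuming the processed build IDs have already been extracted from the worker logs into an array. The `processing_stats` helper and the sample IDs are hypothetical and are not taken from the incident data.

```ruby
# Hypothetical helper: summarizes how often build IDs repeat in a list
# of processed-record IDs pulled from worker logs.
def processing_stats(build_ids)
  counts = build_ids.tally # maps each ID to the number of times it appears
  {
    total: build_ids.size,             # every processed record, repeats included
    unique: counts.size,               # distinct build IDs
    max_repeats: counts.values.max || 0 # highest repeat count, 0 for empty input
  }
end

# Made-up sample: build 101 appears three times, 102 twice, 103 once.
sample_ids = [101, 102, 101, 103, 101, 102]
puts processing_stats(sample_ids).inspect
```

A total far above the unique count, together with a high `max_repeats`, points to reprocessing rather than a genuinely large backlog.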

Impact: No deployments or feature flags are currently blocked, and there is no reported customer impact.

Causes: The worker's fetch function runs in an infinite loop: it repeatedly processes the same 'zombie' builds, which never get dropped because their entries linger in the running builds table. Interruptions such as OOM events or deployments kill the worker without raising exceptions, so no 'done' status log is written and the job is re-queued despite deduplication settings.
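
The failure mode can be illustrated with a simplified batch loop. This is a sketch under stated assumptions, not the worker's actual code: `fetch_batch` and `drop` are hypothetical callables standing in for the worker's fetch and drop steps, and the guard shown is only one possible way to stop such a loop, not necessarily what the merge request did.

```ruby
require 'set'

# Hypothetical sketch: process batches until none remain, but stop as soon
# as a batch contains only IDs already seen in this run.
#   fetch_batch: callable returning an array of build IDs still listed as running
#   drop:        callable that tries to drop a build, returning true on success
def drop_timed_out_builds(fetch_batch:, drop:, max_iterations: 1_000)
  seen = Set.new
  dropped = []

  max_iterations.times do
    batch = fetch_batch.call
    break if batch.empty?

    new_ids = batch.reject { |id| seen.include?(id) }
    # No unseen IDs means the fetch keeps returning the same "zombie"
    # builds; continuing would loop without making progress.
    break if new_ids.empty?

    new_ids.each do |id|
      seen << id
      dropped << id if drop.call(id)
    end
  end

  { dropped: dropped, stuck: (seen - dropped).to_a }
end

# Made-up data: build 2 is a zombie whose drop never succeeds.
running = [1, 2, 3]
fetch = -> { running.dup }
drop = lambda do |id|
  next false if id == 2

  running.delete(id)
  true
end

result = drop_timed_out_builds(fetch_batch: fetch, drop: drop)
puts result.inspect
```

Without the `new_ids.empty?` check and the iteration cap, the loop above would fetch build 2 indefinitely, which matches the repeated-processing pattern described in the logs.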

Response strategy: The worker was disabled in production and staging to reduce database load and prevent duplicate processing. A merge request was added to stop the infinite loop. The worker was then re-enabled, and all metrics are normal.

https://gitlab.slack.com/archives/C0470TMUPBL/p1771323441849209?thread_ts=1771316165.284489&cid=C0470TMUPBL

This ticket was created to track INC-7385, by incident.io 🔥

Edited Feb 17, 2026 by GitLab Infrastructure Service - incident.io