2025-09-02: Jobs stuck in running indefinitely
Jobs stuck in running indefinitely (Severity 3, Medium)
Problem: Since early August, some CI jobs remain indefinitely stuck in the running state because runners never receive them, even though GitLab logs show the jobs as successfully delivered. Clients receive no response, and the problem is consistently reproducible with large pipelines. Intermittent connection resets are also seen in API calls and Git HTTP calls.
Impact: More than 20 customers have reported CI jobs stuck in the running state. Sporadic failures are also seen in API calls. The issue is reproducible with large pipelines and is not limited to any specific region or cluster.
Causes: Switching from kube-proxy to Cilium in our Kubernetes clusters causes abrupt connection drops, especially when Puma is under high load and readiness probes fail. This has been reproduced under load in environments using Cilium, with NGINX error logs showing spikes in 'Connection reset by peer' and Workhorse logs surfacing 'Handler aborted connection' messages. These findings match reports from both SaaS and self-managed customers running Cilium.
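For illustration only, the Go sketch below shows one way to reproduce and count the resets described above: it issues concurrent requests against an endpoint and tallies 'connection reset by peer' errors. The target URL, concurrency, and request counts are placeholders, not the actual harness used during the investigation.

```go
// repro.go: minimal load sketch to surface 'connection reset by peer' errors
// under concurrency. Endpoint and load parameters are placeholders.
package main

import (
	"errors"
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	const (
		target      = "https://gitlab.example.com/api/v4/projects" // placeholder URL
		concurrency = 200
		requests    = 5000
	)

	client := &http.Client{Timeout: 10 * time.Second}
	var resets, failures int64
	sem := make(chan struct{}, concurrency)
	var wg sync.WaitGroup

	for i := 0; i < requests; i++ {
		wg.Add(1)
		sem <- struct{}{}
		go func() {
			defer wg.Done()
			defer func() { <-sem }()
			resp, err := client.Get(target)
			if err != nil {
				atomic.AddInt64(&failures, 1)
				// ECONNRESET corresponds to the 'Connection reset by peer'
				// lines seen in the NGINX error logs.
				if errors.Is(err, syscall.ECONNRESET) {
					atomic.AddInt64(&resets, 1)
				}
				return
			}
			resp.Body.Close()
		}()
	}
	wg.Wait()
	fmt.Printf("failures=%d connection_resets=%d\n", failures, resets)
}
```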
Response strategy: We have raised a merge request to increase the Puma readiness probe interval and failure thresholds from 3/2/2 to 6/4/2, aiming to reduce timeouts and pod flapping. Work is underway to move readiness checks into Workhorse for more accurate health monitoring, with related merge requests in progress. The team is also monitoring upstream Cilium changes for future improvements.
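As a rough illustration of the Workhorse-side readiness idea (not the contents of the in-progress merge requests), the sketch below shows a readiness endpoint that only reports ready when an upstream Puma /-/readiness check succeeds within a short deadline. The ports, paths, and handler wiring are assumptions for the example.

```go
// readiness.go: hypothetical sketch of a Workhorse-style readiness endpoint
// that reflects whether the upstream Puma process can actually serve traffic.
package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

const pumaReadiness = "http://127.0.0.1:8080/-/readiness" // assumed Puma address

func readinessHandler(w http.ResponseWriter, r *http.Request) {
	// Bound the upstream check so a slow Puma cannot hang the probe itself.
	ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, pumaReadiness, nil)
	if err != nil {
		http.Error(w, "readiness check setup failed", http.StatusInternalServerError)
		return
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// Puma is unreachable or too slow: report not ready so the pod is
		// removed from Service endpoints before clients see connection resets.
		http.Error(w, "upstream puma not ready", http.StatusServiceUnavailable)
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		http.Error(w, "upstream puma not ready", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/readiness", readinessHandler)
	log.Fatal(http.ListenAndServe(":8181", nil)) // assumed probe port
}
```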
This ticket was created by incident.io to track INC-3683.