2025-09-02: Jobs stuck in running indefinitely

Jobs stuck in running indefinitely (Severity 3)

Problem: Since early August, some CI jobs have remained stuck in the running state indefinitely because runners never received them, even though GitLab logs showed successful job delivery. Runners received no response to their job requests, and the problem was consistently reproducible with large pipelines. Similar connection resets were also seen intermittently in API calls and Git HTTP calls.

Impact: More than 20 customers have reported CI jobs stuck in the running state, and sporadic failures were also seen in API requests. After shifting all traffic for /api/v4/jobs/request to the ci-jobs-api deployment, we are no longer seeing 'unexpected EOF' errors in runner logs. Some NGINX 'connection reset by peer' errors still occur during Kubernetes pod scale-downs, but the risk of jobs getting stuck appears reduced.

Causes: Switching from kube-proxy to Cilium in our Kubernetes clusters causes abrupt connection drops, especially when Puma is under high load and its readiness probes fail. The behavior was reproduced under load in environments using Cilium: NGINX error logs showed 'Connection reset by peer' and Workhorse logs reported 'Handler aborted connection' messages. These findings match reports from both SaaS and self-managed customers running Cilium.
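
For context, the runner-side symptom can be made concrete with a minimal long-poll client sketch (this is not GitLab Runner's code; the host, payload, and timeout below are illustrative). A TCP reset while the request is held open surfaces to the client only as a transport error such as 'unexpected EOF', even though the server side may already have logged the job as delivered.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Minimal long-poll client: POST a job request and wait while the
	// coordinator holds the request open looking for a job. If the proxy
	// resets the TCP connection mid-request, the Go HTTP client reports
	// "unexpected EOF" / "connection reset by peer".
	client := &http.Client{Timeout: 60 * time.Second}

	body := bytes.NewBufferString(`{"token":"REDACTED"}`) // illustrative payload
	resp, err := client.Post(
		"https://gitlab.example.com/api/v4/jobs/request", // placeholder host
		"application/json",
		body,
	)
	if err != nil {
		// This is where the stuck-job symptom appears: the server may have
		// recorded the job as delivered, but the client only sees a
		// transport error and never starts the job.
		fmt.Println("job request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("coordinator responded with status:", resp.StatusCode)
}
```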

Response strategy: We shifted all traffic for /api/v4/jobs/request to the ci-jobs-api deployment, which has reduced 'unexpected EOF' errors in runner logs. Logging inconsistencies are being addressed. To further reduce errors during shutdown, we have created a follow-up issue to enhance Workhorse's graceful termination for long-polling requests. A related merge request proposes returning a 204 status for long-poll requests during shutdown to prevent jobs from getting stuck.
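
The idea behind that merge request can be illustrated with a simplified Go sketch (this is not Workhorse's actual implementation; the handler, port, and timeouts are placeholders). On SIGTERM, which Kubernetes sends during pod scale-down, in-flight long-poll requests to /api/v4/jobs/request are released with 204 No Content, so the runner treats the response as "no job available" and polls again rather than having a job assigned over a connection that is about to be torn down.

```go
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// drainCh is closed when shutdown begins; every in-flight long-poll
	// handler is released as soon as that happens.
	drainCh := make(chan struct{})

	mux := http.NewServeMux()
	mux.HandleFunc("/api/v4/jobs/request", func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(50 * time.Second): // illustrative long-poll window; a real job lookup goes here
			w.WriteHeader(http.StatusNoContent)
		case <-drainCh:
			// Shutting down: answer with 204 No Content instead of letting
			// the connection be reset mid-request, so the runner simply
			// retries later against a healthy pod.
			w.WriteHeader(http.StatusNoContent)
		case <-r.Context().Done():
			// Client went away; nothing to send.
		}
	})

	srv := &http.Server{Addr: ":8080", Handler: mux}
	go srv.ListenAndServe()

	// Wait for SIGTERM, release all pending long-polls, then shut down
	// gracefully with a bounded drain period.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	close(drainCh)
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	srv.Shutdown(ctx)
}
```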


This ticket was created to track INC-3683, by incident.io 🔥