Error raised in /jobs/request caused job to be stuck in running state
What does this MR do and why?
This MR fixes an infradev issue coming from an incident where errors during metrics tracking in the CI job registration process were causing jobs to get stuck in a pending state. The problem occurred when exceptions were raised during metrics collection in /jobs/request, preventing the service from properly returning job results to runners.
Key changes:
- Wraps metrics tracking calls in exception handling to prevent failures from blocking job assignment
- Extracts metrics tracking logic into separate methods (`track_success` and `track_conflict`) with proper error handling
- Uses `track_and_raise_for_dev_exception` to ensure we're aware of metrics issues in development while not affecting production job processing
- Adds comprehensive test coverage for metrics error scenarios
Why this matters: When metrics tracking failed, the entire job registration process would fail, leaving jobs in a pending state indefinitely. This fix ensures that metrics failures don't prevent runners from receiving job assignments, maintaining CI/CD pipeline reliability.
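The pattern described in the key changes can be sketched in plain Ruby as follows. This is an illustrative sketch, not the code from this MR: `SketchErrorTracking`, `FailingMetrics`, `SketchResult`, and `SketchRegisterJobService` are hypothetical stand-ins, and the tracker only mimics the idea behind `track_and_raise_for_dev_exception` (record the error, re-raise only in development-like environments).

```ruby
# Hypothetical stand-in for an error tracker. It records every exception and
# re-raises it only when `raise_in_dev` is set (simulating dev/test).
module SketchErrorTracking
  class << self
    attr_accessor :raise_in_dev

    def tracked
      @tracked ||= []
    end

    def track_and_raise_for_dev_exception(exception)
      tracked << exception
      raise exception if raise_in_dev
    end
  end
end

# Hypothetical metrics object whose tracking call always fails.
class FailingMetrics
  def register_success(_job)
    raise StandardError, 'metrics failure'
  end
end

# Hypothetical result object returned to the runner.
SketchResult = Struct.new(:job, :valid)

class SketchRegisterJobService
  def initialize(metrics)
    @metrics = metrics
  end

  def execute(job)
    # The job assignment is decided first; metrics are tracked afterwards
    # and must not be able to discard the result.
    result = SketchResult.new(job, true)
    track_success(job)
    result
  end

  private

  # Metrics tracking is isolated in its own method so that a failure is
  # reported to the tracker instead of propagating to the caller.
  def track_success(job)
    @metrics.register_success(job)
  rescue StandardError => e
    SketchErrorTracking.track_and_raise_for_dev_exception(e)
  end
end
```

With `SketchErrorTracking.raise_in_dev` left unset (the production-like case), `SketchRegisterJobService.new(FailingMetrics.new).execute(:build_1)` still returns a valid result and the exception is only recorded. With it set to `true`, the same call raises, so the problem is visible during development.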
References
Closes #348673 (closed)
How to set up and validate locally
- Set up a GitLab development environment with runners configured
- Create a test pipeline with jobs
- To simulate the error condition, you can monkey-patch the metrics service:
    # In rails console (rspec-mocks is not loaded there by default)
    require 'rspec/mocks/standalone'
    allow_any_instance_of(::Gitlab::Ci::Queue::Metrics).to receive(:register_success)
      .and_raise(StandardError, 'metrics failure')
Edited by Allison Browne