Backend: Error raised in /jobs/request caused job to be stuck in running state
Problem
This issue was reported internally after a pipeline had jobs stuck that then timed out.
In that case, an `Error: canceling statement due to statement timeout` was raised after a job had been successfully assigned to a Runner (on GitLab Rails), but the request then returned a 500 error, so the Runner moved on to another job request.
This caused the jobs to be stuck in the running state until they were killed by StuckCiJobsWorker after the job timeout.
Root cause
After we assign the job to the runner, we should make sure that we can always return the response to the runner successfully. In this case, after the job was assigned to the runner, an exception was raised in `register_success`.
The specific root cause for this is being handled in #348674 (closed). In this issue we should add mitigations to prevent exceptions from causing larger problems.
Any operations after `process_build` should not be business critical, and we should still be able to continue with returning the job to the runner.
Proposal
We could wrap any operations after assigning the job to the runner in a begin-rescue block, track any exceptions, and continue. We can afford to lose some Prometheus metrics for some job operations, but we cannot afford to leave jobs stuck in the running state.
```ruby
if result.valid?
  track_exceptions do
    @metrics.register_success(result.build)
    @metrics.observe_queue_depth(:found, depth)
  end

  return result # rubocop:disable Cop/AvoidReturnFromBlocks
else
  # ...
end

# ...

def track_exceptions
  yield
rescue StandardError => e
  Gitlab::ErrorTracking.track_and_raise_for_dev_exception(e)
end
```
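As a minimal, self-contained sketch of this pattern (using a hypothetical `ErrorTracking` module as a stand-in for `Gitlab::ErrorTracking`, which in production only reports the exception and does not re-raise), the response path survives a failing metrics call:

```ruby
# Hypothetical stand-in for Gitlab::ErrorTracking; names are illustrative.
module ErrorTracking
  def self.track_and_raise_for_dev_exception(error)
    # In production we would only report; re-raising happens in dev/test only.
    warn "tracked exception: #{error.class}: #{error.message}"
    raise error if ENV['RAILS_ENV'] == 'development'
  end
end

# Wraps non-critical work so an exception cannot break the response path.
def track_exceptions
  yield
rescue StandardError => e
  ErrorTracking.track_and_raise_for_dev_exception(e)
end

# Even if the metrics call raises, we still return the result to the runner.
def register_job
  result = "job-payload"

  track_exceptions do
    raise "Prometheus metrics backend unavailable" # simulate a failure
  end

  result
end

puts register_job # => job-payload
```

The key design point is that `track_exceptions` rescues `StandardError` (not `Exception`), so programming errors like `NoMethodError` are still tracked while process-level signals are left alone.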
cc @grzesiek
