Add panic recovery and structured error reporting to job status

Problem

When step.Run panicked inside Job.Run, the panic went unrecovered in the background goroutine and crashed the Step Runner process. In addition, clients had no way to distinguish infrastructure failures from step failures without parsing error message strings.

Solution

This MR adds panic recovery to job execution and introduces structured error reporting in the job status response.

Panic recovery:

  • Wraps step functions in function.Safe, a decorator that uses recover() to catch panics
  • Returns a failure result with an internal error containing the panic value and stack trace
  • The job processes the error through the normal return path with no special handling
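
For illustration, the decorator pattern described above can be sketched as follows. This is a minimal sketch, not the code from this MR: StepFunc, InternalError, and the "failure" result string are hypothetical stand-ins, and only recover() and runtime/debug.Stack() are real Go facilities.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// StepFunc is a hypothetical stand-in for the step function signature.
type StepFunc func() (result string, err error)

// InternalError is a hypothetical error type that carries the panic value
// and the stack trace captured at recovery time.
type InternalError struct {
	PanicValue any
	Stack      []byte
}

func (e *InternalError) Error() string {
	return fmt.Sprintf("internal error: panic: %v\n%s", e.PanicValue, e.Stack)
}

// Safe wraps fn so that a panic becomes a failure result plus an
// *InternalError instead of unwinding the calling goroutine.
func Safe(fn StepFunc) StepFunc {
	return func() (result string, err error) {
		// Named return values let the deferred function overwrite
		// what the caller receives after a panic.
		defer func() {
			if r := recover(); r != nil {
				result = "failure"
				err = &InternalError{PanicValue: r, Stack: debug.Stack()}
			}
		}()
		return fn()
	}
}

func main() {
	risky := func() (string, error) { panic("boom") }
	result, err := Safe(risky)()
	fmt.Println(result, err != nil)
}
```

Because the panic is turned into an ordinary error value, the caller needs no panic-specific branch; it handles the result the same way as any other failed step.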

Structured error reporting:

  • Adds StatusError proto message with message and kind fields
  • Introduces ErrorKind enum: unknown, step_failure, internal, cancelled
  • The error field in Status is populated when the job has an error
  • Clients can check error.kind to classify failures without parsing strings
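
A client-side check might look roughly like the sketch below. The Go types are hand-written stand-ins for whatever the proto compiler generates, and the returned labels are invented for the example; only the field names (message, kind) and the four enum values come from the description above.

```go
package main

import "fmt"

// ErrorKind mirrors the four enum values listed above.
type ErrorKind int

const (
	KindUnknown ErrorKind = iota
	KindStepFailure
	KindInternal
	KindCancelled
)

// StatusError is a hypothetical stand-in for the generated proto message.
type StatusError struct {
	Message string
	Kind    ErrorKind
}

// Status is reduced to the single field this sketch needs.
type Status struct {
	Error *StatusError // nil when the job has no error
}

// Describe classifies a status by kind alone, without inspecting Message.
// The labels are illustrative, not values defined by this MR.
func Describe(s Status) string {
	if s.Error == nil {
		return "no error"
	}
	switch s.Error.Kind {
	case KindStepFailure:
		return "step failed"
	case KindInternal:
		return "infrastructure fault"
	case KindCancelled:
		return "cancelled"
	default:
		return "unclassified"
	}
}

func main() {
	s := Status{Error: &StatusError{Message: "panic: boom", Kind: KindInternal}}
	fmt.Println(Describe(s))
}
```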

Error classification logic:

  • internal: Infrastructure faults (panics, bugs in the step runner)
  • cancelled: Context cancellation or deadline exceeded
  • step_failure: All other errors (non-zero exit, step errors)
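
The three rules above could be expressed along these lines. This is a sketch under assumptions: InternalError is a hypothetical marker type, and the order in which the checks run is a guess rather than something the description specifies. The function expects a non-nil error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

type ErrorKind string

const (
	KindStepFailure ErrorKind = "step_failure"
	KindInternal    ErrorKind = "internal"
	KindCancelled   ErrorKind = "cancelled"
)

// InternalError is a hypothetical marker for infrastructure faults.
type InternalError struct{ Msg string }

func (e *InternalError) Error() string { return e.Msg }

// Classify maps a non-nil error to a kind. errors.As and errors.Is look
// through wrapped errors, so the result does not depend on message text.
func Classify(err error) ErrorKind {
	var internal *InternalError
	switch {
	case errors.As(err, &internal):
		return KindInternal
	case errors.Is(err, context.Canceled), errors.Is(err, context.DeadlineExceeded):
		return KindCancelled
	default:
		return KindStepFailure
	}
}

// sampleKinds classifies one example error per rule.
func sampleKinds() []ErrorKind {
	return []ErrorKind{
		Classify(&InternalError{Msg: "panic: boom"}),
		Classify(fmt.Errorf("run step: %w", context.DeadlineExceeded)),
		Classify(errors.New("exit status 1")),
	}
}

func main() {
	fmt.Println(sampleKinds())
}
```

Treating step_failure as the fallback means any error type not explicitly recognised is reported as a step failure rather than as an infrastructure fault.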

Screenshot

How errors appear when this MR is paired with "Map step-runner errors to job failure reasons" (gitlab-runner!6597, merged).

[Screenshots: Before, After]

Reference

Relates to Recover panics from step-runner gRPC methods us... (#430 - closed)

Edited by Cameron Swords
