Add panic recovery and structured error reporting to job status
Problem
When `step.Run` panicked inside `Job.Run`, the panic went unrecovered in the background goroutine and crashed the Step Runner process. Additionally, clients had no way to distinguish infrastructure failures from step failures without parsing error message strings.
Solution
This MR adds panic recovery to job execution and introduces structured error reporting in the job status response.
Panic recovery:
- Wraps step functions in `function.Safe`, a decorator that uses `recover()` to catch panics
- Returns a failure result with an internal error containing the panic value and stack trace
- The job processes the error through the normal return path with no special handling
Structured error reporting:
- Adds `StatusError` proto message with `message` and `kind` fields
- Introduces `ErrorKind` enum: `unknown`, `step_failure`, `internal`, `cancelled`
- The `error` field in `Status` is populated when the job has an error
- Clients can check `error.kind` to classify failures without parsing strings
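
As a rough illustration of how a client might branch on `error.kind`, the sketch below uses hand-written Go types that mirror the message and enum described above. The Go identifiers, the numeric order of the enum values, and the `Describe` helper are all assumptions; the generated proto code will differ.

```go
package main

import "fmt"

// ErrorKind mirrors the enum values listed above.
// The Go names and numeric order are assumptions.
type ErrorKind int

const (
	KindUnknown ErrorKind = iota
	KindStepFailure
	KindInternal
	KindCancelled
)

// StatusError mirrors the proto message with message and kind fields.
type StatusError struct {
	Message string
	Kind    ErrorKind
}

// Status holds only the field relevant here; Error is nil when the job
// has no error.
type Status struct {
	Error *StatusError
}

// Describe is a hypothetical client helper that classifies a failure
// by kind, without inspecting the message string.
func Describe(s Status) string {
	if s.Error == nil {
		return "no error"
	}
	switch s.Error.Kind {
	case KindStepFailure:
		return "step failed"
	case KindInternal:
		return "step runner fault"
	case KindCancelled:
		return "job cancelled"
	default:
		return "unclassified failure"
	}
}

func main() {
	s := Status{Error: &StatusError{Message: "exit status 1", Kind: KindStepFailure}}
	fmt.Println(Describe(s))
}
```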
Error classification logic:
- `internal`: Infrastructure faults (panics, bugs in step runner)
- `cancelled`: Context cancellation or deadline exceeded
- `step_failure`: All other errors (non-zero exit, step errors)
Screenshot
How errors appear when this MR is paired with Map step-runner errors to job failure reasons (gitlab-runner!6597 - merged).
[Screenshots: Before | After]
Reference
Relates to Recover panics from step-runner gRPC methods us... (#430 - closed)
Edited by Cameron Swords

