Map step-runner errors to job failure reasons

Summary

When the step-runner panics and recovers via its gRPC interceptor, the resulting codes.Internal error was not being mapped to a specific failure reason. Additionally, the existing ClientStatusError handling used State instead of ErrorKind for classification, which didn't properly distinguish between different error types.

These issues caused failures to be misclassified, making it harder to distinguish runner infrastructure issues from user script errors in job metrics and debugging.

Changes

Error classification updates in common/build.go:

Maps gRPC errors from the step-runner to appropriate job failure reasons:

  • codes.Internal (gRPC handler panics) → ScriptFailure
  • ErrorInternal (step-runner internal errors) → ScriptFailure
  • ErrorStepFailure (standard step failures) → ScriptFailure
  • ErrorUnknown (unspecified errors) → UnknownFailure
  • ErrorCancelled → leaves failure reason unset (handled elsewhere)

Note: Both codes.Internal and ErrorInternal are intentionally classified as ScriptFailure rather than RunnerSystemFailure. A malicious job could deliberately trigger either path to forge a RunnerSystemFailure and evade job-failure accounting.

New error type in steps/execute.go:

Introduces ClientInternalError to distinguish gRPC-level codes.Internal errors (e.g., panic recovery via step-runner's interceptor) from ClientStatusError which reports a Status returned by step-runner.

Resource cleanup fix:

Adds defer stdout.Close() to properly close the build logger stream in executeStepStage.

Step-runner upgrade (v0.32.x → v0.34.0)

This MR upgrades step-runner to v0.34.0, which includes the following changes:

v0.34.0

  • Add structured error reporting to job status via StatusError with ErrorKind enum, enabling clients to distinguish infrastructure failures from step failures without parsing error strings (!469 (closed))

v0.33.0

  • Fix cloning steps when branch ref contains a forward slash (!481 (merged))
  • Catch panics in step function execution and return them as errors (!438 (merged))
  • Add Duo review rule to ensure MRs contain changelog entries (!448 (closed))
  • Recover panics in gRPC handlers and return codes.Internal to the caller instead of crashing the server goroutine (!457 (merged))
  • Implement Graceful Process Termination for Exec Function (!463 (merged))
  • Properly handle really long lines from subprocess output (!471 (closed))
  • Isolate builtin functions from step results by introducing a BuiltinContext interface (!475 (closed))
  • Add docker/auth builtin function for configuring Docker registry authentication (!484 (merged))

Screenshot

How errors appear when this MR is paired with Add panic recovery and structured error reporti... (step-runner!469 - merged).

Before After

Reference

Relates to Recover panics from step-runner gRPC methods us... (step-runner#430 - closed)

Edited by Cameron Swords

Merge request reports

Loading