Implement Graceful Function Termination/Cancellation

@cam_swords @ajwalker This is the next iteration of Implement Graceful Function Termination (!460 - closed), with slight changes to align with what we agreed on here.

The gist of the changes is (AI generated):

A new JobCtrl type (pkg/runner/jobctl.go) owns two things: a cancel channel and a GracefulExitDelay duration. It is created per-job in the service using the GracefulExitTimeout from the RunRequest proto (defaulting to 30s), then threaded through GlobalContext so every executing step can reach it.
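
As a rough illustration of the shape described above, here is a minimal sketch of such a type. Only the names JobCtrl, GracefulExitDelay, Cancel() and Done() come from this MR; the field layout, the NewJobCtrl constructor, the sync.Once guard and the isCancelled helper are assumptions made for the example and may not match pkg/runner/jobctl.go.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// defaultGracefulExitDelay mirrors the 30s default mentioned in the description.
const defaultGracefulExitDelay = 30 * time.Second

// JobCtrl is a hypothetical reconstruction: a cancel channel plus a grace period.
type JobCtrl struct {
	cancelC           chan struct{}
	once              sync.Once
	GracefulExitDelay time.Duration
}

// NewJobCtrl (assumed constructor) falls back to the default when no timeout is given.
func NewJobCtrl(gracefulExitTimeout time.Duration) *JobCtrl {
	if gracefulExitTimeout <= 0 {
		gracefulExitTimeout = defaultGracefulExitDelay
	}
	return &JobCtrl{
		cancelC:           make(chan struct{}),
		GracefulExitDelay: gracefulExitTimeout,
	}
}

// Cancel closes the cancel channel at most once; repeated calls are no-ops.
func (c *JobCtrl) Cancel() {
	if c == nil || c.cancelC == nil {
		return
	}
	c.once.Do(func() { close(c.cancelC) })
}

// Done returns the cancel channel. For a nil or zero-value JobCtrl it returns
// a nil channel, which never becomes ready in a select.
func (c *JobCtrl) Done() <-chan struct{} {
	if c == nil {
		return nil
	}
	return c.cancelC
}

// isCancelled reports, without blocking, whether Cancel has been called.
func isCancelled(c *JobCtrl) bool {
	select {
	case <-c.Done():
		return true
	default:
		return false
	}
}

func main() {
	ctrl := NewJobCtrl(0) // no timeout in the request, so the default applies
	fmt.Println(ctrl.GracefulExitDelay, isCancelled(ctrl)) // 30s false
	ctrl.Cancel()
	ctrl.Cancel() // safe: the channel is only closed once
	fmt.Println(isCancelled(ctrl)) // true
}
```

Closing the channel rather than sending on it lets every step that holds the JobCtrl observe the same cancellation.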

Cancellation signal propagation

Both Exec and Builtin step implementations now spin up a goroutine that watches jobCtrl.Done(). The Exec Function still directly implements cancellation, but the Builtin (meta?) Function no longer does. Instead, it's up to individual builtin implementations to watch jobCtrl.Done() and implement cancellation if necessary. The concrete Function does not yet do this, but I'll add that once this MR merges.

When that channel closes (via Cancel() on the job), the goroutine cancels the step's local cmdCtx, which triggers the graceful-exit path in gracefulexitcmd. This is a two-tier cancel model:

  • The outer ctx (from Job.Ctx) still handles the job timeout.
  • The inner cmdCtx is cancelled early on a Cancel() call, giving the process its grace period before SIGKILL.
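
The two-tier model can be sketched roughly as follows. This is not the code from the MR: runStep, the work callback and cancelledMidRun are hypothetical stand-ins, and the real graceful-exit handling in gracefulexitcmd (grace period, then SIGKILL) is replaced by a callback that simply waits on cmdCtx.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runStep is a hypothetical stand-in for a step's Run method. The outer ctx
// carries the job timeout; done plays the role of jobCtrl.Done().
func runStep(ctx context.Context, done <-chan struct{}, work func(context.Context) error) error {
	cmdCtx, cancel := context.WithCancel(ctx)
	defer cancel()

	// Watcher goroutine: cancel the inner context as soon as the job is
	// cancelled. It also exits when cmdCtx ends, so it cannot leak, even
	// when done is nil (no job control wired).
	go func() {
		select {
		case <-done:
			cancel()
		case <-cmdCtx.Done():
		}
	}()

	return work(cmdCtx)
}

// cancelledMidRun cancels a running step and returns the step's error
// together with the state of the outer (job timeout) context.
func cancelledMidRun() (stepErr, outerErr error) {
	outer, stop := context.WithTimeout(context.Background(), time.Minute)
	defer stop()

	done := make(chan struct{})
	started := make(chan struct{})
	go func() {
		<-started   // wait until the step is actually running
		close(done) // equivalent of Cancel() on the job
	}()

	stepErr = runStep(outer, done, func(cmdCtx context.Context) error {
		close(started)
		<-cmdCtx.Done() // a real command would begin its graceful exit here
		return cmdCtx.Err()
	})
	return stepErr, outer.Err()
}

func main() {
	stepErr, outerErr := cancelledMidRun()
	fmt.Println(stepErr, outerErr) // context canceled <nil>
}
```

Only the inner context is cancelled; the outer context is untouched, so the job timeout keeps its own meaning.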

In exec.go, after cmd.Run() returns, if cmdCtx.Err() != nil the error is joined with the command error so cancellation is correctly classified downstream.
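
A minimal sketch of that join, assuming errors.Join from the standard library is what "joined" refers to. joinCancel and wasCancelled are hypothetical helper names, and the "signal: killed" error stands in for whatever cmd.Run() returns.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// joinCancel (hypothetical helper) attaches the context error to the command
// error so that callers can detect cancellation with errors.Is.
func joinCancel(cmdCtx context.Context, runErr error) error {
	if ctxErr := cmdCtx.Err(); ctxErr != nil {
		return errors.Join(ctxErr, runErr)
	}
	return runErr
}

// wasCancelled reports whether err carries a context cancellation.
func wasCancelled(err error) bool {
	return errors.Is(err, context.Canceled)
}

// demo classifies the same command error before and after cancellation.
func demo() (before, after bool) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	killed := errors.New("signal: killed") // stand-in for the cmd.Run() error

	before = wasCancelled(joinCancel(ctx, killed))
	cancel()
	after = wasCancelled(joinCancel(ctx, killed))
	return before, after
}

func main() {
	before, after := demo()
	fmt.Println(before, after) // false true
}
```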

New Cancel gRPC API

A Cancel RPC was added to the proto (proto/step.proto) alongside the existing Close. Cancel signals the job to stop gracefully; Close still blocks until the job finishes (now for at most GracefulExitDelay plus a 500ms buffer) and then cleans up. The Job.finishC channel is now closed (rather than sent to) so multiple readers are correctly unblocked.
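
To illustrate why closing finishC matters, here is a small sketch of a bounded wait on a closed channel. The names finishC and closeRunReturnBuffer come from this MR; waitFinished and countUnblocked are hypothetical helpers, and the real Close() does more than this.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// closeRunReturnBuffer is the extra 500ms added on top of GracefulExitDelay.
const closeRunReturnBuffer = 500 * time.Millisecond

// waitFinished (hypothetical) blocks until finishC is closed or the grace
// period plus buffer elapses. It returns true if the job finished in time.
func waitFinished(finishC <-chan struct{}, gracefulExitDelay time.Duration) bool {
	timer := time.NewTimer(gracefulExitDelay + closeRunReturnBuffer)
	defer timer.Stop()

	select {
	case <-finishC:
		return true
	case <-timer.C:
		return false
	}
}

// countUnblocked starts several waiters on one finishC and reports how many
// of them observed the job finishing after a single close().
func countUnblocked(waiters int) int {
	finishC := make(chan struct{})
	results := make(chan bool, waiters)

	var wg sync.WaitGroup
	for i := 0; i < waiters; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			results <- waitFinished(finishC, time.Second)
		}()
	}

	close(finishC) // one close wakes every waiter; one send would wake only one
	wg.Wait()
	close(results)

	n := 0
	for ok := range results {
		if ok {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countUnblocked(3)) // 3
}
```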

Error classification

toStatusError in jobs.go maps errors to proto.StatusError_ErrorKind variants (cancelled, internal, step_failure). A matching ErrorCancelled client-side enum was added. The ErrorKind is surfaced in Status().
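
The mapping might look something like the sketch below. It is illustrative only: the ErrorKind constants, the stepError type, classify and demoKinds are hypothetical and do not reproduce the generated proto types or the actual toStatusError logic. The precedence (cancellation checked first) is also an assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// ErrorKind is a hypothetical stand-in for proto.StatusError_ErrorKind.
type ErrorKind int

const (
	KindInternal ErrorKind = iota
	KindCancelled
	KindStepFailure
)

func (k ErrorKind) String() string {
	switch k {
	case KindCancelled:
		return "cancelled"
	case KindStepFailure:
		return "step_failure"
	default:
		return "internal"
	}
}

// stepError is an assumed marker type for errors produced by a failing step.
type stepError struct{ err error }

func (e *stepError) Error() string { return "step failed: " + e.err.Error() }
func (e *stepError) Unwrap() error { return e.err }

// classify checks cancellation first, so a step that failed because it was
// cancelled is reported as cancelled rather than as a step failure.
func classify(err error) ErrorKind {
	var se *stepError
	switch {
	case errors.Is(err, context.Canceled):
		return KindCancelled
	case errors.As(err, &se):
		return KindStepFailure
	default:
		return KindInternal
	}
}

// demoKinds classifies one example error of each kind.
func demoKinds() string {
	failed := &stepError{err: errors.New("exit status 1")}
	return fmt.Sprintf("%v %v %v",
		classify(errors.Join(context.Canceled, failed)),
		classify(failed),
		classify(errors.New("disk full")),
	)
}

func main() {
	fmt.Println(demoKinds()) // cancelled step_failure internal
}
```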

Subtle points

  • jobCtrl.Done() returns nil for a zero-value JobCtrl, which blocks forever in a select — safe fallback for contexts where no ctrl is wired.
  • closeRunReturnBuffer (500ms) is added on top of GracefulExitDelay in Close() to give Function.Run time to observe the killed process and return before the timeout fires and forces a cancelled status.
  • finishC was changed from a buffered channel with a send to an unbuffered channel that is close()d, which is the correct pattern for broadcasting to multiple potential waiters.

Fixes Design proposal: Graceful cancellation for GitL... (#441)

Edited by Axel von Bertoldi
