Implement Graceful Function Termination

This MR is my version of Draft: Thread graceful cancel signal for steps ctx (!401). It adds graceful cancellation of running jobs, so that a step process receives SIGTERM and has time to clean up before being force-killed.

What changed

New gracefulexitcmd package

A thin wrapper around exec.Cmd that changes how process cancellation works:

  • Unix: puts the child in its own process group (Setpgid: true) and sends SIGTERM on cancel, then SIGKILL to the whole group after WaitDelay. This ensures background processes spawned by the step script are also cleaned up.
  • Windows: uses a Job Object with KILL_ON_JOB_CLOSE to track all descendants, and sends CTRL_BREAK on cancel.

New Proc type (pkg/runner/proc.go)

A small struct holding a cancel channel and GracefulExitDelay. It's the coordination point between the job lifecycle and the executing step — Cancel() closes the channel (safe to call multiple times via sync.Once). This is threaded through GlobalContext → StepsContext → Exec.

Cancellation flow in Exec (function/exec.go)

The exec function now listens on proc.CancelC() in a goroutine. When cancel fires, it cancels the command's context, triggering the gracefulexitcmd's SIGTERM → wait → SIGKILL sequence. Context errors are merged with the command error via errors.Join.

New Cancel gRPC API

proto/step.proto gains a Cancel(CancelRequest) → CancelResponse RPC, distinct from Close. Cancel signals graceful stop but keeps the job alive; Close calls Cancel first, then waits up to GracefulExitDelay + 500ms before marking the job canceled and cleaning up.

Subtle points

  • finishC channel is now closed (not sent to), so multiple waiters could use it, and Job.Run calls proc.Cancel() on completion — this is intentional so the goroutine in Exec that listens on CancelC() always unblocks when the job finishes normally too.
  • The gracefulExitDelay in the service has a TODO comment noting it should eventually come from RunRequest rather than being a service-level config.
  • The Close timeout is GracefulExitDelay + 500ms buffer to give the step process time to catch SIGTERM and exit cleanly before the job is force-marked canceled.

Closes Graceful Job Termination (#298 - closed)

Edited by Axel von Bertoldi

Merge request reports

Loading