Implement Graceful Function Termination
This MR is my take on !401 (Draft: Thread graceful cancel signal for steps ctx). It adds graceful cancellation of running jobs: on cancel, a step process receives SIGTERM and has time to clean up before being force-killed.
What changed
New gracefulexitcmd package
A thin wrapper around exec.Cmd that changes how process cancellation works:
- Unix: puts the child in its own process group (Setpgid: true) and sends SIGTERM on cancel, then SIGKILL to the whole group after WaitDelay. This ensures background processes spawned by the step script are also cleaned up.
- Windows: uses a Job Object with the kill-on-job-close limit (JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE) to track all descendants, and sends CTRL_BREAK_EVENT on cancel.
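The Unix behavior described above could look roughly like the following. This is a minimal sketch, not the gracefulexitcmd code from this MR: newGracefulCmd, runGraceful and demoCanceled are hypothetical names, and the sketch simplifies the timing by killing the group once Wait returns rather than running its own timer. It assumes a Unix system with sh available and Go 1.20+ (for Cmd.Cancel and Cmd.WaitDelay).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// newGracefulCmd is a hypothetical helper for illustration; it is not the
// gracefulexitcmd API from this MR.
func newGracefulCmd(ctx context.Context, delay time.Duration, name string, args ...string) *exec.Cmd {
	cmd := exec.CommandContext(ctx, name, args...)
	// Put the child in its own process group so descendants can be signaled together.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	// On context cancellation, send SIGTERM instead of os/exec's default kill.
	cmd.Cancel = func() error { return cmd.Process.Signal(syscall.SIGTERM) }
	// If the direct child is still running this long after cancellation,
	// os/exec kills it.
	cmd.WaitDelay = delay
	return cmd
}

// runGraceful waits for the command, then SIGKILLs whatever is left in its
// process group (for example, background processes started by a script).
func runGraceful(cmd *exec.Cmd) error {
	if err := cmd.Start(); err != nil {
		return err
	}
	// With Setpgid and no explicit Pgid, the group ID equals the child's PID.
	pgid := cmd.Process.Pid
	err := cmd.Wait()
	// A negative PID targets the whole group. The error is ignored because
	// the group may already be empty (ESRCH).
	_ = syscall.Kill(-pgid, syscall.SIGKILL)
	return err
}

// demoCanceled cancels a shell that traps SIGTERM and reports whether Wait
// surfaced the context cancellation.
func demoCanceled() bool {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	cmd := newGracefulCmd(ctx, 2*time.Second, "sh", "-c", "trap 'exit 0' TERM; sleep 30 & wait")
	time.AfterFunc(200*time.Millisecond, cancel)
	return errors.Is(runGraceful(cmd), context.Canceled)
}

func main() {
	fmt.Println("canceled:", demoCanceled())
}
```

Because Cancel returns nil here, os/exec reports the context error from Wait even when the script exits with status 0 after handling SIGTERM, which is why the demo checks for context.Canceled.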
New Proc type (pkg/runner/proc.go)
A small struct holding a cancel channel and GracefulExitDelay. It is the coordination point between the job lifecycle and the executing step: Cancel() closes the channel, and a sync.Once makes it safe to call multiple times. The Proc is threaded through GlobalContext → StepsContext → Exec.
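For reviewers who want a mental model, a type with this shape might look like the sketch below. Only Cancel, CancelC and GracefulExitDelay come from the description above; NewProc, isCanceled and the unexported field names are assumptions made for illustration, and the real pkg/runner/proc.go may differ.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Proc is a sketch of the coordination type described above.
type Proc struct {
	GracefulExitDelay time.Duration

	cancelC chan struct{} // closed when cancellation is requested
	once    sync.Once     // guards the close so Cancel is idempotent
}

// NewProc is a hypothetical constructor.
func NewProc(delay time.Duration) *Proc {
	return &Proc{GracefulExitDelay: delay, cancelC: make(chan struct{})}
}

// Cancel closes the cancel channel exactly once; later calls are no-ops.
func (p *Proc) Cancel() {
	p.once.Do(func() { close(p.cancelC) })
}

// CancelC returns a channel that is closed once Cancel has been called.
func (p *Proc) CancelC() <-chan struct{} {
	return p.cancelC
}

// isCanceled is a hypothetical non-blocking check used by the demo.
func isCanceled(p *Proc) bool {
	select {
	case <-p.CancelC():
		return true
	default:
		return false
	}
}

func main() {
	p := NewProc(5 * time.Second)
	fmt.Println(isCanceled(p)) // false
	p.Cancel()
	p.Cancel() // second call does not panic
	fmt.Println(isCanceled(p)) // true
}
```

Without the sync.Once, the second Cancel() would panic with "close of closed channel", which matters here because both Close and Job.Run may call it.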
Cancellation flow in Exec (function/exec.go)
The exec function now listens on proc.CancelC() in a goroutine. When cancel fires, it cancels the command's context, which triggers gracefulexitcmd's SIGTERM → wait → SIGKILL sequence. Any context error is merged with the command error via errors.Join.
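The flow above can be sketched with plain os/exec. This is an illustration under assumptions, not the code in function/exec.go: runStep and the demo helpers are hypothetical, SIGTERM is sent through Cmd.Cancel instead of the gracefulexitcmd wrapper, and the listener goroutine is released through a local done channel, whereas the MR relies on Job.Run calling proc.Cancel() on completion. It assumes a Unix system with sleep and true on the PATH.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// runStep is a hypothetical stand-in for Exec. It runs a command and cancels
// its context when cancelC is closed.
func runStep(cancelC <-chan struct{}, delay time.Duration, name string, args ...string) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Bridge the cancel channel to the context. The done channel lets the
	// goroutine exit when the command finishes without being canceled.
	done := make(chan struct{})
	defer close(done)
	go func() {
		select {
		case <-cancelC:
			cancel()
		case <-done:
		}
	}()

	cmd := exec.CommandContext(ctx, name, args...)
	cmd.Cancel = func() error { return cmd.Process.Signal(syscall.SIGTERM) }
	cmd.WaitDelay = delay // the direct child is killed if it outlives this delay

	runErr := cmd.Run()
	// errors.Join drops nil values, so the result is nil on a clean,
	// uncanceled run.
	return errors.Join(ctx.Err(), runErr)
}

// demoExecCancel cancels a long-running command shortly after it starts.
func demoExecCancel() bool {
	cancelC := make(chan struct{})
	time.AfterFunc(100*time.Millisecond, func() { close(cancelC) })
	err := runStep(cancelC, 2*time.Second, "sleep", "30")
	return errors.Is(err, context.Canceled)
}

// demoExecOK runs a command to completion without canceling it.
func demoExecOK() bool {
	return runStep(make(chan struct{}), time.Second, "true") == nil
}

func main() {
	fmt.Println("cancel surfaced:", demoExecCancel())
	fmt.Println("clean run is nil:", demoExecOK())
}
```

Joining the two errors keeps both facts visible to callers: errors.Is(err, context.Canceled) still matches, and the command's own exit error is not lost.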
New Cancel gRPC API
proto/step.proto gains a Cancel(CancelRequest) → CancelResponse RPC, distinct from Close. Cancel signals a graceful stop but keeps the job alive; Close calls Cancel first, then waits up to GracefulExitDelay + 500ms before marking the job canceled and cleaning up.
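The difference between the two calls can be sketched without any gRPC plumbing. Everything below is a simplified model, not the service code: the job type, newJob, closeBuffer and the demo helpers are hypothetical, and Close returns a bool only to make the two outcomes observable.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// closeBuffer models the 500ms added on top of GracefulExitDelay.
const closeBuffer = 500 * time.Millisecond

// job is a hypothetical stand-in for the service's job record.
type job struct {
	cancelC chan struct{} // closed by Cancel
	finishC chan struct{} // closed when the job has finished
	once    sync.Once
}

func newJob() *job {
	return &job{cancelC: make(chan struct{}), finishC: make(chan struct{})}
}

// Cancel requests a graceful stop and returns immediately; the job stays alive.
func (j *job) Cancel() {
	j.once.Do(func() { close(j.cancelC) })
}

// Close requests a graceful stop, then waits for the job to finish. It
// reports false if the deadline passed first, in which case the caller would
// mark the job canceled and clean up.
func (j *job) Close(gracefulExitDelay time.Duration) bool {
	j.Cancel()
	timer := time.NewTimer(gracefulExitDelay + closeBuffer)
	defer timer.Stop()
	select {
	case <-j.finishC:
		return true
	case <-timer.C:
		return false
	}
}

// demoCooperative models a step that cleans up quickly after cancel.
func demoCooperative() bool {
	j := newJob()
	go func() {
		<-j.cancelC
		time.Sleep(50 * time.Millisecond) // simulated cleanup
		close(j.finishC)
	}()
	return j.Close(time.Second)
}

// demoStuck models a step that never finishes; Close gives up after the buffer.
func demoStuck() bool {
	j := newJob()
	return j.Close(0)
}

func main() {
	fmt.Println("cooperative job finished in time:", demoCooperative())
	fmt.Println("stuck job finished in time:", demoStuck())
}
```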
Subtle points
- The finishC channel is now closed rather than sent to, so multiple waiters can use it. Job.Run also calls proc.Cancel() on completion. This is intentional: it guarantees that the goroutine in Exec listening on CancelC() unblocks when the job finishes normally, too.
- The gracefulExitDelay in the service has a TODO comment noting it should eventually come from RunRequest rather than being a service-level config.
- The Close timeout is GracefulExitDelay plus a 500ms buffer, which gives the step process time to catch SIGTERM and exit cleanly before the job is force-marked canceled.
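On the first point, the reason closing works for multiple waiters is that a receive on a closed channel never blocks, whereas a single send wakes exactly one receiver. A small standalone sketch (waitAll is a hypothetical helper, unrelated to the MR's code):

```go
package main

import (
	"fmt"
	"sync"
)

// waitAll starts n goroutines that block on the same channel, closes the
// channel once, and returns how many of them were released.
func waitAll(n int) int {
	finishC := make(chan struct{})

	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		woken int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-finishC // returns immediately once the channel is closed
			mu.Lock()
			woken++
			mu.Unlock()
		}()
	}

	close(finishC) // a single close releases every waiter
	wg.Wait()
	return woken
}

func main() {
	fmt.Println(waitAll(3)) // 3
}
```

With finishC <- struct{}{} in place of close, only one of the three goroutines would be released and wg.Wait() would deadlock.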