Design proposal: Graceful cancellation for GitLab Functions

## Summary

When a job is cancelled, functions should be notified and given time to clean up before being forcefully terminated. This replaces the current behavior where context cancellation immediately kills the process.

## User-facing syntax

Function authors declare a `cancel_timeout` under `exec:` in their `func.yml`:

```yaml
spec:
  inputs:
    name:
      type: string
---
exec:
  command: ["./my-server", "${{ inputs.name }}"]
  cancel_timeout: 10s
```

`cancel_timeout` is the preferred duration between the cancel signal (SIGTERM) and forced termination (SIGKILL). This is a request from the function author, not a guarantee. The client may provide a shorter timeout when calling Cancel, and the effective timeout will be the lesser of the two. Default is `5s` if omitted.

Only applies to `exec` functions. `run` and `step` types do not declare their own cancel timeout. Cancellation propagates to the currently executing leaf function.

## Cancellation flow

1. Client calls `Cancel(id, timeout)` RPC with their maximum grace period
2. Step-runner computes `effective_timeout = min(function.cancel_timeout, client.timeout)`
3. SIGTERM sent to the process group (CTRL_BREAK_EVENT on Windows)
4. Wait up to `effective_timeout` for the process to exit
5. If still running, SIGKILL sent to the entire process group (Job Object close on Windows)
6. Client calls `Close(id)` to clean up resources. If the process is still in its grace period, Close accelerates to immediate SIGKILL and cleanup

**Job lifecycle with cancellation:**

```
Run() → job running
        │
Cancel(id, timeout) arrives
        │
effective_timeout = min(func.cancel_timeout, client.timeout)
        │
SIGTERM → process group
        │
  ┌─────┴──────┐
  │  wait...   │
  │            │
process exits  timer expires
  │            │
  │       SIGKILL → process group
  │            │
  └─────┬──────┘
        │
Close(id) → cleanup resources
```

## Proto changes

New `Cancel` RPC:

```protobuf
rpc Cancel(CancelRequest) returns (CancelResponse);

message CancelRequest {
  string id = 1;
  google.protobuf.Duration timeout = 2;
}

message CancelResponse {}
```

The `timeout` on `CancelRequest` is a hard stop. It overrides the function author's `cancel_timeout` if it is shorter. The function will be forcefully terminated after this duration regardless of what the function requested.

New field on `Definition.Exec`:

```protobuf
message Exec {
  repeated string command = 1;
  string work_dir = 2;
  google.protobuf.Duration cancel_timeout = 3;
}
```

## Builtins

Builtins are compiled to exec functions under the hood. By executing builtins as subprocesses, the step-runner guarantees that all function types share the same cancellation semantics: SIGTERM for graceful shutdown, SIGKILL after the timeout expires, and process group cleanup for descendants.

The step-runner compiles builtins into `step-runner builtin [name]` invocations. Builtin authors declare their cancel policy using a Go type:

```go
type BuiltinDefinition struct {
	CancelTimeout time.Duration
}
```

Registration changes from `Register(name, spec, stepFunc)` to `Register(name, spec, definition, stepFunc)`.

The `StepFunc` signature gains a cancel channel so builtins can distinguish "please stop" (SIGTERM received) from "you are about to be killed" (context cancellation before SIGKILL):

```go
type StepFunc func(ctx context.Context, cancel <-chan struct{}, stepsCtx *StepsContext) error
```

The `step-runner builtin` subcommand catches SIGTERM and closes this channel.

## Schema changes

```go
type Exec struct {
	Command       []string `yaml:"command"`
	WorkDir       *string  `yaml:"work_dir,omitempty"`
	CancelTimeout *string  `yaml:"cancel_timeout,omitempty"`
}
```

## Design decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Where does `cancel_timeout` live? | Under `exec:` in func.yml | It is an exec property (SIGTERM to SIGKILL timing). Builtins compile to exec. |
| Default cancel_timeout | 5s | Enough for most cleanup without excessive wait |
| Client timeout delivery | Field on Cancel RPC request, not on Job/RunRequest | The client's real deadline is known at cancellation time, not at job start. Sending it with Cancel keeps the Job message focused on execution. |
| Who manages SIGTERM to SIGKILL escalation? | Step-runner internally | Platform-specific signal handling belongs in step-runner, not the caller |
| Who manages cleanup timing? | Caller | Caller controls when to call Close. Close accelerates kill if still in grace period. |
| Run/Step cancel_timeout? | Not supported | `run` delegates to sub-steps sequentially, so only one leaf is ever running and its cancel_timeout applies. `step` references a function whose own cancel_timeout applies. The client's Cancel timeout caps everything regardless. |
| Builtin cancel notification | `cancel <-chan struct{}` | Explicit channel mirrors SIGTERM semantics. Context cancellation is reserved for hard deadline. |

## Prior art

- Supersedes [#298](https://gitlab.com/gitlab-org/step-runner/-/work_items/298) which identified the need but proposed a narrower solution
- Kubernetes `terminationGracePeriodSeconds`, Docker `stop_grace_period`: same pattern of signal, grace period, force kill