Design proposal: Graceful cancellation for GitLab Functions
## Summary
When a job is cancelled, functions should be notified and given time to clean up before being forcefully terminated. This replaces the current behavior where context cancellation immediately kills the process.
## User-facing syntax
Function authors declare a `cancel_timeout` under `exec:` in their `func.yml`:
```yaml
spec:
  inputs:
    name:
      type: string
---
exec:
  command: ["./my-server", "${{ inputs.name }}"]
  cancel_timeout: 10s
```
`cancel_timeout` is the function author's preferred interval between the cancel signal (SIGTERM) and forced termination (SIGKILL). It is a request, not a guarantee: the client may pass a shorter timeout when calling `Cancel`, and the effective timeout is the lesser of the two. If omitted, it defaults to `5s`.

`cancel_timeout` applies only to `exec` functions. The `run` and `step` types do not declare their own cancel timeout; cancellation propagates to the currently executing leaf function, and that function's `cancel_timeout` applies.
## Cancellation flow
1. The client calls the `Cancel(id, timeout)` RPC with its maximum grace period.
2. Step-runner computes `effective_timeout = min(function.cancel_timeout, client.timeout)`.
3. Step-runner sends SIGTERM to the process group (CTRL_BREAK_EVENT on Windows).
4. Step-runner waits up to `effective_timeout` for the process to exit.
5. If the process is still running, step-runner sends SIGKILL to the entire process group (Job Object close on Windows).
6. The client calls `Close(id)` to clean up resources. If the process is still in its grace period, `Close` escalates to an immediate SIGKILL and then cleans up.
**Job lifecycle with cancellation:**
```
Run() → job running
         │
Cancel(id, timeout) arrives
         │
effective_timeout = min(func.cancel_timeout, client.timeout)
         │
SIGTERM → process group
         │
   ┌─────┴──────┐
   │  wait...   │
   │            │
process exits   timer expires
   │            │
   │      SIGKILL → process group
   │            │
   └─────┬──────┘
         │
Close(id) → cleanup resources
```
## Proto changes
New `Cancel` RPC:
```protobuf
rpc Cancel(CancelRequest) returns (CancelResponse);

message CancelRequest {
  string id = 1;
  // Client's maximum grace period; caps the function's cancel_timeout.
  google.protobuf.Duration timeout = 2;
}

message CancelResponse {}
```
The `timeout` on `CancelRequest` is a hard stop. It overrides the function author's `cancel_timeout` if it is shorter. The function will be forcefully terminated after this duration regardless of what the function requested.
New field on `Definition.Exec`:
```protobuf
message Exec {
  repeated string command = 1;
  string work_dir = 2;
  // Preferred interval between SIGTERM and SIGKILL.
  google.protobuf.Duration cancel_timeout = 3;
}
```
## Builtins
Builtins are compiled to exec functions under the hood. By executing builtins as subprocesses, the step-runner guarantees that all function types share the same cancellation semantics: SIGTERM for graceful shutdown, SIGKILL after the timeout expires, and process group cleanup for descendants.
The step-runner compiles builtins into `step-runner builtin [name]` invocations. Builtin authors declare their cancel policy using a Go type:
```go
type BuiltinDefinition struct {
	// CancelTimeout is the preferred interval between SIGTERM and SIGKILL.
	CancelTimeout time.Duration
}
```
Registration changes from `Register(name, spec, stepFunc)` to `Register(name, spec, definition, stepFunc)`.
The `StepFunc` signature gains a cancel channel so builtins can distinguish "please stop" (SIGTERM received) from "you are about to be killed" (context cancellation before SIGKILL):
```go
type StepFunc func(ctx context.Context, cancel <-chan struct{}, stepsCtx *StepsContext) error
```
The `step-runner builtin` subcommand catches SIGTERM and closes this channel.
## Schema changes
```go
type Exec struct {
	Command       []string `yaml:"command"`
	WorkDir       *string  `yaml:"work_dir,omitempty"`
	CancelTimeout *string  `yaml:"cancel_timeout,omitempty"`
}
```
## Design decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Where does `cancel_timeout` live? | Under `exec:` in `func.yml` | It is an exec property (SIGTERM to SIGKILL timing). Builtins compile to exec. |
| Default `cancel_timeout` | `5s` | Enough for most cleanup without excessive wait |
| Client timeout delivery | Field on Cancel RPC request, not on Job/RunRequest | The client's real deadline is known at cancellation time, not at job start. Sending it with Cancel keeps the Job message focused on execution. |
| Who manages SIGTERM to SIGKILL escalation? | Step-runner internally | Platform-specific signal handling belongs in step-runner, not the caller |
| Who manages cleanup timing? | Caller | Caller controls when to call Close. Close accelerates kill if still in grace period. |
| `run`/`step` `cancel_timeout`? | Not supported | `run` delegates to sub-steps sequentially, so only one leaf is ever running and its `cancel_timeout` applies. `step` references a function whose own `cancel_timeout` applies. The client's `Cancel` timeout caps everything regardless. |
| Builtin cancel notification | `cancel <-chan struct{}` | Explicit channel mirrors SIGTERM semantics. Context cancellation is reserved for hard deadline. |
## Prior art
- Supersedes [#298](https://gitlab.com/gitlab-org/step-runner/-/work_items/298) which identified the need but proposed a narrower solution
- Kubernetes `terminationGracePeriodSeconds`, Docker `stop_grace_period`: same pattern of signal, grace period, force kill