Proposal: Treat job timeout as a "script timeout"/soft timeout

Problem

GitLab-Runner's job timeout is specified on the GitLab side when creating a new Runner token. It is used as the timeout for the whole job and all of its stages (from preparing the executor to uploading the artifacts).

A job can override the timeout with a value lower than the Runner job timeout, and that value is used in exactly the same way.

The problem is that some jobs (fuzz testing, for example) want to run for the whole duration, leaving no time for artifact uploading.

Proposal

  • Treat the Runner job timeout as a hard timeout: no matter which stage the job is in, everything is stopped once this is exceeded.
  • Treat the job override timeout as a soft timeout: when exceeded, the user scripts are terminated, but any remaining time (Runner timeout - job timeout) is allocated to artifact uploading.

Maybe we can also reserve at least some timeout for artifacts, so that uploads always get a chance to run? Say, 1 minute? We could do this by ensuring that everything before artifact uploading has a timeout of job timeout - 1 minute. This ensures that the Runner timeout is still a hard timeout. A sketch of the idea follows.
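A minimal Go sketch of the hard/soft split, assuming hypothetical names (runJob, artifactReserve) rather than GitLab-Runner's actual internals; the reserve is scaled down from 1 minute so the demo finishes quickly:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Time held back so artifact uploading always gets a chance to run.
// 1 minute in the proposal; scaled down here for a quick demo.
const artifactReserve = 1 * time.Second

func runJob(runnerTimeout, jobTimeout time.Duration) {
	// Hard timeout: bounds the entire job, every stage included.
	hardCtx, cancelHard := context.WithTimeout(context.Background(), runnerTimeout)
	defer cancelHard()

	// Soft timeout: user scripts stop artifactReserve early, leaving the
	// rest of the window under the hard deadline for artifact uploading.
	softCtx, cancelSoft := context.WithTimeout(hardCtx, jobTimeout-artifactReserve)
	defer cancelSoft()

	runScripts(softCtx)      // terminated at the soft deadline
	uploadArtifacts(hardCtx) // still bounded by the hard deadline
}

func runScripts(ctx context.Context) {
	<-ctx.Done() // stand-in for a user script that runs until stopped
	fmt.Println("scripts stopped:", ctx.Err())
}

func uploadArtifacts(ctx context.Context) {
	select {
	case <-time.After(500 * time.Millisecond): // stand-in for the upload
		fmt.Println("artifacts uploaded")
	case <-ctx.Done():
		fmt.Println("upload cut off:", ctx.Err())
	}
}

func main() {
	runJob(5*time.Second, 3*time.Second)
}
```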

Implementation

First phase

  • Update GitLab-Runner to use a separate timeout context for everything before artifact uploading. This would be job timeout - 1 minute (see the sketch below).
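One thing the subtraction needs to handle: if the job timeout is itself 1 minute or less, job timeout - 1 minute goes non-positive. A sketch of how the derivation might guard against that (scriptTimeout and artifactReserve are hypothetical names):

```go
package timeouts

import "time"

const artifactReserve = 1 * time.Minute

// scriptTimeout derives the pre-artifact (soft) timeout from the single
// timeout field GitLab sends today. Jobs with a timeout of one minute or
// less keep their full timeout, since the subtraction would go non-positive.
func scriptTimeout(jobTimeout time.Duration) time.Duration {
	if soft := jobTimeout - artifactReserve; soft > 0 {
		return soft
	}
	return jobTimeout
}
```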

Second phase

  • Update GitLab to send both the Runner timeout and the job timeout: only one field is sent at the moment.
  • Update GitLab-Runner's separate pre-artifact timeout context to use the job timeout value directly (see the sketch below).
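A sketch of what the two-field payload might look like; the field names are assumptions, since only a single timeout field exists today:

```go
package timeouts

// jobTimeouts sketches a second-phase payload in which GitLab sends both
// values. Field names are assumptions, not the actual job response schema.
type jobTimeouts struct {
	RunnerTimeoutSeconds int `json:"runner_timeout"` // hard limit: the whole job
	JobTimeoutSeconds    int `json:"job_timeout"`    // soft limit: user scripts only
}
```

With both values available, the pre-artifact context can use the job timeout directly, and the artifact window becomes Runner timeout - job timeout rather than a fixed 1 minute.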

Compatibility

There are two breaking changes:

  • The timeout will now always allow on_failure artifacts to be uploaded.
  • User scripts will run for 1 minute less than they used to, to allow time for artifacts.

How do we handle this?

  • Not worry about it? (It's a 1-minute difference, so probably small for the majority of jobs.)
  • Behind a feature flag until 17.0?
  • Job variable to opt-in?
  • Runner config to opt-in?