[ci] job timeout should only take `script` section into account.

Summary

Each job has a timeout (set either per runner or per project), presumably to keep stuck and runaway jobs under control.

There are a few time-consuming stages that the user can influence, namely:

  • before_script
  • script
  • after_script

However, there are some more steps involved that add up to the total time, e.g.:

  • docker image fetch
  • pre_clone_script of the runner.

It would be great if the timeout (at least the per-project timeout) only took the user-accountable stages into account.

Steps to reproduce

  • set up a CI runner myrunner
    • configure the runner with a pre_clone_script: sleep 1000
  • configure project FOO to use CI/CD
    • configure the CI-timeout to be 10 minutes
    • configure a job to be run on myrunner
  • trigger a pipeline for project FOO to be executed on runner myrunner

What is the current bug behavior?

The job always fails with a timeout, because the pre_clone_script alone (1000 seconds) already exceeds the 10-minute limit, regardless of the actual time spent in the (user-controlled) .gitlab-ci.yml.
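
To make the arithmetic of the repro explicit, here is a minimal Python sketch of a single shared time budget, as this report describes the current behavior. It is only a model, not GitLab Runner code: the function name and the 30-second script duration are invented for illustration; only the 1000-second sleep and the 10-minute timeout come from the steps above.

```python
# Hypothetical model of the behavior described above; not GitLab Runner code.
JOB_TIMEOUT_S = 10 * 60  # the 10-minute CI timeout from the repro steps

# (phase name, duration in seconds). Only the 1000 s sleep comes from the
# report; the script duration is an invented placeholder.
PHASES = [
    ("pre_clone_script", 1000),
    ("script", 30),
]


def first_phase_over_budget(phases, timeout_s):
    """Return the phase during which one shared budget runs out, or None."""
    elapsed = 0
    for name, duration in phases:
        elapsed += duration
        if elapsed > timeout_s:
            return name
    return None


# The budget is exhausted before any user-defined script gets to run.
print(first_phase_over_budget(PHASES, JOB_TIMEOUT_S))
```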

What is the expected correct behavior?

I would expect the timeout to take only the actual user-defined parts into account (before_script, script, after_script, ...). Even if the pre_clone_script or fetching a (largish) Docker image takes a long time, that should not take time away from the actual build process.

Possible fixes

Have the per-project timeout take only those steps into account that can be influenced by the user. To catch stalled pre_clone_script runs or similar, there could be an additional (per-runner) timeout that applies only to the steps outside the user's control.
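
A rough Python sketch of that split, under the assumption that each phase can be classified as user-controlled or not. The names `USER_PHASES` and `check_budgets` and the example durations are hypothetical and only illustrate the idea; this is not how GitLab Runner is implemented.

```python
# Hypothetical sketch of the proposed split; all names are illustrative only.
USER_PHASES = {"before_script", "script", "after_script"}


def check_budgets(phases, project_timeout_s, runner_timeout_s):
    """Charge user phases to the project budget, all others to the runner budget.

    Returns None if both budgets hold, otherwise the name of the budget
    that was exceeded first ("project" or "runner").
    """
    user_elapsed = 0
    infra_elapsed = 0
    for name, duration in phases:
        if name in USER_PHASES:
            user_elapsed += duration
            if user_elapsed > project_timeout_s:
                return "project"
        else:
            infra_elapsed += duration
            if infra_elapsed > runner_timeout_s:
                return "runner"
    return None


# The repro case: a slow pre_clone_script no longer eats the project budget.
phases = [("pre_clone_script", 1000), ("script", 30)]
print(check_budgets(phases, project_timeout_s=600, runner_timeout_s=1800))
```

With a tighter per-runner limit (say 900 seconds) the same job would still be stopped, but by the runner-side budget rather than the project's.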

/label ~"CI/CD"

Edited by umlaeute