Shell executor doesn't clean up build directories

Release notes

Previously, builds from your CI/CD pipeline runs could leave artifacts in the build directory, reducing the free space available on the disk. As of 14.3, when the new FF_ENABLE_JOB_CLEANUP feature flag is enabled (it is off by default), a step runs at the end of the build to remove files in the build directory.

Problem

I have noticed that our Windows builders have started to fill their disks after only a day or two of building. Almost all of the space used is in \GitLabRunner\builds. It appears that the shell executor (which as far as I can tell is the only supported executor on Windows until #2609 (closed) or !706 (closed) is resolved) is not cleaning up build directories after they have finished.

The behavior is the same for the shell executor on all platforms.

This problem is caused by builds where IsSharedEnv() is true, as well as builds run by e.g. the Docker executor when the /builds directory is host-mounted persistently.

Proposal

Introduce a new FF_ENABLE_JOB_CLEANUP feature flag (disabled by default);
Consolidate a new cleanup job stage with the existing cleanup_file_variables job stage. It is important for this phase to happen at the end of the job, when artifacts/cache have been uploaded, so that we can clean those up without interfering with the logic.

Reuse the cleanup that is done at job startup and apply it to the project directory at the end of the job (subject to the FF_ENABLE_JOB_CLEANUP feature flag), while ensuring that no breaking behavior is introduced. The cleanup should depend on GIT_STRATEGY and should take sub-modules into consideration:

| `GIT_STRATEGY` value | Cleanup action |
| --- | --- |
| `fetch` | `git clean ${GIT_CLEAN_FLAGS} && git reset --hard`, also for sub-modules depending on `GIT_SUBMODULE_STRATEGY` |
| `clone` | `rm -rf ${CI_PROJECT_DIR}/` |
| `none` | do nothing |
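
The mapping in the table can be read as a small dispatch on GIT_STRATEGY. The following is a minimal shell sketch of that idea, not the runner's actual implementation. The function name `job_cleanup` is hypothetical, and the sketch makes three assumptions: an unset GIT_STRATEGY is treated as `none`, GIT_CLEAN_FLAGS falls back to `-ffdx`, and sub-modules are handled with `git submodule foreach`.

```shell
#!/bin/sh
# Hypothetical helper (not part of GitLab Runner): end-of-job cleanup.
# Usage: job_cleanup <project_dir>
job_cleanup() {
  project_dir="$1"
  [ -n "$project_dir" ] || return 1      # refuse to run on an empty path
  [ -d "$project_dir" ] || return 0      # nothing to clean

  # Assumption: GIT_CLEAN_FLAGS falls back to -ffdx when unset.
  clean_flags="${GIT_CLEAN_FLAGS:--ffdx}"

  # Assumption: an unset GIT_STRATEGY is treated like "none".
  case "${GIT_STRATEGY:-none}" in
    fetch)
      # Keep the checkout, drop untracked files, restore tracked ones.
      # $clean_flags is left unquoted so several flags split into words.
      ( cd "$project_dir" && git clean $clean_flags && git reset --hard ) || return 1
      # Assumption: sub-modules are cleaned with `git submodule foreach`.
      case "${GIT_SUBMODULE_STRATEGY:-none}" in
        normal)
          ( cd "$project_dir" &&
            git submodule foreach "git clean $clean_flags && git reset --hard" ) ;;
        recursive)
          ( cd "$project_dir" &&
            git submodule foreach --recursive "git clean $clean_flags && git reset --hard" ) ;;
      esac
      ;;
    clone)
      # The next job clones from scratch, so the whole directory can go.
      # ${var:?} aborts instead of expanding to "/" if the variable is empty.
      rm -rf "${project_dir:?}/"
      ;;
    none|*)
      # Leave the directory untouched.
      ;;
  esac
}
```

For example, `GIT_STRATEGY=clone job_cleanup "$CI_PROJECT_DIR"` would remove the project directory, while `GIT_STRATEGY=none job_cleanup "$CI_PROJECT_DIR"` would leave it as it is.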

In short, the behavior depending on the FF would be the following:

FF disabled: Clean up on start.
FF enabled: Clean up on start and finish.
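
As an illustration only, the two cases above could be expressed as a tiny shell helper. The name `cleanup_phases` is hypothetical, and treating only the literal value `true` as "enabled" is an assumption of this sketch.

```shell
#!/bin/sh
# Hypothetical helper: print the job phases in which cleanup would run.
cleanup_phases() {
  case "${FF_ENABLE_JOB_CLEANUP:-false}" in
    true) echo "start finish" ;;   # FF enabled: clean up on start and finish
    *)    echo "start" ;;          # FF disabled (default): clean up on start only
  esac
}
```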

To evaluate the implementation, consider the following scenario of a shell executor running inside an alpine:latest Docker container:

.gitlab-ci.yml

job:
  script:
    - truncate -s 2G artifact.bin
    - df -k -h ${CI_BUILDS_DIR}
  artifacts:
    paths:
      - artifact.bin

config.toml

concurrent = 1
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "shell runner"
  url = "https://gitlab.com/"
  executor = "shell"
  shell = "bash"
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]

Running this job multiple times should not result in increasing values reported in the Used column of the df tool output. The same should be true with concurrent > 1 once we've saturated the number of allowed concurrent builds (meaning that a directory has been created for each of the possible concurrent jobs under $CI_BUILDS_DIR/$CI_RUNNER_SHORT_TOKEN).
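
One way to check this by hand is to record the Used value after an early run and compare it with the value after later runs. The sketch below assumes a POSIX `df` (the `-P` option keeps each filesystem on one line). The helper names `used_kb` and `usage_is_stable` and the optional tolerance argument are hypothetical and only serve this illustration.

```shell
#!/bin/sh
# used_kb DIR: print the "Used" value, in KiB, of the filesystem holding DIR.
used_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $3 }'
}

# usage_is_stable BASELINE_KB CURRENT_KB [TOLERANCE_KB]
# Succeeds if CURRENT_KB does not exceed BASELINE_KB by more than TOLERANCE_KB
# (default 0). The tolerance allows for unrelated disk activity.
usage_is_stable() {
  [ "$2" -le $(( $1 + ${3:-0} )) ]
}
```

A possible use: run `baseline=$(used_kb "$CI_BUILDS_DIR")` after the first job, then after each later job run `usage_is_stable "$baseline" "$(used_kb "$CI_BUILDS_DIR")" 1024 || echo "disk usage grew"`.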

Edited Sep 15, 2021 by Darren Eastman