
Race condition on deletion of `gitlab_runner_env` file if all jobs share the same `GIT_CLONE_PATH`

Summary

When parallel jobs share the same GIT_CLONE_PATH, individual jobs intermittently fail with one of these errors:

Getting source from Git repository
/bin/bash: line 186: <path>/gitlab_runner_env: No such file or directory

or

Running on <runner>...
rm: can't remove '<path>/gitlab_runner_env': No such file or directory

This appears to be because, starting with GitLab 17.7.0, the gitlab_runner_env file is deleted at the start/end of each job, a change introduced as part of this MR.

A race condition can therefore occur: job1 deletes gitlab_runner_env, then job2 attempts to read or delete the same file and fails because it no longer exists.
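The sequence above can be reproduced in isolation. The following is only an illustrative sketch, not the Runner's actual implementation: the function names `cleanup_strict`, `cleanup_tolerant`, and `simulate` are hypothetical, and a temporary directory stands in for the shared GIT_CLONE_PATH. The two "jobs" run one after the other so that the losing order of the race is deterministic.

```python
import tempfile
from pathlib import Path


def cleanup_strict(env_file: Path) -> None:
    """Delete the file; raises FileNotFoundError if it is already gone (like plain `rm`)."""
    env_file.unlink()


def cleanup_tolerant(env_file: Path) -> None:
    """Delete the file if present; a missing file is not an error (like `rm -f`)."""
    env_file.unlink(missing_ok=True)


def simulate(cleanup) -> list[str]:
    """Run two hypothetical jobs that clean up the same env file in a shared directory."""
    results = []
    # Assumption: a temporary directory stands in for the shared GIT_CLONE_PATH.
    with tempfile.TemporaryDirectory() as shared_clone_path:
        env_file = Path(shared_clone_path) / "gitlab_runner_env"
        env_file.write_text("EXAMPLE=1\n")
        for job in ("job1", "job2"):
            try:
                cleanup(env_file)
                results.append(f"{job}: ok")
            except FileNotFoundError:
                results.append(f"{job}: failed, file already deleted")
    return results


strict_results = simulate(cleanup_strict)
tolerant_results = simulate(cleanup_tolerant)
print(strict_results)
print(tolerant_results)
```

With strict deletion, the second job fails exactly as in the `rm: can't remove` error. Tolerating a missing file avoids that failure, but it is only one conceivable mitigation and covers the delete side alone; a job that tries to read the file after another job has removed it would still fail.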

Context

The customer explained why they use a persistent GIT_CLONE_PATH on an NFS share common to all jobs:

We use a shared NFS path for the entire pipeline for a combination of reasons:

  • clones for this repo, even shallow ones, are 2+GB
  • the build stage of this pipeline generates another 2+GB of output that needs to be used by downstream jobs
  • the tests for this pipeline generate another 10+GB of output that we often need to inspect after jobs complete, especially if they fail

They also shared:

  • We set all jobs to GIT_STRATEGY: none except our initial bootstrap job which sets GIT_STRATEGY: fetch

Actual behavior

Race condition where some jobs will fail because the gitlab_runner_env file could not be read/deleted.

Expected behavior

Parallel jobs should not fail when reading or deleting gitlab_runner_env.

Used GitLab Runner version

GitLab 17.7.0.

If they revert to GitLab 17.6.0, this behaviour is no longer observed.

Possible fixes