Race condition on deletion of gitlab_runner_env file - Still Occurring

Summary

In Race condition on deletion of `gitlab_runner_en... (#38447 - closed), two MRs were merged and were believed to have fixed the problem. Two customers have since reported that the problem is NOT fixed, so I am opening this new issue. All of the information from that original issue is included below. I will post copies of the messages from those two customers as comments on this issue.



In parallel jobs that share a GIT_CLONE_PATH, random jobs may occasionally fail with these errors:

Getting source from Git repository
/bin/bash: line 186: <path>/gitlab_runner_env: No such file or directory

or

Running on <runner>...
rm: can't remove '<path>/gitlab_runner_env': No such file or directory

This appears to be because, starting in GitLab 17.7.0, we delete the gitlab_runner_env file at the start/end of jobs as part of this MR.

So a race condition can occur: job1 deletes the gitlab_runner_env file, and job2 then attempts to read or delete the same file but fails because job1 has already removed it.
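
To make the failing sequence concrete, here is a minimal Python sketch of the interleaving described above. It is not GitLab Runner code: `strict_cleanup` and `simulate_race` are hypothetical names, the temporary directory only stands in for the shared GIT_CLONE_PATH, and the two jobs run one after the other so the outcome is deterministic rather than timing-dependent.

```python
import os
import tempfile


def strict_cleanup(env_path):
    """Delete the env file the way a plain `rm` would: fail if it is missing."""
    os.remove(env_path)  # raises FileNotFoundError when the file is already gone


def simulate_race():
    # Hypothetical stand-in for the directory shared via GIT_CLONE_PATH.
    shared_dir = tempfile.mkdtemp()
    env_path = os.path.join(shared_dir, "gitlab_runner_env")
    with open(env_path, "w") as f:
        f.write("EXAMPLE_VAR=1\n")  # placeholder content, not real runner data

    outcomes = []
    # Both jobs point at the same path; job1 wins the deletion, job2 loses.
    for job in ("job1", "job2"):
        try:
            strict_cleanup(env_path)
            outcomes.append((job, "ok"))
        except FileNotFoundError:
            outcomes.append((job, "No such file or directory"))

    os.rmdir(shared_dir)
    return outcomes


if __name__ == "__main__":
    for job, result in simulate_race():
        print(job, result)
```

In a real pipeline the order is decided by timing, which would explain why only random jobs fail and only occasionally.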

Context

The customer has explained why their workflow uses a persistent GIT_CLONE_PATH on NFS that is shared between all jobs:

We use a shared NFS path for the entire pipeline for a combination of reasons:

  • clones for this repo, even shallow ones, are 2+GB
  • the build stage of this pipeline generates another 2+GB of output that needs to be used by downstream jobs
  • the tests for this pipeline generate another 10+GB of output that we often need to inspect after jobs complete, especially if they fail

They also shared:

  • We set all jobs to GIT_STRATEGY: none except our initial bootstrap job which sets GIT_STRATEGY: fetch

Actual behavior

A race condition causes some jobs to fail because the gitlab_runner_env file cannot be read or deleted.

Expected behavior

Jobs running in parallel should not fail when reading or deleting gitlab_runner_env.
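
As an illustration of that expectation, the sketch below shows file handling that tolerates a file another job has already removed, similar in spirit to `rm -f`. This is an assumption about what "should not fail" could look like, not the runner's actual implementation or a confirmed fix. `tolerant_delete` and `tolerant_read` are hypothetical helpers, and treating a missing file as empty content is a design choice that may not suit every case.

```python
import tempfile
from pathlib import Path


def tolerant_delete(env_path: Path) -> None:
    """Remove the file if present; do nothing if another job already removed it."""
    env_path.unlink(missing_ok=True)  # missing_ok requires Python 3.8+


def tolerant_read(env_path: Path) -> str:
    """Return the file content, or an empty string if the file is already gone."""
    try:
        return env_path.read_text()
    except FileNotFoundError:
        # Assumption: a missing env file is treated as "no variables".
        return ""


if __name__ == "__main__":
    # Hypothetical stand-in for the directory shared via GIT_CLONE_PATH.
    with tempfile.TemporaryDirectory() as shared_dir:
        env_file = Path(shared_dir) / "gitlab_runner_env"
        env_file.write_text("EXAMPLE_VAR=1\n")
        tolerant_delete(env_file)  # job1 removes the file
        tolerant_delete(env_file)  # job2 finds nothing and does not fail
        print(repr(tolerant_read(env_file)))
```

Note that this only removes the error. A job that reads after another job has deleted the file still sees no content, so the shared path itself remains a point of contention.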

Used GitLab Runner version

GitLab 17.7.0.

If they revert to GitLab 17.6.0, this behavior is no longer observed.

Possible fixes