Corrupted lock files if GIT_SUBMODULE_STRATEGY is used
Summary
We run a pipeline that builds the Linux kernel (i.e. a fairly large repository). To do so, we include Linux as a submodule of our main repository. However, we see frequent, seemingly random failures before the (step_)script stage is executed.
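For reference, the relevant part of our .gitlab-ci.yml looks roughly like the sketch below (the job name and build commands are illustrative; the submodule strategy and depth settings correspond to what the job logs show):

variables:
  GIT_SUBMODULE_STRATEGY: recursive   # runner updates submodules before the job script runs
  GIT_DEPTH: "1"                      # shallow fetch; some jobs use the default depth of 50

build-kernel:
  stage: build
  script:
    - make -C linux defconfig         # illustrative: the actual build happens inside the submodule
    - make -C linux -j"$(nproc)"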
For example:
Preparing environment
00:01
Running on runner-3uiywcvu-project-21809392-concurrent-2 via 16e3eac82f0b...
Getting source from Git repository
00:02
Fetching changes with git depth set to 1...
Reinitialized existing Git repository in /builds/.../kernel/.git/
Checking out 76b7eddc as v1.5.2...
Updating/initializing submodules recursively...
Synchronizing submodule url for 'linux'
Entering 'linux'
fatal: Unable to create '/builds/.../kernel/.git/modules/linux/index.lock': File exists.
Another git process seems to be running in this repository, e.g.
an editor opened by 'git commit'. Please make sure all processes
are terminated then try again. If it still fails, a git process
may have crashed in this repository earlier:
remove the file manually to continue.
fatal: run_command returned non-zero status for linux
.
Cleaning up file based variables
00:01
ERROR: Job failed: exit code 1
or
Preparing environment
00:03
Running on runner-3uiywcvu-project-21809392-concurrent-1 via 16e3eac82f0b...
Getting source from Git repository
00:06
Fetching changes with git depth set to 50...
Reinitialized existing Git repository in /builds/.../kernel/.git/
Checking out d3ba28d2 as v1.4.0-rc1...
Updating/initializing submodules recursively...
Synchronizing submodule url for 'linux'
Entering 'linux'
fatal: Needed a single revision
Unable to find current revision in submodule path 'linux'
Cleaning up file based variables
Both are quite strange and hard to explain. The latter error happens "more often". The only thing I can imagine is that some cache invalidation is going wrong, in that parallel jobs use the same caches. We do run a few jobs in parallel: for example, we run a linter in parallel with the compilation, or run 3 'deploy' jobs after the build is complete (deploy a Docker container, deploy artifacts, and a 'git push to remote repo' deploy). In the example above, however, this only applies to the first of the two errors; the 'Needed a single revision' failure happened with 2 jobs running in sequence (i.e. only one at a time).
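To make the job layout concrete, here is a rough sketch of the pipeline structure (stage and job names are made up; only the structure matters):

stages:
  - build
  - deploy

lint:                # runs in parallel with the kernel build
  stage: build
  script:
    - ./scripts/lint.sh

build-kernel:
  stage: build
  script:
    - make -C linux -j"$(nproc)"

deploy-container:    # the three deploy jobs run in parallel once the build stage is done
  stage: deploy
  script:
    - ./scripts/deploy-container.sh

deploy-artifacts:
  stage: deploy
  script:
    - ./scripts/deploy-artifacts.sh

deploy-git-push:
  stage: deploy
  script:
    - ./scripts/deploy-git-push.sh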
Steps to reproduce
This is of course a bit harder to describe. I can work on a demo repo, but without setting up 'everything' in a public space, I wonder how reproducible it is...
This smells a little like re-using caches that are not yet 'free': either caches that are shared between jobs, or caches that are not properly cleaned up?
Used GitLab Runner version
Running with gitlab-runner 13.5.0 (ece86343)
...
Preparing the "docker" executor
00:08
Using Docker executor with image d3ba28d2:212956109 ...