Concurrent jobs on a single Runner sometimes run in the same CI_PROJECT_DIR
### Summary

When the global setting is `concurrent=4` (probably any value greater than 1), concurrent jobs on a single Runner _sometimes_ clobber each other by running at the same time in the same build directory (same `CI_PROJECT_DIR`). The exact nature of the clobbering depends on timing, but typically Job1 in the `test` stage is extracting the cache from the `build` stage while Job2 in the `test` stage is running `git clean`, which promptly deletes the Job1 cache directory.

This problem does not occur with every build. It is entirely timing-related and quite random: we may not see it for days and then see it in multiple builds in a row.

The job that does the clobbering and the job that gets clobbered can belong to the same pipeline or to different pipelines. We've seen both. The real-life example shown in the "logs" section occurred when two pipelines from two different branches were concurrently active. However, the job that was clobbered was clobbered by another job in the same pipeline. So in that particular case two concurrently active pipelines do not appear to have been required, but the likelihood of the problem definitely seems higher when pipelines from multiple different branches are concurrently active.

### Steps to reproduce

The associated `.gitlab-ci.yml` is attached to this ticket, but here is an overview: [.gitlab-ci.yml](/uploads/6b5c74b6a9e47371de45157c003cbca3/.gitlab-ci.yml)

1. Set the global `concurrent` setting to `4`
1. Enable a single Runner
1. Configure a pipeline with three stages: build, test, deploy
1. Configure keyed caches
1. Make concurrent commits to the repo on multiple (i.e. 2 or 3) different branches. This isn't technically necessary, but for some reason the problem is more likely to occur when pipelines are concurrently active for multiple different branches.

### What is the current *bug* behavior?
Two jobs occasionally (not always) run concurrently in the same directory (`CI_PROJECT_DIR`), so one job interferes with the other and causes it to fail.

### What is the expected *correct* behavior?

We expect **concurrent** jobs to run in separate directories (separate `CI_PROJECT_DIR`) or, put another way, only one job to run in a given project directory at any one time.

### Relevant logs and/or screenshots

Initial console output for ABC project running `unit_test_debug` job (job 5384 in pipeline 773 for commit 77c48776 from feature/15621_BlahBlah):

```
Running with gitlab-ci-multi-runner 9.2.0 (adfc387)
  on KILIMANJARO Test Runner (d450762c)
Using Shell executor...
Running on kilimanjaro...
Fetching changes...
Removing Coverage-feature-15621-blahblah/
HEAD is now at 77c4877 cleaning up RoundRobinTest and unit test.
Checking out 77c48776 as feature/15621_blahblah...
Updating/initializing submodules...
Checking cache for feature-15621-blahblah/DEBUG...
Successfully extracted cache
$ cd $ABC_BUILD_DIR_REL/
$ ctest
Test project /home/gitlab-runner/builds/d450762c/0/abc/abc/Debug-feature-15621-blahblah
      Start  1: ABC_IosConfiguration
 1/37 Test  #1: ABC_IosConfiguration ..............   Passed    0.00 sec
```

Initial console output for ABC project running `unit_test_coverage` job (job 5386 in pipeline 773 for commit 77c48776 from feature/15621_BlahBlah):

```
Running with gitlab-ci-multi-runner 9.2.0 (adfc387)
  on KILIMANJARO Test Runner (d450762c)
Using Shell executor...
Running on kilimanjaro...
Fetching changes...
Removing Release-feature-15621-blahblah/
HEAD is now at 77c4877 cleaning up RoundRobinTest and unit test.
Checking out 77c48776 as feature/15621_BlahBlah...
Updating/initializing submodules...
Checking cache for feature-15621-blahblah/COVERAGE...
Successfully extracted cache
$ cd $ABC_BUILD_DIR_REL
$ make coverage
CMake Error: Target DependInfo.cmake file not found
Scanning dependencies of target coverage
CMake Error: Directory Information file not found
[100%] Resetting code coverage counters to zero.
Processing code coverage counters and generating report.
Deleting all .da files in . and subdirectories
Done.
Test project /home/gitlab-runner/builds/d450762c/0/abc/abc/Coverage-feature-15621-blahblah
No tests were found!!!
.
<snip other project-specific error messages>
.
Uploading artifacts...
WARNING: ./Coverage-feature-15621-blahblah/Testing/Temporary/LastTest.log: no matching files
Uploading artifacts to coordinator... ok            id=5386 responseStatus=201 Created token=jHjxZzxa
ERROR: Job failed: exit status 1
```

Both of the above jobs were started at nearly the same time (we don't have exact timestamps). The sequence of events is something like this:

1. Coverage job 5386 starts first, running in directory `/home/gitlab-runner/builds/d450762c/0/abc/abc/`
1. Coverage job 5386 extracts the cache from the previous build stage. This creates the directory `/home/gitlab-runner/builds/d450762c/0/abc/abc/Coverage-feature-15621-blahblah`
1. Debug job 5384 starts up sometime _after_ job 5386 extracts its cache but _before_ job 5386 starts running tests
1. Debug job 5384 is also running in directory `/home/gitlab-runner/builds/d450762c/0/abc/abc/` (same directory as job 5386)
1. Debug job 5384 runs the standard `GIT_STRATEGY=fetch`, which includes `git clean`, so job 5384 removes job 5386's `Coverage-feature-15621-blahblah/` directory
1. Coverage job 5386 resumes and attempts to use the cache files it has just extracted to the `Coverage-feature-15621-blahblah/` directory, but it finds the directory is now empty, so it fails, and fails badly

Please don't suggest changing to `GIT_STRATEGY=none` for the test stage.
Even though that git strategy might be more appropriate for our 'test' and 'deploy' stages -- and it does at least help with the problem -- it doesn't *fix* the problem. It doesn't fix the problem because we also see interference *between* stages, for example, 'build' stage clobbers 'test' stage because we use `GIT_STRATEGY=fetch` for the build stage. We aren't willing to change the build stage git strategy because we want a clean slate when starting a build.

### Output of checks

See below output from our self-hosted gitlab deployment:

#### Results of GitLab environment info

<details>
<summary>Expand for output related to GitLab environment info</summary>
<pre>
root@gitlab:~# gitlab-rake gitlab:env:info

System information
System:          Ubuntu 16.04
Proxy:           no
Current User:    git
Using RVM:       no
Ruby Version:    2.3.3p222
Gem Version:     2.6.6
Bundler Version: 1.13.7
Rake Version:    10.5.0
Redis Version:   3.2.5
Git Version:     2.11.1
Sidekiq Version: 5.0.0

GitLab information
Version:         9.2.2-ee
Revision:        b004167
Directory:       /opt/gitlab/embedded/service/gitlab-rails
DB Adapter:      postgresql
DB Version:      9.6.1
URL:             https://gitlab.<company>.com
HTTP Clone URL:  https://gitlab.<company>.com/some-group/some-project.git
SSH Clone URL:   git@gitlab.<company>.com:some-group/some-project.git
Elasticsearch:   no
Geo:             no
Using LDAP:      no
Using Omniauth:  no

GitLab Shell
Version:         5.0.4
Repository storage paths:
- default:       /var/opt/gitlab/git-data/repositories
Hooks:           /opt/gitlab/embedded/service/gitlab-shell/hooks
Git:             /opt/gitlab/embedded/bin/git
</pre>
</details>

#### Results of GitLab application Check

<details>
<summary>Expand for output related to the GitLab application check</summary>
<pre>
root@gitlab:~# gitlab-rake gitlab:check SANITIZE=true

Checking GitLab Shell ...

GitLab Shell version >= 5.0.4 ? ... OK (5.0.4)
Repo base directory exists?
default... yes
Repo storage directories are symlinks?
default... no
Repo paths owned by git:root, or git:git?
default... yes
Repo paths access is drwxrws---?
default... yes
hooks directories in repos are links: ...
8/1 ... ok
10/2 ... ok
6/3 ... ok
9/4 ... ok
9/5 ... ok
9/6 ... ok
9/7 ... ok
9/8 ... ok
Running /opt/gitlab/embedded/service/gitlab-shell/bin/check
Check GitLab API access: OK
Access to /var/opt/gitlab/.ssh/authorized_keys: OK
Send ping to redis server: OK
gitlab-shell self-check successful

Checking GitLab Shell ... Finished

Checking Sidekiq ...

Running? ... yes
Number of Sidekiq processes ... 1

Checking Sidekiq ... Finished

Checking Reply by email ...

Reply by email is disabled in config/gitlab.yml

Checking Reply by email ... Finished

Checking LDAP ...

LDAP is disabled in config/gitlab.yml

Checking LDAP ... Finished

Checking GitLab ...

Git configured with autocrlf=input? ... yes
Database config exists? ... yes
All migrations up? ... yes
Database contains orphaned GroupMembers? ... no
GitLab config exists? ... yes
GitLab config outdated? ... no
Log directory writable? ... yes
Tmp directory writable? ... yes
Uploads directory setup correctly? ... yes
Init script exists? ... skipped (omnibus-gitlab has no init script)
Init script up-to-date? ... skipped (omnibus-gitlab has no init script)
projects have namespace: ...
8/1 ... yes
10/2 ... yes
6/3 ... yes
9/4 ... yes
9/5 ... yes
9/6 ... yes
9/7 ... yes
9/8 ... yes
Redis version >= 2.8.0? ... yes
Ruby version >= 2.1.0 ? ... yes (2.3.3)
Your git bin path is "/opt/gitlab/embedded/bin/git"
Git version >= 2.7.3 ? ... yes (2.11.1)
Active users: 14

Checking GitLab ... Finished
</pre>
</details>

### Possible fixes

Don't allow multiple concurrent jobs to run in the same directory.

[.gitlab-ci.yml](/uploads/8b1688d603939e079a933868291a4d59/.gitlab-ci.yml)
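
To make the requested behavior concrete, here is a minimal, standard-library-only simulation of the interleaving described in the logs section. It is a sketch, not Runner code: `extract_cache`, `fake_git_clean`, and `replay` are hypothetical helpers, `fake_git_clean` only imitates the effect of `git clean` on untracked entries, and treating the numeric path component (the `0` in `builds/d450762c/0/abc/abc`) as a per-job slot number is an assumption made for illustration.

```python
import shutil
import tempfile
from pathlib import Path

CACHE_DIR_NAME = "Coverage-feature-15621-blahblah"
TRACKED = frozenset({"CMakeLists.txt"})  # hypothetical tracked file


def extract_cache(project_dir: Path) -> Path:
    """Stand-in for cache extraction: creates an untracked directory with one file."""
    cache_dir = project_dir / CACHE_DIR_NAME
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / "CTestTestfile.cmake").write_text("# restored from cache\n")
    return cache_dir


def fake_git_clean(project_dir: Path) -> list:
    """Stand-in for `git clean`: removes every top-level entry that is not tracked."""
    removed = []
    for entry in sorted(project_dir.iterdir()):
        if entry.name in TRACKED:
            continue
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed.append(entry.name)
    return removed


def replay(builds_root: Path, coverage_slot: int, debug_slot: int) -> bool:
    """Replay the reported sequence; return True if the coverage cache survives."""

    def project_dir(slot: int) -> Path:
        path = builds_root / str(slot) / "abc" / "abc"
        path.mkdir(parents=True, exist_ok=True)
        (path / "CMakeLists.txt").touch()
        return path

    coverage_dir = project_dir(coverage_slot)
    debug_dir = project_dir(debug_slot)

    # Steps 1-2: the coverage job starts and extracts its cache.
    cache_dir = extract_cache(coverage_dir)
    # Steps 3-5: the debug job starts and cleans its own working directory.
    fake_git_clean(debug_dir)
    # Step 6: the coverage job resumes and looks for its cache files.
    return cache_dir.exists() and any(cache_dir.iterdir())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Both jobs in the same directory: the cache is wiped.
        print("shared directory, cache survives:", replay(Path(tmp) / "shared", 0, 0))
        # One directory per concurrent job: the cache is left alone.
        print("separate directories, cache survives:", replay(Path(tmp) / "separate", 0, 1))
```

With both jobs in slot `0` the cache directory is gone by the time the coverage job resumes; with distinct slots it is untouched, which is the behavior we expect.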