Concurrent jobs on a single Runner sometimes run in the same CI_PROJECT_DIR
Summary
When the global setting is concurrent=4 (probably anything > 1), concurrent jobs on a single Runner sometimes clobber each other by running in the same build directory (same CI_PROJECT_DIR). While the exact nature of the clobbering differs based on timing, typically Job1 in the test stage is extracting the cache from the build stage while Job2 in the test stage is running git clean, which promptly deletes the Job1 cache directory.
This problem does not occur with every build. It is completely timing-related and quite random: we may not see it for days and then see it in multiple builds in a row.
The job that does the clobbering and the job that gets clobbered can be associated with the same pipeline or a different pipeline. We've seen both.
The real-life example shown in the "logs" section occurred when two pipelines from two different branches were concurrently active. However, the job that was clobbered was clobbered by another job in the same pipeline. In that particular case, two concurrently active pipelines do not appear to have been required, but the likelihood of the problem definitely seems higher when pipelines from multiple different branches are concurrently active.
Steps to reproduce
The associated .gitlab-ci.yml is attached to this ticket, but here is an overview:
- Set the global concurrent setting to 4
- Enable a single Runner
- Configure a pipeline with three stages: build, test, deploy
- Configure keyed caches
- Make concurrent commits to the repo on multiple (e.g. 2 or 3) different branches. This isn't technically necessary, but for some reason the likelihood of the problem occurring is higher when pipelines are concurrently active for multiple different branches.
What is the current bug behavior?
Two jobs occasionally (not always) run concurrently in the same directory (CI_PROJECT_DIR), leading to one job interfering with the other and causing the other to fail.
What is the expected correct behavior?
We expect concurrent jobs to run in separate directories (separate CI_PROJECT_DIR) or, put another way, only one job runs in a given project directory at any one time.
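A minimal Python sketch of that expectation follows. It assumes the directory layout visible in the job logs (builds root, then the Runner ID d450762c, then a numeric component, then the project path) and further assumes that the numeric component is a per-job concurrency slot. The helper names project_dir and assign_dirs are hypothetical and are not part of the Runner.

```python
from pathlib import PurePosixPath


def project_dir(builds_root, runner_id, slot, project_path):
    """Compose a build directory as <root>/<runner id>/<slot>/<project path>.

    Assumption: the layout is inferred from the paths in the job logs;
    the meaning of the numeric slot component is a guess.
    """
    return PurePosixPath(builds_root) / runner_id / str(slot) / project_path


def assign_dirs(job_ids, builds_root, runner_id, project_path):
    """Give each concurrently running job its own slot, hence its own directory."""
    return {
        job_id: project_dir(builds_root, runner_id, slot, project_path)
        for slot, job_id in enumerate(job_ids)
    }


if __name__ == "__main__":
    dirs = assign_dirs(
        [5384, 5386], "/home/gitlab-runner/builds", "d450762c", "abc/abc"
    )
    for job_id, path in dirs.items():
        print(job_id, path)
```

Under this scheme jobs 5384 and 5386 would never share a directory, whereas in the logs both ended up in slot 0.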
Relevant logs and/or screenshots
Initial console output for ABC project running unit_test_debug job (job 5384 in pipeline 773 for commit 77c48776 from feature/15621_BlahBlah):
Running with gitlab-ci-multi-runner 9.2.0 (adfc387)
on KILIMANJARO Test Runner (d450762c)
Using Shell executor...
Running on kilimanjaro...
Fetching changes...
Removing Coverage-feature-15621-blahblah/
HEAD is now at 77c4877 cleaning up RoundRobinTest and unit test.
Checking out 77c48776 as feature/15621_blahblah...
Updating/initializing submodules...
Checking cache for feature-15621-blahblah/DEBUG...
Successfully extracted cache
$ cd $ABC_BUILD_DIR_REL/
$ ctest
Test project /home/gitlab-runner/builds/d450762c/0/abc/abc/Debug-feature-15621-blahblah
Start 1: ABC_IosConfiguration
1/37 Test #1: ABC_IosConfiguration .............. Passed 0.00 sec
Initial console output for ABC project running unit_test_coverage job (job 5386 in pipeline 773 for commit 77c48776 from feature/15621_BlahBlah):
Running with gitlab-ci-multi-runner 9.2.0 (adfc387)
on KILIMANJARO Test Runner (d450762c)
Using Shell executor...
Running on kilimanjaro...
Fetching changes...
Removing Release-feature-15621-blahblah/
HEAD is now at 77c4877 cleaning up RoundRobinTest and unit test.
Checking out 77c48776 as feature/15621_BlahBlah...
Updating/initializing submodules...
Checking cache for feature-15621-blahblah/COVERAGE...
Successfully extracted cache
$ cd $ABC_BUILD_DIR_REL
$ make coverage
CMake Error: Target DependInfo.cmake file not found
Scanning dependencies of target coverage
CMake Error: Directory Information file not found
[100%] Resetting code coverage counters to zero.
Processing code coverage counters and generating report.
Deleting all .da files in . and subdirectories
Done.
Test project /home/gitlab-runner/builds/d450762c/0/abc/abc/Coverage-feature-15621-blahblah
No tests were found!!!
.
<snip other project-specific error messages>
.
Uploading artifacts...
WARNING: ./Coverage-feature-15621-blahblah/Testing/Temporary/LastTest.log: no matching files
Uploading artifacts to coordinator... ok id=5386 responseStatus=201 Created token=jHjxZzxa
ERROR: Job failed: exit status 1
Both of the above jobs were started at nearly the same time (we don't have exact timestamps). The sequence of events is something like this:
- Coverage job 5386 starts first, running in directory /home/gitlab-runner/builds/d450762c/0/abc/abc/
- Coverage job 5386 extracts the cache from the previous build stage. This creates the directory /home/gitlab-runner/builds/d450762c/0/abc/abc/Coverage-feature-15621-blahblah
- Debug job 5384 starts up sometime after job 5386 extracts its cache but before job 5386 starts running tests
- Debug job 5384 is also running in directory /home/gitlab-runner/builds/d450762c/0/abc/abc/ (same directory as job 5386)
- Debug job 5384 runs the standard GIT_STRATEGY=fetch, which includes git clean, which leads to job 5384 removing job 5386's Coverage-feature-15621-blahblah/ directory
- Coverage job 5386 resumes and attempts to use the cache files it has just extracted to the Coverage-feature-15621-blahblah/ directory, but it finds the directory is gone, so it fails and fails badly
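The interleaving above can be replayed deterministically with a small Python sketch. This is not the Runner's code: extract_cache and clean_untracked are hypothetical stand-ins for cache extraction and for the clean step of a fetch, and the file names inside the directories are invented for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def extract_cache(project_dir, cache_name):
    """Stand-in for cache extraction: create an untracked build directory."""
    cache_dir = project_dir / cache_name
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical cached file; the real cache contents are project-specific.
    (cache_dir / "CTestTestfile.cmake").write_text("# cached build output\n")
    return cache_dir


def clean_untracked(project_dir, tracked):
    """Stand-in for the clean step: remove every entry that is not tracked."""
    removed = []
    for entry in sorted(project_dir.iterdir()):
        if entry.name in tracked:
            continue
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed.append(entry.name)
    return removed


def simulate_shared_directory():
    """Replay the sequence of events with both jobs in one directory."""
    with tempfile.TemporaryDirectory() as tmp:
        project_dir = Path(tmp)
        (project_dir / "CMakeLists.txt").write_text("# tracked file\n")
        tracked = {"CMakeLists.txt"}

        # Job 5386 extracts its cache into the shared directory.
        cache_dir = extract_cache(project_dir, "Coverage-feature-15621-blahblah")

        # Job 5384 starts in the same directory and cleans untracked files.
        removed = clean_untracked(project_dir, tracked)

        # Job 5386 resumes and looks for the cache it just extracted.
        return removed, cache_dir.exists()


if __name__ == "__main__":
    removed, cache_survived = simulate_shared_directory()
    print("removed by job 5384:", removed)
    print("cache still present for job 5386:", cache_survived)
```

The first job's cache directory is the only untracked entry, so the second job's clean step removes it, which matches the "Removing Coverage-feature-15621-blahblah/" line in the job 5384 log.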
Please don't suggest changing to GIT_STRATEGY=none for the test stage. Even though that git strategy might be more appropriate for our 'test' and 'deploy' stages -- and it does at least help with the problem -- it doesn't fix the problem, because we also see interference between stages. For example, the 'build' stage clobbers the 'test' stage because we use GIT_STRATEGY=fetch for the build stage. We aren't willing to change the build stage git strategy because we want a clean slate when starting a build.
Output of checks
See the output below from our self-hosted GitLab deployment:
Results of GitLab environment info
root@gitlab:~# gitlab-rake gitlab:env:info
System information
System: Ubuntu 16.04
Proxy: no
Current User: git
Using RVM: no
Ruby Version: 2.3.3p222
Gem Version: 2.6.6
Bundler Version: 1.13.7
Rake Version: 10.5.0
Redis Version: 3.2.5
Git Version: 2.11.1
Sidekiq Version: 5.0.0
GitLab information
Version: 9.2.2-ee
Revision: b004167
Directory: /opt/gitlab/embedded/service/gitlab-rails
DB Adapter: postgresql
DB Version: 9.6.1
URL: https://gitlab..com
HTTP Clone URL: https://gitlab..com/some-group/some-project.git
SSH Clone URL: git@gitlab..com:some-group/some-project.git
Elasticsearch: no
Geo: no
Using LDAP: no
Using Omniauth: no
GitLab Shell
Version: 5.0.4
Repository storage paths:
- default: /var/opt/gitlab/git-data/repositories
Hooks: /opt/gitlab/embedded/service/gitlab-shell/hooks
Git: /opt/gitlab/embedded/bin/git
Results of GitLab application Check
root@gitlab:~# gitlab-rake gitlab:check SANITIZE=true
Checking GitLab Shell ...
GitLab Shell version >= 5.0.4 ? ... OK (5.0.4)
Repo base directory exists? default... yes
Repo storage directories are symlinks? default... no
Repo paths owned by git:root, or git:git? default... yes
Repo paths access is drwxrws---? default... yes
hooks directories in repos are links: ...
8/1 ... ok
10/2 ... ok
6/3 ... ok
9/4 ... ok
9/5 ... ok
9/6 ... ok
9/7 ... ok
9/8 ... ok
Running /opt/gitlab/embedded/service/gitlab-shell/bin/check
Check GitLab API access: OK
Access to /var/opt/gitlab/.ssh/authorized_keys: OK
Send ping to redis server: OK
gitlab-shell self-check successful
Checking GitLab Shell ... Finished
Checking Sidekiq ...
Running? ... yes
Number of Sidekiq processes ... 1
Checking Sidekiq ... Finished
Checking Reply by email ...
Reply by email is disabled in config/gitlab.yml
Checking Reply by email ... Finished
Checking LDAP ...
LDAP is disabled in config/gitlab.yml
Checking LDAP ... Finished
Checking GitLab ...
Git configured with autocrlf=input? ... yes
Database config exists? ... yes
All migrations up? ... yes
Database contains orphaned GroupMembers? ... no
GitLab config exists? ... yes
GitLab config outdated? ... no
Log directory writable? ... yes
Tmp directory writable? ... yes
Uploads directory setup correctly? ... yes
Init script exists? ... skipped (omnibus-gitlab has no init script)
Init script up-to-date? ... skipped (omnibus-gitlab has no init script)
projects have namespace: ...
8/1 ... yes
10/2 ... yes
6/3 ... yes
9/4 ... yes
9/5 ... yes
9/6 ... yes
9/7 ... yes
9/8 ... yes
Redis version >= 2.8.0? ... yes
Ruby version >= 2.1.0 ? ... yes (2.3.3)
Your git bin path is "/opt/gitlab/embedded/bin/git"
Git version >= 2.7.3 ? ... yes (2.11.1)
Active users: 14
Checking GitLab ... Finished
Possible fixes
Don't allow multiple concurrent jobs to run in the same directory.
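One way to picture such a guard is an exclusive advisory lock held for the lifetime of a job. The sketch below is only an illustration under stated assumptions: it uses Python's fcntl.flock on a Unix host, the helper name exclusive_project_dir is hypothetical, and nothing here reflects how the Runner is actually implemented or how a real fix would look.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def exclusive_project_dir(project_dir, blocking=True):
    """Hold an exclusive advisory lock for one project directory.

    The lock file sits next to the directory rather than inside it, so a
    clean step that deletes untracked files cannot remove the lock file.
    """
    lock_path = os.path.normpath(project_dir) + ".lock"
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        # In non-blocking mode this raises BlockingIOError when another
        # holder already owns the lock.
        fcntl.flock(fd, flags)
        try:
            yield project_dir
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        shared = os.path.join(tmp, "abc")
        os.mkdir(shared)
        with exclusive_project_dir(shared):
            try:
                # A second job asking for the same directory is refused.
                with exclusive_project_dir(shared, blocking=False):
                    print("second job entered (unexpected)")
            except BlockingIOError:
                print("second job refused while the first holds the lock")
```

A second job would either wait (blocking mode) or be redirected to another directory (non-blocking mode); which of the two is preferable is a design decision for the Runner.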