GitLab Runner job fails because it is unable to delete old files on Windows.
Steps to reproduce
The issue is not deterministic. Basically, just launch jobs.
Actual behavior
Randomly, a job fails because the runner is unable to delete old files on Windows. It does not fail for all files, and the issue only happens on Windows.
It may happen when CI switches between different remotes (for example, from a fork to upstream) for the same project.
Clicking retry solves the issue.
Expected behavior
GitLab Runner deletes old files successfully.
Relevant logs and/or screenshots
Environment description
Are you using shared Runners on GitLab.com? Or is it a custom installation?
Which executors are used? Please also provide the versions of related tools
like docker info if you are using the Docker executor.
Custom GitLab
GitLab Runner 10.6.0 (same issue with GitLab Runner 10.5.0)
Windows 7
PowerShell executor
Used GitLab Runner version
Running with gitlab-runner 10.6.0 (a3543a27) on Windows 7 442f1c76
Same here; Git Strategy is set to clone, and no cache is set up for this job.
Running with gitlab-runner 10.6.0 (a3543a27) on Windows Server 2016 HyperV 3a30d90c
Using Shell executor...
Running on WIN-5QOR38JE2SF...
rm: cannot remove '/c/gitlab/builds/3a30d90c/0/group/subgroup/project/node_modules': Directory not empty
We are also experiencing this issue on macOS, using either the clone or the fetch strategy. However, the issue seems to occur BEFORE any git operation is done:
Versions:
gitlab-runner version 11.1.0
GitLab Enterprise Edition 10.8.5-ee 8f03e3e
Output:
Running with gitlab-runner 11.1.0 (081978aa) on ios-3.2 9cb17461
Using Shell executor...
Running on ios-3...
rm: /Users/gitlab/builds/9cb17461/0/NL-iOS/SomeProject/build/Build/Intermediates.noindex/ArchiveIntermediates/SomeProjectCI/IntermediateBuildFilesPath/SomeProject.build/Release-iphoneos/SomeProject.build/Objects-normal/armv7: Directory not empty
rm: /Users/gitlab/builds/9cb17461/0/NL-iOS/SomeProject/build/Build/Intermediates.noindex/ArchiveIntermediates/SomeProjectCI/IntermediateBuildFilesPath/SomeProject.build/Release-iphoneos/SomeProject.build/Objects-normal/arm64: Directory not empty
rm: /Users/gitlab/builds/9cb17461/0/NL-iOS/SomeProject/build/Build/Intermediates.noindex/ArchiveIntermediates/SomeProjectCI/IntermediateBuildFilesPath/SomeProject.build/Release-iphoneos/SomeProject.build/Objects-normal: Directory not empty
rm: /Users/gitlab/builds/9cb17461/0/NL-iOS/SomeProject/build/Build/Intermediates.noindex/ArchiveIntermediates/SomeProjectCI/IntermediateBuildFilesPath/SomeProject.build/Release-iphoneos/SomeProject.build: Directory not empty
rm: /Users/gitlab/builds/9cb17461/0/NL-iOS/SomeProject/build/Build/Intermediates.noindex/ArchiveIntermediates/SomeProjectCI/IntermediateBuildFilesPath/SomeProject.build/Release-iphoneos: Directory not empty
rm: /Users/gitlab/builds/9cb17461/0/NL-iOS/SomeProject/build/Build/Intermediates.noindex/ArchiveIntermediates/SomeProjectCI/IntermediateBuildFilesPath/SomeProject.build: Directory not empty
rm: /Users/gitlab/builds/9cb17461/0/NL-iOS/SomeProject/build/Build/Intermediates.noindex/ArchiveIntermediates/SomeProjectCI/IntermediateBuildFilesPath: Directory not empty
rm: /Users/gitlab/builds/9cb17461/0/NL-iOS/SomeProject/build/Build/Intermediates.noindex/ArchiveIntermediates/SomeProjectCI: Directory not empty
rm: /Users/gitlab/builds/9cb17461/0/NL-iOS/SomeProject/build/Build/Intermediates.noindex/ArchiveIntermediates: Directory not empty
rm: /Users/gitlab/builds/9cb17461/0/NL-iOS/SomeProject/build/Build/Intermediates.noindex: Directory not empty
rm: /Users/gitlab/builds/9cb17461/0/NL-iOS/SomeProject/build/Build: Directory not empty
rm: /Users/gitlab/builds/9cb17461/0/NL-iOS/SomeProject/build: Directory not empty
rm: /Users/gitlab/builds/9cb17461/0/NL-iOS/SomeProject: Directory not empty
Uploading artifacts...
Uploading artifacts to coordinator... ok  id=1954456 responseStatus=201 Created token=K_2vRnuU
ERROR: Job failed: exit status 1
Just to update: I discovered the cause of this problem. There were two gitlab-runner processes active, which probably made concurrent changes to the same directory.
A lingering gitlab-runner process was still running alongside the registered daemon.
After adding a test job (with dependencies: - build job), my problem goes away. The test job successfully removes the files, so the next pipeline runs through without any problem, starting with the build job.
Looking at this, it seems to be what was suggested in #3185 (comment 71808475): the git clean -ffdx that is done here is failing for some reason. git sometimes fails to remove a directory when it has child items underneath it.
I did a quick test with the following commands on Windows Server 2019 using raw git:
PS C:\Users\Administrator\Downloads> New-Item -Path test -Type Directory

    Directory: C:\Users\Administrator\Downloads

Mode                LastWriteTime         Length Name
----                -------------         ------ ----
d-----        7/18/2019   7:14 AM                test

PS C:\Users\Administrator\Downloads> cd .\test\
PS C:\Users\Administrator\Downloads\test> git --version
git version 2.18.0.windows.1
PS C:\Users\Administrator\Downloads\test> git init
Initialized empty Git repository in C:/Users/Administrator/Downloads/test/.git/
PS C:\Users\Administrator\Downloads\test> New-Item -Path node_modules -Type Directory

    Directory: C:\Users\Administrator\Downloads\test

Mode                LastWriteTime         Length Name
----                -------------         ------ ----
d-----        7/18/2019   7:15 AM                node_modules

PS C:\Users\Administrator\Downloads\test> New-Item -Path node_modules\test.txt -Type File

    Directory: C:\Users\Administrator\Downloads\test\node_modules

Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----        7/18/2019   7:15 AM              0 test.txt

PS C:\Users\Administrator\Downloads\test> git status
On branch master

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        node_modules/

nothing added to commit but untracked files present (use "git add" to track)
PS C:\Users\Administrator\Downloads\test> git clean -ffxd
Removing node_modules/
PS C:\Users\Administrator\Downloads\test>
We don't remove any files manually; we leverage git for this, since it keeps better track of everything. The reason I think this comes from the git clean command is that the job log says "Removing .....", which is the output of git clean. I checked whether this is a case of a process holding a specific file open, so I opened the node_modules/test.txt file in Notepad and tried to run git clean -ffxd, which got me a different result:
warning: failed to remove node_modules/: Permission denied
Removing node_modules/test.txt
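For anyone who wants to reproduce that lock more deterministically than with Notepad, here is a minimal PowerShell sketch of my own (hypothetical; it assumes the test repository from the transcript above and holds an explicit exclusive handle on the file):

# Hold an exclusive handle on a file inside node_modules, then run git clean.
$handle = [System.IO.File]::Open("node_modules\test.txt", 'Open', 'Read', 'None')
try {
    # With the handle open, git clean cannot delete the file, so the parent
    # directory is left behind as well (Permission denied / Directory not empty).
    git clean -ffxd
} finally {
    $handle.Close()   # release the lock
}
# With the handle closed, the same command succeeds: "Removing node_modules/"
git clean -ffxd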
I think something similar is happening here: some process holds the directory or one of its children, and then git fails to clean it up. This was the case in #3185 (comment 91519007), as the user said in #3185 (comment 91562800).
I have a few questions for anyone who encounters the problems:
What git version is being used?
What executor are you using? shell?
If it's really a process preventing git from finishing, I'm not sure GitLab Runner can do anything here. It's git that seems to be failing.
We have the same issue with our Windows gitlab-runners, and the cause is always that a (sub)process started by the build is hanging. Most of the time it is vctip.exe keeping a lock on the files, but I've also seen some other executables that don't stop after the build. If I kill the process, the gitlab-runner is able to run successfully again.
My question is: should the gitlab-runner keep a list of all the processes that are started and kill them in a post-job step if they are still running, or should we implement something ourselves for this?
shell executor on Windows with gitlab-runner 12.1.0
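A minimal sketch of what such a manual cleanup step could look like (my illustration only; vctip.exe is the process named above, and the list would have to be adapted to whatever your builds leave behind):

# Kill known lingering build helper processes so git clean can delete the work tree.
# The process list is an example, not exhaustive.
foreach ($name in @('vctip')) {
    Get-Process -Name $name -ErrorAction SilentlyContinue | Stop-Process -Force
}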
Just upgraded the runners to 12.3.0.
The question remains: if a (sub)process is started by the build (from the gitlab-runner), should it be killed by the gitlab-runner when it hangs or the build is cancelled? Or is the parent process responsible?
I've just seen this a few times in a couple of days after setting up a runner. The last time, the project in question had been inactive for some time and I had restarted the runner machine after it was last built, so there should have been no lingering processes (the clean happens on checkout).
Windows 8, git 2.24.0.windows.2, gitlab-runner 12.4.1.
I'm having the same problem. Windows 10, git 2.25.0.windows.1, gitlab-runner 12.6.0.
Some directories do not get removed by git clean. I have not found a workaround yet.
I try to remove a 'build' folder recursively using a PowerShell command:
cd src\my-app
Remove-Item build -Recurse -Force -ErrorAction Ignore
The error I get is ERROR: Job failed: exit status 1. I can execute exactly the same command in a local PowerShell session, but I don't know why I get this error when it runs under the CI/CD runner. Below are the environment details:
Windows 10 Pro 64-bit
Git 2.25.0.windows.1.
Windows PowerShell 5.1.
GitLab Runner 12.4.1.
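For the Remove-Item failure above, a hedged sketch of a retry wrapper that reports the underlying error instead of suppressing it with -ErrorAction Ignore (the folder name is taken from the command above; the attempt count and delay are arbitrary):

$target = 'build'
for ($i = 1; $i -le 5; $i++) {
    try {
        if (Test-Path $target) {
            Remove-Item $target -Recurse -Force -ErrorAction Stop
        }
        break                              # deleted (or nothing to delete)
    } catch {
        Write-Host "Attempt ${i}: $($_.Exception.Message)"
        Start-Sleep -Seconds 2             # give lingering handles time to close
    }
}
if (Test-Path $target) { exit 1 }          # still present: fail the job explicitly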
It has been a year since this issue was reported; is there any workaround for this?
Working:
Running with gitlab-runner 12.7.1 (003fe500) on Docker Runner for building Docker images 6e8756e8
Using Docker executor with image docker:latest ...
...
Fetching changes...
00:01
Reinitialized existing Git repository in /builds/projekt/.git/
Checking out d5c8fb9c as master...
Removing .Build/
Skipping Git submodules setup
Downloading artifacts for build_npm (101447)...
Not working:
Running with gitlab-runner 12.7.1 (003fe500) on Docker Runner for building Docker images 6e8756e8
Using Docker executor with image docker:latest ...
...
Fetching changes...
00:03
Reinitialized existing Git repository in /builds/projekt/.git/
From https://gitlab.domain.com/projekt
 * [new ref]              refs/pipelines/29231 -> refs/pipelines/29231
   7c015b746..b398ea6ac   BRANCH-140 -> origin/BRANCH-140
Checking out b398ea6a as BRANCH-140...
Removing bootstrap/cache/packages.php
warning: failed to remove vendor/fzaninotto/faker/test/Faker/Provider: Directory not empty
Removing bootstrap/cache/services.php
Removing cached/
Removing vendor/fzaninotto/faker/test/Faker/Provider/AddressTest.php
Removing vendor/fzaninotto/faker/test/Faker/Provider/BarcodeTest.php
Removing vendor/fzaninotto/faker/test/Faker/Provider/BaseTest.php
Removing vendor/fzaninotto/faker/test/Faker/Provider/BiasedTest.php
Removing vendor/fzaninotto/faker/test/Faker/Provider/ColorTest.php
Removing vendor/fzaninotto/faker/test/Faker/Provider/CompanyTest.php
Removing vendor/fzaninotto/faker/test/Faker/Provider/DateTimeTest.php
Removing vendor/fzaninotto/faker/test/Faker/Provider/HtmlLoremTest.php
Removing vendor/fzaninotto/faker/test/Faker/Provider/ImageTest.php
Removing vendor/fzaninotto/faker/test/Faker/Provider/InternetTest.php
Removing vendor/fzaninotto/faker/test/Faker/Provider/LocalizationTest.php
...
Removing vendor/webmozart
ERROR: Job failed: exit code 1
The problem is gone after downgrading the gitlab-runner from 12.7.1 to:
Version: 10.3.0
Git revision: 5cf5e19a
Git branch: 10-3-stable
GO version: go1.8.5
Built: Mon, 15 Jan 2018 10:05:35 +0000
OS/Arch: windows/amd64
Before the downgrade, almost every pipeline had at least one failed job, so I needed to restart it manually.
I can confirm this problem with gitlab-runner 12.6.0 (ac8e767a) and the Windows 10 PowerShell executor. git clean fails to remove all files from the build directory:
Running with gitlab-runner 12.6.0 (ac8e767a)
on VM Windows 10 Home (UM) jYcw3UNc
Using Shell executor...
Running on DESKTOP-04U561E...
Versteckte Datei wird nicht zurückgesetzt - C:\GitLab-Runner\builds\jYcw3UNc\0\xxx\xxx\.git
warning: failed to remove build/: Directory not empty
Removing build/ALL_BUILD.vcxproj
Removing build/ALL_BUILD.vcxproj.filters
Removing build/CMakeCache.txt
Removing build/CMakeFiles
Removing build/cmake_install.cmake
...
Removing build/_CPack_Packages
ERROR: Job failed: exit status 1
Running git clean -d -x -f manually after a failure always succeeds. I presume this is caused by open file handles held by some Windows 10 process.
My current workaround is a pre-clone script that loops: waiting, removing write-protection, and re-running git clean.
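The script itself is not included here; a sketch of the approach described (clear read-only flags, re-run git clean, wait and retry), for example wired in as the runner's pre_clone_script, could look like this:

# Pre-clone cleanup loop: clear read-only flags, retry git clean until it succeeds.
for ($i = 0; $i -lt 10; $i++) {
    # Clear the read-only attribute on all files so git clean is not blocked by it.
    Get-ChildItem -Recurse -Force -File -ErrorAction SilentlyContinue |
        ForEach-Object { $_.IsReadOnly = $false }
    git clean -ffdx
    if ($LASTEXITCODE -eq 0) { break }     # clean succeeded
    Start-Sleep -Seconds 3                 # wait for open file handles to be released
}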
This bug was opened 2 years ago as Windows-only.
We have never used GitLab on Windows, only on Linux, and we had not seen this bug until recently. I would say gitlab-runner 12.7.2 might still have been unaffected, but 12.9.0 certainly is. But then another Linux user reports that 12.7.1 is affected: #3185 (comment 278014486)
However, the same user @kilian later reports that downgrading his Linux kernel helped: #3185 (comment 278136191). I cannot check our kernel history at the moment (greetings from lockdown combined with IT issues...)
I have the strong feeling that very different root causes are getting mixed up in this issue.
Thank you for this comment! I'm running a Linux GitLab Runner in Docker on a Linux host and found that upgrading to a kernel that includes this patch fixed all of the failures to delete build artifact directories/files that we were seeing!
Hi, I experience this issue very often on two macOS machines (Catalina, 10.15.4).
Running with gitlab-runner 12.9.0 (4c96e5ad) on compile-client-a aZ42Fx4s
Preparing the "shell" executor
Using Shell executor...
Preparing environment
Running on compile-client-a.local...
Getting source from Git repository
Fetching changes with git depth set to 1...
Reinitialized existing Git repository in /Users/compiler/builds/aZ42Fx4s/0/dev/foo/.git/
Checking out e5fd5775 as feature/bar...
warning: failed to remove build/foo.module.build/Release/foo.module.build: Directory not empty
The build steps in these branches execute xcodebuild with ccache. Any further information I can provide?
I suspect this issue is related to Windows processes not always being terminated after a build ends, as in #3121 (closed) and others.
Once we've improved Windows process termination and moved away from the problems associated with taskkill, this will likely no longer be an issue.
I managed to solve the problem by placing a cleaning step in the previous pipeline. In my case, the problem was files with very long names in node_modules, which Windows cannot delete.
After the pipeline and the deploy finish, execute the cleanup commands.
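The exact commands did not make it into this extract. Purely as an illustration, a cleanup along these lines deals with paths that exceed Windows' 260-character limit (node_modules and the temporary directory name are assumptions on my part):

# Mirror an empty directory over node_modules so robocopy deletes the long paths,
# then remove the now-empty leftovers.
$empty = Join-Path $env:TEMP 'empty-dir'
New-Item -ItemType Directory -Path $empty -Force | Out-Null
robocopy $empty 'node_modules' /MIR | Out-Null
Remove-Item 'node_modules', $empty -Recurse -Force
# Note: robocopy exits non-zero even on success (codes below 8 are informational),
# so don't let that exit code fail the job.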
We've been trialling a new gitlab.com subscription (attempting to move away from Jenkins) and we're seeing these same intermittent failures:
Running with gitlab-runner 14.8.2 (c6e7e194)
...
Preparing the "shell" executor
00:00
Using Shell executor...
...
Fetching changes with git depth set to 20...
Reinitialized existing Git repository in C:/JenkinsWorkspace/Gitlab-Runner/builds/yJPokT9a/0/zx-lidars/software/.git/
Checking out 597e6679 as refs/merge-requests/1/merge...
Removing Common/SomeInternalFrameworkOutput
...
Removing Common/ZephirFileSystem/Release/
Removing Tools/ToolsLogin/Release/
warning: failed to remove Path/To/A/Library/Of/Ours/vc120.pdb: Invalid argument
warning: failed to remove Common/Util/Release/: Directory not empty
Cleaning up project directory and file based variables
00:00
ERROR: Job failed: exit status 1
Tried updating git and GitLab Runner to the latest versions and got the same issue.
git version 2.36.0.windows.1
gitlab-runner 14.10.0
Tried reducing the number of concurrent runners from 4 to 1, and the problem seems to have dried up. This is a serious problem for us though, as with only one build running at a time GitLab CI has little benefit over our old Jenkins-based build system.
Spotted something perhaps crucial today: it would appear that the git checkout is happening on a different node than the build in a given job:
Reinitialized existing Git repository in C:/Gitlab-Runner/builds/yJPokT9a/2/zx-lidars/software/.git/
Checking out 4d269c4e as refs/merge-requests/1/merge...
git-lfs/3.1.4 (GitHub; windows amd64; go 1.17.8)
Skipping Git submodules setup
Executing "step_script" stage of the job script
00:07
$ echo "BUILDING AlgorithmTestHarness (C:\Gitlab-Runner\builds\yJPokT9a\0\zx-lidars\software\Path\Top\Some\Code.sln)"
BUILDING AlgorithmTestHarness (C:\Gitlab-Runner\builds\yJPokT9a\0\zx-lidars\software\Path\Top\Some\Code.sln)
Note the /2/ vs \0\.
If that's genuinely what's happening, it explains why some builds fail because they can't access files and others fail because they can't remove files when multiple executors are active.
I even tried setting up additional identical runners and limiting my executors to 1 per runner. With that in place we see some even weirder behaviour where it cleans the git repo on one runner then builds on a completely different runner:
Reinitialized existing Git repository in C:/Gitlab-Runner/builds/yJPokT9a/0/zx-lidars/software/.git/
...
BUILDING UnitTest_LcuDataPipeline (C:\Gitlab-Runner\builds\5phH4R3X\0\zx-lidars\software\LcuFirmware\LcuDataPipeline\UnitTest_LcuDataPipeline\UnitTest_LcuDataPipeline.sln)
Scratch that; support finally got back to me and pointed out what looked like hard-coded paths in our builds.
A bit of digging revealed that was indeed the case: our dynamic pipeline-generating code was listing the path (including the runner instance) to the Visual Studio source code directory, and hence our builds were failing because we were trying to build in instances of the repo that were busy with another job.
Fixing our code to use $CI_PROJECT_DIR where we were referring to our solution files solved the issue for us, and it is now running happily with 4 concurrent runners and no file access errors.
The hard-coded paths were in our dynamically generated .yml pipeline file, which is what configured the child pipeline builds. We'd set the values to those of the first runner that generated the pipeline, so all the child jobs then tried to use that runner's path at the same time. Predictably, bad things happened.
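For anyone hitting the same trap, a minimal before/after sketch (the msbuild invocation is only illustrative; the paths are the redacted ones from the logs above):

# Before: hard-coded to whichever runner build directory generated the pipeline.
# msbuild "C:\Gitlab-Runner\builds\yJPokT9a\0\zx-lidars\software\Path\Top\Some\Code.sln"

# After: resolve the solution relative to the job's own checkout.
msbuild "$env:CI_PROJECT_DIR\Path\Top\Some\Code.sln" /p:Configuration=Release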
Our Windows pipelines (PowerShell executor) experience errors like:
Remove-Item : Cannot remove item xxxxxxxx The directory is not empty.
We tried manually removing those paths on the runner, and then the pipeline passes. However, it always fails when we run it a second time with GIT_STRATEGY: clone. With GIT_STRATEGY: fetch it sometimes gives other kinds of errors:
fatal: Unable to find current revision in submodule path xxxxxxx
fatal: not a git repository: xxxxxxx
I guess this is actually because the runner cannot (completely) remove some paths?
Later we noticed that at the beginning of the log there was a different type of error:
Remove-Item : Cannot remove item C:\xxxxxxxxxx Could not find a part of the path 'xxxxxxxx'
That led us to the 260-character path limit of Windows. We enabled long paths on the runners by changing the value of the LongPathsEnabled registry key, as described here and here. That seems to solve the Could not find a part of the path issue, and the The directory is not empty issue also seems to disappear.
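For reference, the registry change can be made from an elevated PowerShell prompt as sketched below; enabling git's own long-path handling as well is an additional suggestion of mine, not something from the comment above:

# Allow Win32 paths longer than 260 characters (Windows 10 1607+; requires admin).
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem' `
    -Name 'LongPathsEnabled' -Value 1 -Type DWord
# Optionally let git handle long paths on Windows as well.
git config --system core.longpaths true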
In 16.9, now that MR !4525 (merged) is in, users can set the feature flag FF_USE_WINDOWS_JOB_OBJECT, which should reliably kill processes that might be holding files open and preventing their deletion. So if you hit this problem, please set that feature flag. It could eventually become the default if it doesn't cause trouble.
I'm not sure whether there are other issues that might prevent file deletion. Maybe some process changed the permissions on a file to prevent its deletion?
I'm closing this issue because I believe it's been addressed, but if I'm wrong please feel free to re-open or create a new issue.
This doesn't just affect Windows; there are plenty of comments from others using different host OSes.
We still see this reliably on a single macOS runner; other macOS runners in our pool don't exhibit the issue.
The problem runner fails to remove a large node_modules directory during the "Getting source from Git repository" stage.
Logging:
warning: failed to remove node_modules/: Directory not empty
before the job fails.
Is there a hard timeout on the cleanup stage? We're wondering if a faulty SSD could be a contributor to it taking too long to clean up, but other operations on the host don't seem affected.
As others have suggested, setting GET_SOURCES_ATTEMPTS does help; follow-up attempts finish the cleanup.