ArchiveTraceWorker fails to remove '<ARTIFACTS_PATH>/tmp/cache/<CACHE_ID>' directory when 'gitlab_rails['artifacts_path']' is on an NFS mount in 14.0
Summary
When `gitlab_rails['artifacts_path']` is located on an NFS mount, the `<ARTIFACTS_PATH>/tmp/cache/<CACHE_ID>` directory generated by `ArchiveTraceWorker` will not be removed.
Here is the chain of events:
- `Ci::Trace#clone_file!` creates the temporary file and copies the trace into it.
- `Ci::Trace#create_build_trace!` then opens the temporary file in a block, and the file remains open until the upload to object storage is complete.
- `CarrierWave::Uploader::Store#store!` attempts to delete the temporary file while still inside this block.
- On NFS, the `FileUtils.rm_f` call does not delete the underlying file because there is still an open file descriptor against it. Instead, NFS performs a 'silly rename' and the file becomes `.nfsxxxx`. On an ext4 filesystem the file is deleted in this scenario.
- CarrierWave then tries to remove the cache directory, but the `rmdir` fails with `ENOTEMPTY` because the silly-renamed file is still inside it. The failure does not trigger an exception because `CarrierWave::Storage::File#delete_dir!` silently rescues `Errno::ENOTEMPTY`.
- When the `File.open` block exits, the file descriptor is closed, but no further cleanup attempts are made and the directory is never removed.
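The sequence can be sketched in plain Ruby (a minimal reproduction, not GitLab's actual code). On a local filesystem such as ext4 the cleanup succeeds; on NFS the open descriptor triggers the silly rename and the `rmdir` fails:

```ruby
require "fileutils"
require "tmpdir"

# Minimal sketch of the failing sequence, assuming a layout like
# <ARTIFACTS_PATH>/tmp/cache/<CACHE_ID>/job.log. Not GitLab's real code.
cache_dir = Dir.mktmpdir("cache-")
trace = File.join(cache_dir, "job.log")
File.write(trace, "trace contents")

File.open(trace, "rb") do |f|
  # While the descriptor is still open, the file is deleted...
  FileUtils.rm_f(trace)    # on NFS this becomes a .nfsXXXX silly rename

  # ...and then the cache directory is removed, swallowing ENOTEMPTY,
  # just as CarrierWave::Storage::File#delete_dir! does.
  begin
    Dir.rmdir(cache_dir)   # fails with ENOTEMPTY on NFS
  rescue Errno::ENOTEMPTY
    # silently ignored
  end
end

# On a local filesystem the directory is gone; on NFS it survives,
# containing a .nfsXXXX file until the descriptor is closed.
puts Dir.exist?(cache_dir)
```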
Steps to reproduce
- Enable object storage for job artifacts.
- Enable direct upload for artifacts.
- Use an NFS mount for the path specified by `gitlab_rails['artifacts_path']`.
- Run a CI job. After `ArchiveTraceWorker` finishes successfully, the `<ARTIFACTS_PATH>/tmp/cache/<CACHE_ID>` directory is still present.
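To confirm the leak, the leftover cache directories (and any silly-renamed `.nfs*` files inside them) can be listed with a short Ruby snippet. The path below is an assumption; substitute your `gitlab_rails['artifacts_path']`:

```ruby
# List leftover <ARTIFACTS_PATH>/tmp/cache/<CACHE_ID> directories.
def leftover_cache_dirs(artifacts_path)
  Dir.glob(File.join(artifacts_path, "tmp", "cache", "*"))
     .select { |path| File.directory?(path) }
end

# The default Omnibus artifacts path is used here as an assumption.
leftover_cache_dirs("/var/opt/gitlab/gitlab-rails/shared/artifacts").each do |dir|
  # A literal leading dot in the glob pattern matches the hidden .nfs* files.
  nfs_files = Dir.glob(File.join(dir, ".nfs*"))
  puts "leftover: #{dir} (#{nfs_files.size} silly-renamed files)"
end
```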
What is the current bug behavior?
`<ARTIFACTS_PATH>/tmp/cache/<CACHE_ID>` is still present after `ArchiveTraceWorker` finishes.
What is the expected correct behavior?
`<ARTIFACTS_PATH>/tmp/cache/<CACHE_ID>` is removed successfully.
Relevant logs and/or screenshots
No errors logged.
In `strace` we can see the following:
```
# job.log opened as read-only as fd 357
2150 19:15:19.218372 open("/gitlab-common/shared/artifacts/tmp/uploads/tmp-trace-27548320200417-1670-1ptmb1q/job.log", O_RDONLY|O_CLOEXEC) = 357</gitlab-common/shared/artifacts/tmp/uploads/tmp-trace-27548320200417-1670-1ptmb1q/job.log> <0.009719>

# job.log is moved to the work directory
2150 19:15:19.664076 rename("/nfs/shared/artifacts/tmp/uploads/tmp-trace-27548320200417-1670-1ptmb1q/job.log", "/nfs/shared/artifacts/tmp/work/1587150919-1670-0024-8708/job.log") = 0 <0.043111>

# job.log is moved to the cache directory
2150 19:15:20.010145 rename("/nfs/shared/artifacts/tmp/work/1587150919-1670-0024-8708/job.log", "/nfs/shared/artifacts/tmp/cache/1587150919-1670-0024-8708/job.log") = 0 <0.041765>

# delete job.log, note that fd 357 has **NOT** been closed at this point
2150 19:15:20.499149 unlink("/nfs/shared/artifacts/tmp/cache/1587150919-1670-0024-8708/job.log" ) = 0 <0.021495>

# attempt to remove cache directory
2150 19:15:20.522118 rmdir("/nfs/shared/artifacts/tmp/cache/1587150919-1670-0024-8708" ) = -1 ENOTEMPTY (Directory not empty) <0.022194>

# fd 357 is closed; the name of the target file has changed, as job.log is 'deleted'
2150 19:15:20.560965 close(357</nfs/shared/artifacts/tmp/cache/1587150919-1670-0024-8708/.nfsaa3dff08e57ae7ad00000a1e>) = 0 <0.000681>
```
Output of checks
Results of GitLab environment info
Expand for output related to GitLab environment info
```
System information
System:          Ubuntu 18.04
Proxy:           no
Current User:    git
Using RVM:       no
Ruby Version:    2.6.5p114
Gem Version:     2.7.10
Bundler Version: 1.17.3
Rake Version:    12.3.3
Redis Version:   5.0.7
Git Version:     2.24.2
Sidekiq Version: 5.2.7
Go Version:      unknown

GitLab information
Version:         12.9.4-ee
Revision:        6a1a8e88568
Directory:       /opt/gitlab/embedded/service/gitlab-rails
DB Adapter:      PostgreSQL
DB Version:      10.12
Elasticsearch:   no
Geo:             no
Using LDAP:      no
Using Omniauth:  yes
Omniauth Providers:

GitLab Shell
Version:         12.0.0
Repository storage paths:
- default:       /var/opt/gitlab/git-data/repositories
GitLab Shell path: /opt/gitlab/embedded/service/gitlab-shell
Git:             /opt/gitlab/embedded/bin/git
```
Results of GitLab application Check
Expand for output related to the GitLab application check
```
Checking GitLab subtasks ...

Checking GitLab Shell ...

GitLab Shell: ... GitLab Shell version >= 12.0.0 ? ... OK (12.0.0)
Running /opt/gitlab/embedded/service/gitlab-shell/bin/check
Internal API available: OK
Redis available via internal API: OK
gitlab-shell self-check successful

Checking GitLab Shell ... Finished

Checking Gitaly ...

Gitaly: ... default ... OK

Checking Gitaly ... Finished

Checking Sidekiq ...

Sidekiq: ... Running? ... yes
Number of Sidekiq processes ... 1

Checking Sidekiq ... Finished

Checking Incoming Email ...

Incoming Email: ... Reply by email is disabled in config/gitlab.yml

Checking Incoming Email ... Finished

Checking LDAP ...

LDAP: ... LDAP is disabled in config/gitlab.yml

Checking LDAP ... Finished

Checking GitLab App ...

Git configured correctly? ... yes
Database config exists? ... yes
All migrations up? ... yes
Database contains orphaned GroupMembers? ... no
GitLab config exists? ... yes
GitLab config up to date? ... yes
Log directory writable? ... yes
Tmp directory writable? ... yes
Uploads directory exists? ... yes
Uploads directory has correct permissions? ... yes
Uploads directory tmp has correct permissions? ... skipped (no tmp uploads folder yet)
Init script exists? ... skipped (omnibus-gitlab has no init script)
Init script up-to-date? ... skipped (omnibus-gitlab has no init script)
Projects have namespace: ... can't check, you have no projects
Redis version >= 2.8.0? ... yes
Ruby version >= 2.5.3 ? ... yes (2.6.5)
Git version >= 2.22.0 ? ... yes (2.24.2)
Git user has default SSH configuration? ... yes
Active users: ... 1
Is authorized keys file accessible? ... yes
Elasticsearch version 5.6 - 6.x? ... skipped (elasticsearch is disabled)

Checking GitLab App ... Finished

Checking GitLab subtasks ... Finished
```
Possible fixes
CarrierWave's `delete_tmp_file_after_storage` flag triggers the removal of the cache directory while we are still holding the file open. The flag can be disabled, but replacing the cleanup it performs would probably require a fair amount of work.
It's not entirely clear to me why we are holding the file open. The previous implementation did not hold the file open, but perhaps this was causing problems.
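If disabling the flag turns out to be viable, the setting itself is a one-line change (a sketch only, not a complete fix; GitLab would still need its own cleanup for the tmp files CarrierWave would no longer delete):

```ruby
# Sketch only: turn off CarrierWave's automatic deletion of cached tmp files
# after storage, so the delete no longer races with the open file descriptor.
# Something else would then have to sweep <ARTIFACTS_PATH>/tmp/cache.
CarrierWave.configure do |config|
  config.delete_tmp_file_after_storage = false
end
```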