in-flight project imports are deleted by import_export_project_cleanup_worker when disk (not object storage) is used
Summary
When GitLab imports a project, it expands the tar.gz
file temporarily to disk.
There's a Sidekiq job, import_export_project_cleanup_worker,
which runs every hour to ensure that these files get deleted should they be abandoned.
The ImportExportCleanUpService code specifies 1440 minutes (24 hours) as the threshold of this housekeeping.
It uses a unix find
command to delete the files and directories:
find #{path} -not -path #{path} -mmin +#{mmin} -delete
This doesn't appear to work as expected.
It operates on the age of the files inside the tar.gz
file, and does not appear to check whether that export is actually still in use.
Result: once a project export has been expanded onto disk, it potentially has only seconds or minutes before the Sidekiq job deletes the files off disk.
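The root cause can be reproduced in isolation. A minimal sketch, assuming GNU tar and GNU find (the temp directory and filenames are illustrative only): extraction restores the mtimes recorded in the archive, so a freshly extracted file can immediately satisfy -mmin +1440.

```shell
# Sketch: tar restores each member's recorded mtime on extraction,
# so a just-extracted file can already look older than 24 hours.
demo=$(mktemp -d)
cd "$demo"
echo 'data' > file.txt
touch -d '2 days ago' file.txt     # simulate a 2-day-old export member
tar czf export.tar.gz file.txt
rm file.txt
tar xzf export.tar.gz              # extraction happens "now"...
# ...but the cleanup's age test still matches it immediately:
find "$demo" -not -path "$demo" -mmin +1440 -name 'file.txt'
```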
Workaround
Self-managed: disable and delete the job in Sidekiq when a project import is required.
This workaround is not permanent. When you restart GitLab (or just Sidekiq: gitlab-ctl restart sidekiq)
the job is reinstated. (Verified ONLY on GitLab 13.12.)
- Go to Admin Area -> Monitoring -> Background Jobs -> Cron tab.
- Locate import_export_project_cleanup_worker (e.g. Ctrl-F and search for it).
- Disable it using the Disable button (far right); this will cause it to drop to the end of the list.
- Locate it again, and then Delete it. Otherwise, the job will re-enable itself.
- Now perform the project import(s).
Steps to reproduce
- Obtain a large project export (the QA team test project described under Example Project below works well).
- If it's a new export, wait 24 hours for the datestamps in the tar.gz file to age.
- Locate the storage path:
pp Gitlab::ImportExport.storage_path
- monitor the storage path:
export GLPATH="/var/opt/gitlab/gitlab-rails/shared/tmp/gitlab_exports"
while true ; do
  date
  find "$GLPATH" -not -path "$GLPATH" -mmin +1440 -type f
  sleep 120
done
- The Sidekiq job runs at :00 every hour, so start the import just before; for example, at wall clock xx:50.
- The target GitLab instance needs to:
- NOT use object storage
- Be powerful enough and have enough memory to cope with a large project import, but
- NOT be so quick that the import completes! Suggested: 4 CPUs, and 6-8 GB of free RAM once GitLab is running.
- Be running the full suite of Sidekiq jobs.
- Import the project. Using the rake task provides greater observability.
As the housekeeping runs over a shared location, I would expect it to interfere with imports via the UI, the API, the rake task, or any other mechanism that uses that location on disk.
Example Project
The QA team test project has these specs: 3.5 Gb Files, 900k commits, 20k issues, 7k MRs, 1.5k labels - the export file is available here (2.7 Gb).
What is the current bug behavior?
Sidekiq deletes the in-flight project import/export files based on the datestamp of the files.
Details
For any project that takes more than 30-60 minutes to import, this bug manifests itself by making it appear that GitLab either cannot import that project, or will only do so intermittently or partially.
The ways this appears are:
[1] Any element of the project not already processed or imported is no longer available to the import process. Pipelines, for example, are imported after the git repo, merge requests, and issues - so a partial import with no pipelines is a common outcome.
I suspect this ~bug causes the following misbehavior:
- None of the pipeline-related GPT tests work, because the pipelines were not imported (Customer ticket reference for GitLab team members)
- The Git Repo is imported first, database elements follow on later and are in separate NDJSON files. As these are deleted by the job, they don't get imported, result: partial database import #271596 (closed)
- This bug is not the only cause of this problem.
- The avatar is not imported, because the code in avatar_restorer.rb does not find it. (One of a number of issues raised by one customer, ticket for GitLab team members)
[2] A largish git repo will delay the later stages of the import long enough that all the NDJSON is deleted. This causes the following error (rake task import):
ERROR -- : Exception: Error importing repository into group/group/projectname - invalid import format
- This can appear very arbitrary: a freshly exported project will work, but might hit the 15 hour timeout in stuck_import_job.rb, or it might hit Exception: 4:Deadline Exceeded (cause currently unknown).
- An attempt to import the project again may be initiated more than 24 hours after the project was exported; the component files therefore immediately qualify for deletion.
What is the expected correct behavior?
Housekeeping of these files does not occur until 24 hours have elapsed.
Relevant logs and/or screenshots
Output of checks
This bug was isolated on 13.12.1, but the code is at least four years old. It affects the 13.x release cycle in a particular way because NDJSON exports use multiple files.
I suspect the NDJSON files are locked on disk while in use, so GitLab will finish importing any given NDJSON file, and then find that the other project features don't exist to be imported.
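The "locked on disk" behaviour is consistent with ordinary Linux unlink semantics rather than file locking. A minimal sketch (the path and file contents are illustrative): deleting a file does not interrupt a process that already holds it open, which would let the currently open NDJSON file finish importing while the rest of the export disappears.

```shell
# Sketch: on Linux, unlinking a file leaves its data readable through
# any file descriptor that was opened before the unlink.
tmp=$(mktemp -d)
echo '{"title":"issue 1"}' > "$tmp/issues.ndjson"
exec 3< "$tmp/issues.ndjson"   # open a read fd, as an importer would
rm "$tmp/issues.ndjson"        # housekeeping deletes the file
cat <&3                        # the open fd still yields the contents
exec 3<&-
```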
Possible fixes
The problematic code is: https://gitlab.com/gitlab-org/gitlab/-/blob/v13.12.0-ee/app/services/import_export_clean_up_service.rb#L30
I suspect there are more robust ways to determine whether an export can be deleted or not. See #332313 (comment 589903036) for a discussion about using -maxdepth with find.
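As a rough illustration of that -maxdepth idea (the one-directory-per-import layout here is an assumption, not necessarily GitLab's actual layout): limiting the scan to top-level entries makes the decision depend on the import directory's own mtime, which is set at extraction time, instead of the stale mtimes stored inside the archive.

```shell
# Assumed layout: one top-level directory per in-flight import.
root=$(mktemp -d)
mkdir "$root/demo-import"
# A freshly extracted file carrying a 2-day-old archive mtime:
touch -d '2 days ago' "$root/demo-import/project.ndjson"
# Current (recursive) behavior matches the in-flight import's files:
find "$root" -not -path "$root" -mmin +1440
# A top-level-only scan does not: demo-import/ itself was created just now.
find "$root" -mindepth 1 -maxdepth 1 -mmin +1440
```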