
in-flight project imports are deleted by import_export_project_cleanup_worker when disk (not object storage) is used

Summary

When GitLab imports a project, it expands the tar.gz file temporarily to disk.

There's a sidekiq job import_export_project_cleanup_worker which runs every hour to ensure that these files get deleted, should they be abandoned.

The ImportExportCleanUpService code specifies 1440 minutes (24 hours) as the threshold of this housekeeping.

It uses a unix find command to delete the files and directories:

find #{path} -not -path #{path} -mmin +#{mmin} -delete

This doesn't appear to work as expected.

It operates on the modification times stored inside the tar.gz file (tar restores them on extraction), and does not appear to check whether that export is actually still in use.

Result: once a project export has been expanded onto disk, it potentially has only seconds or minutes before the sidekiq job will delete the files off disk.
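The timestamp behaviour is easy to demonstrate outside GitLab: tar restores each entry's original mtime on extraction, so a freshly extracted file can already look more than 24 hours old to find. A minimal sketch (paths and filenames are illustrative, not the GitLab code):

# Illustrative only. Simulate a 2-day-old file inside an export archive,
# then extract it and run the same expression the cleanup service uses.
mkdir -p /tmp/cleanup-demo && cd /tmp/cleanup-demo
touch -d '2 days ago' project.ndjson
tar -czf export.tar.gz project.ndjson && rm project.ndjson

mkdir extract && tar -xzf export.tar.gz -C extract    # the "in-flight import" starts here
find extract -not -path extract -mmin +1440           # matches extract/project.ndjson immediately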

Workaround

Self-managed: disable and delete the job in Sidekiq when a project import is required (UI steps below; a rails console equivalent follows the list).

This workaround is not permanent. When you restart GitLab (or just sidekiq - gitlab-ctl restart sidekiq) the job is reinstated. (Verified ONLY on GitLab 13.12.)

  • Admin Area -> Monitoring -> Background Jobs -> Cron tab
  • Locate import_export_project_cleanup_worker (eg: ctrl-f and search for it)


  • Disable it using the Disable button (far right); this will cause it to drop to the end of the list.
  • Locate it again, and then Delete it; otherwise the job will re-enable itself.


  • Now perform the project import(s).
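For instances with shell access, the same disable-and-delete can be done from the command line. A minimal sketch using gitlab-rails runner, assuming the cron entry is registered under the worker's name (list the entries first to confirm); the restart caveat above still applies:

# List the registered cron entries to confirm the name (assumption: it matches the worker name).
sudo gitlab-rails runner 'puts Sidekiq::Cron::Job.all.map(&:name)'

# Disable and remove the entry; it will be reinstated when Sidekiq restarts, as noted above.
sudo gitlab-rails runner '
  job = Sidekiq::Cron::Job.find("import_export_project_cleanup_worker")
  if job
    job.disable!
    job.destroy
  end
'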

Steps to reproduce

  • Obtain a large project export,
    • The QA team test project is a good candidate; see Example Project below.
  • If it's a new export, wait 24 hours for the datestamps in the tar.gz file to age.
  • Locate the storage path (run in a rails console): pp Gitlab::ImportExport.storage_path
  • Monitor the storage path:
export GLPATH="/var/opt/gitlab/gitlab-rails/shared/tmp/gitlab_exports"
# Every two minutes, list the files the cleanup job would consider stale
while true ; do
  date
  find "$GLPATH" -not -path "$GLPATH" -mmin +1440 -type f
  sleep 120
done
  • The Sidekiq cron job runs at :00 every hour, so start the import just before - for example at wall clock xx:50.
  • The target GitLab instance needs to:
    • NOT use object storage
    • Be powerful enough and have enough memory to cope with a large project import, but
    • NOT be so quick that the import completes before the cleanup job runs. Suggest 4 CPUs, and 6-8 GB of free RAM once GitLab is running.
    • Be running the full suite of Sidekiq jobs.
  • Import the project. Using the rake task provides greater observability (example below).
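For reference, the import rake task takes the importing user, the target namespace, the new project path, and the archive location; the values below are placeholders:

# Omnibus install; arguments are username, namespace, new project path, export archive.
sudo gitlab-rake "gitlab:import_export:import[root, mygroup, imported-project, /path/to/export.tar.gz]"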

As the housekeeping runs over a shared location, I would expect it to interfere with imports via the UI, the API, the rake task, or any other mechanism that uses that location on disk.

Example Project

The QA team test project has these specs: 3.5 GB of files, 900k commits, 20k issues, 7k MRs, 1.5k labels - the export file is available here (2.7 GB).

What is the current bug behavior?

Sidekiq deletes the in-flight project import/export files based on the datestamps of the files.

Details

For any project that takes more than 30-60 minutes to import, this bug manifests by making it appear that GitLab either cannot import that project, or will only do so intermittently or partially.

This appears in the following ways:

[1] Any element of the project not already processed or imported is no longer available to the import process. Pipelines, for example, are imported after the git repo, merge requests, and issues - so a partial import with no pipelines is a common outcome.

I suspect this ~bug causes the following misbehavior:

  • None of the pipeline-related GPT tests work, because the pipelines were not imported (Customer ticket reference for GitLab team members)
  • The git repo is imported first; database elements follow on later and are in separate NDJSON files (see the illustrative listing after this list). As these are deleted by the job, they don't get imported. Result: partial database import #271596 (closed)
    • This bug is not the only cause of this problem.
  • The avatar is not imported, because the code in avatar_restorer.rb does not find it. (One of a number of issues raised by one customer, ticket for GitLab team members)
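For context, the 13.x NDJSON export format keeps each project feature in its own file, so whatever has not been read yet can disappear independently of the rest. A rough, illustrative listing of an export archive (exact contents vary by version):

tar -tzf export.tar.gz
# VERSION
# project.bundle                        <- the git repository
# tree/project.json                     <- project attributes
# tree/project/issues.ndjson
# tree/project/merge_requests.ndjson
# tree/project/ci_pipelines.ndjson
# ...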

[2] A largish git repo will delay the later stages of the import long enough that all the NDJSON files are deleted. This causes the following error (rake task import):

ERROR -- : Exception: Error importing repository  into group/group/projectname - invalid import format
  • This can appear very arbitrary: a freshly exported project will work, but might hit the 15-hour timeout in stuck_import_job.rb, or it might hit Exception: 4:Deadline Exceeded (cause currently unknown)
  • An attempt to import the project again may be initiated more than 24 hours after the project was exported, so the component files immediately qualify for deletion

What is the expected correct behavior?

Housekeeping of these files does not occur until 24 hours have elapsed since they were written to disk, not since the timestamps recorded inside the export archive.

Relevant logs and/or screenshots

Output of checks

This bug was isolated on 13.12.1, but the code is at least four years old. It affects the 13.x release cycle in a particular way because NDJSON exports use multiple files.

I suspect the NDJSON file currently being read survives the deletion (an already-open file on Linux remains readable after it is unlinked), so GitLab will finish importing any given NDJSON file, and then find that the files for the other project features no longer exist to be imported.
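A quick demonstration of that suspected mechanism (generic shell, nothing GitLab-specific):

echo '{"title":"example"}' > issues.ndjson
exec 3< issues.ndjson      # the importer already has this file open
rm issues.ndjson           # the cleanup job deletes it
cat <&3                    # the open descriptor can still read the contents
exec 3<&-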

Possible fixes

The problematic code is: https://gitlab.com/gitlab-org/gitlab/-/blob/v13.12.0-ee/app/services/import_export_clean_up_service.rb#L30

I suspect there are more robust ways to determine whether an export can be deleted or not. See #332313 (comment 589903036) for a discussion about using -maxdepth with find.
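A rough illustration of that direction (a sketch of the idea in the linked comment, not a tested patch): only evaluate the immediate children of the storage path, whose mtimes reflect when the import or export run wrote them rather than the timestamps restored from the archive:

# Sketch only, not GitLab code: stale top-level entries are removed recursively.
find "$path" -mindepth 1 -maxdepth 1 -mmin +1440 -exec rm -rf {} +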
