Make `import/export` Cloud-Native: use remotely stored `.zip` archives
This issue describes a proposal to make import/export Cloud-Native and scalable in a 12factor app architecture.
Problem
Currently, GitLab's import/export uses a `.tar.gz` archive. This archive is created by GitLab and later transferred to another GitLab instance, where it is imported.
The problem with a `.tar.gz` archive is that it is a streaming archive: on import, the whole archive needs to be transferred and unpacked. It therefore works badly in environments with small temporary storage, and is very inefficient at handling big archives (which is quite common for a feature like import/export).
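To illustrate the difference, here is a minimal Python sketch using the stdlib `tarfile`/`zipfile` modules (member names are made up for illustration): reading one member of a `.tar.gz` means decompressing and scanning the stream, while a `.zip` member can be located directly via the central directory.

```python
import io
import tarfile
import zipfile

# Hypothetical members standing in for an export archive.
files = {
    "project.json": b"{}",
    "repository.bundle": b"x" * 100_000,
    "uploads.txt": b"hi",
}

# Build the same content as .tar.gz and as .zip, in memory.
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w:gz") as tf:
    for name, data in files.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for name, data in files.items():
        zf.writestr(name, data)

# .tar.gz is a streaming format: finding one member means decompressing
# and scanning everything that precedes it.
with tarfile.open(fileobj=io.BytesIO(tar_buf.getvalue()), mode="r:gz") as tf:
    member = tf.getmember("uploads.txt")        # scans the stream
    uploads_from_tar = tf.extractfile(member).read()

# .zip keeps a central directory at the end of the file: the member list
# and any single member are readable without unpacking the rest.
with zipfile.ZipFile(io.BytesIO(zip_buf.getvalue())) as zf:
    names = zf.namelist()                       # central directory only
    uploads_from_zip = zf.read("uploads.txt")   # seek + decompress one member
```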
Import process today
Today the import works as follows:
- The user clicks Browse and picks a `.tar.gz` archive,
- The archive is uploaded to GitLab,
- Workhorse intercepts the archive and stores it on disk,
- Unicorn receives a link to the disk file and transfers it to object storage,
- A new project is created and an import job is scheduled on Sidekiq,
- Sidekiq downloads the archive from object storage to persistent shared storage,
- Sidekiq unpacks the archive to disk, onto persistent shared storage,
- The individual elements of the archive are processed, like `repository.bundle` or `project.json`,
- Once consumed, the disk data is removed, and the object storage archive is removed.
Import process problems
The biggest problems of that process are:
- The need to transfer from Unicorn to Object Storage; this can be solved separately with #37256 (closed),
- The need to transfer from Object Storage to disk when running Sidekiq job,
- The need to extract the archive in Sidekiq job,
- If the Sidekiq job is interrupted at any point, it has to be re-executed fully, including data transfer and data removal,
- A single long-running Sidekiq job is executed for the whole process; this job very often runs for a dozen minutes.
Inefficiency of current solution
- We need to transfer and extract the same data multiple times,
- We require a big temporary storage or attached storage to be present to store the archive at different times of the process, and also have enough space to store all extracted content,
- We cannot validate the contents of the `.tar.gz` ahead of time, so we might be susceptible to tar bombs,
- We do not create the `.tar.gz` on the fly on export, which makes the export similarly inefficient to the import in terms of disk space. This can be solved for `.tar.gz` and `.zip` alike.
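The ahead-of-time validation that `.tar.gz` cannot offer is straightforward with `.zip`, because declared sizes are available in the central directory. A minimal Python sketch (the limit and function name are hypothetical, not GitLab code):

```python
import io
import zipfile

# Hypothetical limit; a real deployment would make this configurable.
MAX_TOTAL_UNCOMPRESSED = 10 * 1024 * 1024


def validate_archive(fileobj):
    """Check declared uncompressed sizes from the central directory alone,
    before extracting anything. Caveat: a malicious archive can lie about
    declared sizes, so a robust importer would also enforce the limit while
    decompressing each member.
    """
    with zipfile.ZipFile(fileobj) as zf:
        total = sum(info.file_size for info in zf.infolist())
        if total > MAX_TOTAL_UNCOMPRESSED:
            raise ValueError(
                f"archive declares {total} bytes, limit is {MAX_TOTAL_UNCOMPRESSED}"
            )
        return total


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("project.json", b"{}")

total = validate_archive(io.BytesIO(buf.getvalue()))  # small archive passes
```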
Proposal
We already have well-developed handling of `.zip` archives. We should prefer to use `.zip` instead of `.tar.gz`.
The `.zip` benefits
- Ability to read the contents of the archive ahead of time,
- Ability to read individual files without transferring the whole archive.
The ~performance improvement
- Ability to reduce I/O pressure during the import process, and thus significantly increase performance,
- Workhorse can send the archive directly to Object Storage,
- Sidekiq can read individual files from archive stored on Object Storage,
- If Sidekiq is interrupted, we can retry a very specific part of the process,
- We can run multiple Sidekiq jobs in parallel that process a single import archive,
- The ability to work on a remotely stored file reduces GitLab's requirements on internal or shared storage: we do not have to transfer the files to the local filesystem,
- The ability to use a remote file makes GitLab fully Cloud-Native from the perspective of the 12factor definition: GitLab works with Object Storage and does not require any local storage to execute the work (or is fine with a small, temporary ephemeral storage).
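As a rough illustration of the parallelism point, in the sketch below (Python stdlib, hypothetical member names; threads stand in for Sidekiq jobs) each worker decompresses exactly one member independently, with no shared extracted tree on disk:

```python
import io
import zipfile
from concurrent.futures import ThreadPoolExecutor

# Hypothetical export archive with independently processable parts.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("project.json", b"{}")
    zf.writestr("repository.bundle", b"x" * 4096)
zip_bytes = buf.getvalue()


def process_entry(name):
    # Each worker opens its own handle, seeks via the central directory,
    # and decompresses just one member -- mirroring how independent jobs
    # could each take one part of the import.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return name, len(zf.read(name))


with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
    names = zf.namelist()

with ThreadPoolExecutor(max_workers=4) as pool:
    sizes = dict(pool.map(process_entry, names))
```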
Prior work
We already have good tooling implemented in GitLab and Workhorse to handle `.zip` archives efficiently:
- `gitlab-zip-metadata`: generates simple metadata describing the contents of a zip archive: the list of files and their sizes,
- `gitlab-zip-cat`: reads an individual file from a `.zip` archive, whether remote (stored on Object Storage) or local, by accessing the ZIP central directory and seeking to decompress the individual file.
An example of the `.zip` random-access feature is present in the usage of artifacts:
- Artifacts are validated by the `metadata` feature,
- The artifacts browser uses `metadata` to present a list of files,
- Individual files are read directly from Object Storage with `gitlab-zip-cat` via the `raw` feature.
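The random-access technique behind this kind of tooling can be sketched in Python: `zipfile` only needs `seek()`/`read()` on its file object, so a wrapper that counts fetched bytes shows how little of a remote archive must be transferred to read one member. The `RangeReader` class is a hypothetical stand-in (a real one would issue HTTP Range requests against Object Storage), not GitLab's implementation:

```python
import io
import zipfile


class RangeReader(io.RawIOBase):
    """Stand-in for a remote, range-addressable object. It serves reads
    from an in-memory blob and counts the bytes fetched; a real version
    would translate seek()/read() into HTTP Range requests.
    """

    def __init__(self, blob):
        self._blob = blob
        self._pos = 0
        self.bytes_fetched = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = len(self._blob) + offset
        return self._pos

    def read(self, size=-1):
        if size is None or size < 0:
            size = len(self._blob) - self._pos
        chunk = self._blob[self._pos:self._pos + size]
        self._pos += len(chunk)
        self.bytes_fetched += len(chunk)
        return chunk


# An archive where one member dwarfs the one we actually want.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("repository.bundle", b"x" * 1_000_000)
    zf.writestr("project.json", b"{}")
blob = buf.getvalue()

reader = RangeReader(blob)
with zipfile.ZipFile(reader) as zf:
    data = zf.read("project.json")  # end record + central directory + one member
```

Only a few kilobytes of the ~1 MB archive are fetched: the end-of-central-directory record, the central directory, and the one member's local header and data.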