Make `import/export` Cloud-Native: use remotely stored `.zip` archives
This issue describes a proposal to make import/export Cloud-Native and scalable in a 12factor app architecture.
Problem
Currently, GitLab's import/export uses a `.tar.gz` archive. This archive is created by GitLab and later transferred to another GitLab instance, where it is imported.
The problem with a `.tar.gz` archive is that it is a streaming archive: on import, the whole archive needs to be transferred and unpacked. It therefore works badly in environments with small temporary storage, and is very inefficient at handling big archives (which is quite common for a feature like import/export).
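To illustrate the difference, here is a minimal Python sketch using the stdlib `tarfile`/`zipfile` modules (member names are made up for illustration): reading one member of a `.tar.gz` means decompressing and scanning the stream, while a `.zip` member can be located directly via the central directory.

```python
import io
import tarfile
import zipfile

# Hypothetical members standing in for an export archive.
files = {
    "project.json": b"{}",
    "repository.bundle": b"x" * 100_000,
    "uploads.txt": b"hi",
}

# Build the same content as .tar.gz and as .zip, in memory.
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w:gz") as tf:
    for name, data in files.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for name, data in files.items():
        zf.writestr(name, data)

# .tar.gz is a streaming format: finding one member means decompressing
# and scanning everything that precedes it.
with tarfile.open(fileobj=io.BytesIO(tar_buf.getvalue()), mode="r:gz") as tf:
    member = tf.getmember("uploads.txt")        # scans the stream
    uploads_from_tar = tf.extractfile(member).read()

# .zip keeps a central directory at the end of the file: the member list
# and any single member are readable without unpacking the rest.
with zipfile.ZipFile(io.BytesIO(zip_buf.getvalue())) as zf:
    names = zf.namelist()                       # central directory only
    uploads_from_zip = zf.read("uploads.txt")   # seek + decompress one member
```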
Import process today
Today the import works as follows:
- The user clicks Browse and picks a `.tar.gz` archive,
- The archive is uploaded to GitLab,
- Workhorse intercepts the archive and stores it on disk,
- Unicorn receives a link to the disk file and transfers it to object storage,
- A new project is created and an import job is scheduled on Sidekiq,
- Sidekiq downloads the archive from object storage to persistent shared storage,
- Sidekiq unpacks the archive to disk, onto persistent shared storage,
- The individual elements of the archive are processed, like `repository.bundle` or `project.json`,
- Once consumed, the disk data is removed, and the object storage archive is removed.
Import process problems
The biggest problems of that process are:
- The need to transfer from Unicorn to Object Storage; this can be solved separately with #37256 (closed),
- The need to transfer from Object Storage to disk when running Sidekiq job,
- The need to extract the archive in Sidekiq job,
- If the Sidekiq job is interrupted at any point, it has to be re-executed fully, including data transfer and data removal,
- A single long-running Sidekiq job is executed for the whole process; this job very often runs for a dozen minutes.
Inefficiency of current solution
- We need to transfer and extract the same data multiple times,
- We require a big temporary storage or attached storage to be present to store the archive at different times of the process, and also have enough space to store all extracted content,
- We cannot validate the contents of the `.tar.gz` ahead of time, so we might be susceptible to tar bombs,
- We do not create the `.tar.gz` on the fly on export, which makes the export similarly inefficient to the import in terms of disk space. This can be solved for `.tar.gz` and `.zip` alike.
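The ahead-of-time validation that `.tar.gz` cannot offer is straightforward with `.zip`, because declared sizes are available in the central directory. A minimal Python sketch (the limit and function name are hypothetical, not GitLab code):

```python
import io
import zipfile

# Hypothetical limit; a real deployment would make this configurable.
MAX_TOTAL_UNCOMPRESSED = 10 * 1024 * 1024


def validate_archive(fileobj):
    """Check declared uncompressed sizes from the central directory alone,
    before extracting anything. Caveat: a malicious archive can lie about
    declared sizes, so a robust importer would also enforce the limit while
    decompressing each member.
    """
    with zipfile.ZipFile(fileobj) as zf:
        total = sum(info.file_size for info in zf.infolist())
        if total > MAX_TOTAL_UNCOMPRESSED:
            raise ValueError(
                f"archive declares {total} bytes, limit is {MAX_TOTAL_UNCOMPRESSED}"
            )
        return total


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("project.json", b"{}")

total = validate_archive(io.BytesIO(buf.getvalue()))  # small archive passes
```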
Proposal
We already have well-developed handling of `.zip` archives. We should prefer to use `.zip` instead of `.tar.gz`.
The `.zip` benefits
- Ability to read the contents of the archive ahead of time,
- Ability to read individual files without transferring the whole archive.
The ~performance improvement
- Ability to reduce I/O pressure during the import process, and thus significantly increase performance,
- Workhorse can send the archive directly to Object Storage,
- Sidekiq can read individual files from archive stored on Object Storage,
- If Sidekiq is interrupted, we can retry a very specific part of the process,
- We can run multiple Sidekiq jobs in parallel that process a single import archive,
- The ability to work on a remotely stored file reduces GitLab's requirements on internal or shared storage: we do not have to transfer the files to the local filesystem,
- The ability to use a remote file makes GitLab fully Cloud-Native from the perspective of the 12factor definition: GitLab works with Object Storage and does not require any local storage to execute the work (or is fine with a small, temporary ephemeral storage).
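As a rough illustration of the parallelism point, in the sketch below (Python stdlib, hypothetical member names; threads stand in for Sidekiq jobs) each worker decompresses exactly one member independently, with no shared extracted tree on disk:

```python
import io
import zipfile
from concurrent.futures import ThreadPoolExecutor

# Hypothetical export archive with independently processable parts.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("project.json", b"{}")
    zf.writestr("repository.bundle", b"x" * 4096)
zip_bytes = buf.getvalue()


def process_entry(name):
    # Each worker opens its own handle, seeks via the central directory,
    # and decompresses just one member -- mirroring how independent jobs
    # could each take one part of the import.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return name, len(zf.read(name))


with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
    names = zf.namelist()

with ThreadPoolExecutor(max_workers=4) as pool:
    sizes = dict(pool.map(process_entry, names))
```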
Prior work
We already have good tooling implemented in GitLab and Workhorse to handle `.zip` archives efficiently:
- `gitlab-zip-metadata`: generates simple metadata describing the contents of a zip archive: the list of files and their sizes,
- `gitlab-zip-cat`: reads an individual file from a `.zip` archive, whether remote (stored on Object Storage) or local, by accessing the ZIP central directory and seeking to decompress the individual file.
An example of the `.zip` random-access feature is present in the usage of artifacts:
- Artifacts are validated by the `metadata` feature,
- The artifacts browser uses `metadata` to present a list of files,
- Individual files are read directly from Object Storage with `gitlab-zip-cat` via the `raw` feature.
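The random-access technique behind this kind of tooling can be sketched in Python: `zipfile` only needs `seek()`/`read()` on its file object, so a wrapper that counts fetched bytes shows how little of a remote archive must be transferred to read one member. The `RangeReader` class is a hypothetical stand-in (a real one would issue HTTP Range requests against Object Storage), not GitLab's implementation:

```python
import io
import zipfile


class RangeReader(io.RawIOBase):
    """Stand-in for a remote, range-addressable object. It serves reads
    from an in-memory blob and counts the bytes fetched; a real version
    would translate seek()/read() into HTTP Range requests.
    """

    def __init__(self, blob):
        self._blob = blob
        self._pos = 0
        self.bytes_fetched = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = len(self._blob) + offset
        return self._pos

    def read(self, size=-1):
        if size is None or size < 0:
            size = len(self._blob) - self._pos
        chunk = self._blob[self._pos:self._pos + size]
        self._pos += len(chunk)
        self.bytes_fetched += len(chunk)
        return chunk


# An archive where one member dwarfs the one we actually want.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("repository.bundle", b"x" * 1_000_000)
    zf.writestr("project.json", b"{}")
blob = buf.getvalue()

reader = RangeReader(blob)
with zipfile.ZipFile(reader) as zf:
    data = zf.read("project.json")  # end record + central directory + one member
```

Only a few kilobytes of the ~1 MB archive are fetched: the end-of-central-directory record, the central directory, and the one member's local header and data.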