Optimize BulkImports import/export mechanisms to improve individual job performance
## 👀 What
This issue is to discuss options to improve the performance, overall background job duration, and memory consumption of the BulkImports Import & Export workers.
These workers are:
- `BulkImports::RelationExportWorker`, which exports a top-level relation (e.g. all merge requests of a project), packs it into `merge_requests.ndjson.gz`, and uploads it to object storage
- `BulkImports::PipelineWorker`, which imports the mentioned relations
## 🤔 Why
For big projects, import/export of such relations can take a long time. For example, for 15k pipelines:

- `BulkImports::RelationExportWorker` takes more than 30 minutes to export
- `BulkImports::PipelineWorker` takes more than 30 minutes to import
This affects our availability metrics and breaches the overall goal that background jobs should take no longer than 5 minutes to complete. Because of the long duration, these workers need to be treated differently by placing them in 'special' Sidekiq shards, which is not ideal:
- Export workers need to be put in the 'memory bound' shard, which has limited resources, so that they do not consume too much RAM
- Import workers need to be put in the 'imports' shard, so that long-running jobs are not interrupted
Because jobs can run for a long time, this causes:

- Extra support issues, when customers either can't export or experience significant delays
- Long queue sizes and delays in execution, since the 'special' Sidekiq shards are limited in resources
- Extra maintenance costs: investigations and working with infrastructure, support & ProServ teams to unblock customers
- Breaches of our error budget SLAs, forcing us to ask for Import team exceptions
## 🔬 Potential improvements
- Similarly to the GitHub Importer, perform import of one object per job, instead of one job per entire relation. This reduces individual job duration significantly, but increases the number of jobs required. This is acceptable, because the overall aim for background processing is to have smaller, more frequent jobs. We've brainstormed ideas on how this can be achieved here.
- Similarly to the import side, perform export of a single object (or a batch of objects) per job. This also lowers individual job durations and removes the need for 'special treatment'.
- Move away from `tar` archives in favour of `zip`. Zip allows reading individual files directly from object storage, without the need to download the whole archive locally. This is described in detail in this issue. Reading individual files directly from object storage will speed things up and save disk space. Can we also append to the archive directly in object storage? That would make exporting individual objects a much easier task.
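The first two bullets boil down to fanning one big relation out into many small jobs. A minimal sketch of that idea, in plain Ruby — the worker name, batch size, and in-memory queue are all hypothetical stand-ins (with real Sidekiq this would be a `perform_async` call per batch), not the actual BulkImports classes:

```ruby
# Fan out one relation into one job per small batch of objects, instead of a
# single job that processes the entire relation for 30+ minutes.
BATCH_SIZE = 100

# Stand-in for Sidekiq's client API: collect "enqueued" jobs in an array.
QUEUE = []

def enqueue_pipeline_batches(relation_ids)
  relation_ids.each_slice(BATCH_SIZE) do |batch|
    # With real Sidekiq this would be e.g. PipelineBatchWorker.perform_async(batch)
    QUEUE << { worker: "PipelineBatchWorker", args: batch }
  end
end

# The 15k-pipelines example from above becomes 150 short jobs
# instead of one long-running one.
enqueue_pipeline_batches((1..15_000).to_a)
puts QUEUE.size
```

Each batch job is short enough to fit the 5-minute goal, is individually retryable, and needs no 'special' shard.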
Some of the things done can be also applied to file-based Project/Group Import/Export.
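On the `tar` vs `zip` point above: the reason zip permits reads directly from object storage is that its index (the central directory) sits at the end of the archive, so a client can fetch just the tail with a ranged GET instead of downloading everything. A sketch in plain Ruby, parsing the End of Central Directory (EOCD) record of a minimal empty archive built in memory (real code would also scan backwards over a trailing archive comment, which is omitted here):

```ruby
# The EOCD record is the last 22 bytes of a comment-less zip archive:
# signature (4), disk numbers (2+2), entry counts (2+2),
# central directory size (4) and offset (4), comment length (2).
EOCD_SIGNATURE = "PK\x05\x06".b

# Minimal empty zip: an EOCD record with all fields zeroed.
empty_zip = EOCD_SIGNATURE + Array.new(7, 0).pack("v4V2v")

def read_eocd(bytes)
  # In production this would be a ranged GET of the object's tail;
  # here we just slice the in-memory string.
  tail = bytes[-22..]
  raise "not a zip (EOCD not found)" unless tail&.start_with?(EOCD_SIGNATURE)

  _disk, _cd_disk, _disk_entries, total_entries, cd_size, cd_offset, _comment_len =
    tail.byteslice(4, 18).unpack("v4V2v")

  # cd_offset/cd_size tell us exactly which byte range holds the file index,
  # so listing or extracting one file needs only two ranged reads.
  { entries: total_entries, central_dir_size: cd_size, central_dir_offset: cd_offset }
end

puts read_eocd(empty_zip)
```

A gzipped tar offers no such index: finding one file means decompressing and scanning the stream from the start, which is why the current format forces a full download.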
Edited by George Koltsov