Optimize BulkImports import/export mechanisms to improve individual job performance
## 👀 What
This issue is to discuss options to improve the performance, overall background job duration, and memory consumption of the BulkImports Import & Export workers.
These workers are:
- `BulkImports::RelationExportWorker`, which exports a top-level relation (e.g. all merge requests of a project), packs it into `merge_requests.ndjson.gz`, and uploads it to object storage
- `BulkImports::PipelineWorker`, which imports the mentioned relations
## 🤔 Why
For big projects, import/export of such relations can take a long time. For example, for 15k pipelines:

- `BulkImports::RelationExportWorker` takes more than 30 minutes to export
- `BulkImports::PipelineWorker` takes more than 30 minutes to import
This affects our availability metrics and breaches the overall goal that background jobs should take no longer than 5 minutes to complete. Because of the long duration, these workers need to be treated differently by placing them in 'special' Sidekiq shards, which is not ideal:
- Export workers need to be put in the 'memory bound' shard, which has limited resources, so that they do not consume too much RAM
- Import workers need to be put in the 'imports' shard, so that long-running jobs are not interrupted
Because jobs can run for a long time, this causes:

- Extra support issues, when customers either can't export or experience significant delays
- Long queue sizes and delays in execution, since the 'special' Sidekiq shards are limited in resources
- Extra maintenance costs: investigations and working with infrastructure, support & ProServ teams to unblock customers
- Breaches of our error budget SLAs, forcing us to ask for Import team exceptions
## 🔬 Potential improvements
- Similarly to the GitHub Importer, perform import of one object per job, instead of one job per entire relation. This reduces individual job duration significantly, but increases the number of jobs required. This is acceptable, because the overall aim for background processing is to have smaller, more frequent jobs. We've brainstormed ideas on how this can be achieved here.
- Similarly to the import side, perform export of a single object (or a batch of objects) per job. This also lowers individual job durations and removes the need for 'special treatment'.
- Move away from `tar` archives in favour of `zip`. Zip allows reading individual files directly from object storage, without the need to download the whole archive locally. This is described in detail in this issue. Reading individual files directly from object storage will speed things up and save disk space. Can we also append to the archive directly in object storage? That would make exporting individual objects a much easier task.
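The first two bullets boil down to fanning one big relation out into many small jobs. A minimal sketch of that idea, in plain Ruby — the worker name, batch size, and in-memory queue are all hypothetical stand-ins (with real Sidekiq this would be a `perform_async` call per batch), not the actual BulkImports classes:

```ruby
# Fan out one relation into one job per small batch of objects, instead of a
# single job that processes the entire relation for 30+ minutes.
BATCH_SIZE = 100

# Stand-in for Sidekiq's client API: collect "enqueued" jobs in an array.
QUEUE = []

def enqueue_pipeline_batches(relation_ids)
  relation_ids.each_slice(BATCH_SIZE) do |batch|
    # With real Sidekiq this would be e.g. PipelineBatchWorker.perform_async(batch)
    QUEUE << { worker: "PipelineBatchWorker", args: batch }
  end
end

# The 15k-pipelines example from above becomes 150 short jobs
# instead of one long-running one.
enqueue_pipeline_batches((1..15_000).to_a)
puts QUEUE.size
```

Each batch job is short enough to fit the 5-minute goal, is individually retryable, and needs no 'special' shard.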
Some of the things done can be also applied to file-based Project/Group Import/Export.
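On the `tar` vs `zip` point above: the reason zip permits reads directly from object storage is that its index (the central directory) sits at the end of the archive, so a client can fetch just the tail with a ranged GET instead of downloading everything. A sketch in plain Ruby, parsing the End of Central Directory (EOCD) record of a minimal empty archive built in memory (real code would also scan backwards over a trailing archive comment, which is omitted here):

```ruby
# The EOCD record is the last 22 bytes of a comment-less zip archive:
# signature (4), disk numbers (2+2), entry counts (2+2),
# central directory size (4) and offset (4), comment length (2).
EOCD_SIGNATURE = "PK\x05\x06".b

# Minimal empty zip: an EOCD record with all fields zeroed.
empty_zip = EOCD_SIGNATURE + Array.new(7, 0).pack("v4V2v")

def read_eocd(bytes)
  # In production this would be a ranged GET of the object's tail;
  # here we just slice the in-memory string.
  tail = bytes[-22..]
  raise "not a zip (EOCD not found)" unless tail&.start_with?(EOCD_SIGNATURE)

  _disk, _cd_disk, _disk_entries, total_entries, cd_size, cd_offset, _comment_len =
    tail.byteslice(4, 18).unpack("v4V2v")

  # cd_offset/cd_size tell us exactly which byte range holds the file index,
  # so listing or extracting one file needs only two ranged reads.
  { entries: total_entries, central_dir_size: cd_size, central_dir_offset: cd_offset }
end

puts read_eocd(empty_zip)
```

A gzipped tar offers no such index: finding one file means decompressing and scanning the stream from the start, which is why the current format forces a full download.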
Edited by George Koltsov