GitLab Migration - import 1 object per job instead of 1 job per entire relation
## Problem
We need to improve both the performance and the memory consumption of the BulkImports import & export workers.
See the epic for detailed information.
## Proposed solution
As a first step, GitLab Migration should import one object per job instead of running one job per entire relation. This significantly reduces the duration of each job, while increasing the number of jobs required. That trade-off is acceptable, because the overall aim for background processing is smaller, more frequent jobs.
## Technical details
- Instead of processing the whole relation, PipelineWorker now only downloads and decompresses the exported relation from the source (e.g. labels) and, for each individual object in the NDJSON file, enqueues a new worker to process it
- The new worker performs the same procedure as PipelineWorker did previously: transform/sanitize the object, convert it to an ActiveRecord model, and save it
- After enqueuing many jobs to process individual objects, we still need a way to keep track of when all of the objects have been processed. Similar to the GitHub Importer, we could use `Gitlab::JobWaiter` for this, without having to track each object's import state in the database.
- We're still likely to process 'binary file' relations the same way as before, at least until we implement the ability to read individual files from a zip archive directly from object storage
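The per-object flow described above can be sketched roughly as follows. This is a simplified illustration, not the actual GitLab code: the class name `RelationObjectWorker` and the helper `enqueue_relation_objects` are hypothetical, and the worker here just records its arguments in memory where the real one would be a Sidekiq worker.

```ruby
require 'json'

# Hypothetical per-object worker (assumed name, not a real GitLab class).
# In production this would be a Sidekiq worker; here we record enqueued
# arguments in memory to illustrate the fan-out.
class RelationObjectWorker
  @enqueued = []

  class << self
    attr_reader :enqueued

    # Stand-in for Sidekiq's perform_async: one call per exported object.
    def perform_async(relation, object_attributes)
      enqueued << [relation, object_attributes]
    end
  end
end

# Instead of importing the whole relation in a single job, walk the
# decompressed NDJSON export and enqueue one job per object.
def enqueue_relation_objects(relation, ndjson)
  ndjson.each_line do |line|
    line = line.strip
    next if line.empty?

    RelationObjectWorker.perform_async(relation, JSON.parse(line))
  end
end

ndjson = <<~NDJSON
  {"title":"bug","color":"#ff0000"}
  {"title":"feature","color":"#00ff00"}
NDJSON

enqueue_relation_objects('labels', ndjson)
RelationObjectWorker.enqueued.size # => 2
```

Each enqueued job then independently transforms and persists its single object, so a failure or retry affects one object rather than the whole relation.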
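To show why `Gitlab::JobWaiter`-style tracking avoids per-object state in the database, here is a simplified in-memory analogue (the real `Gitlab::JobWaiter` coordinates through Redis and blocks with a timeout; `SimpleJobWaiter` below is an assumption-laden sketch of the idea only):

```ruby
# Simplified, in-memory analogue of Gitlab::JobWaiter. The real class
# uses a Redis key so workers on different machines can notify it; the
# counter-decrement idea is the same: no per-object rows in the database.
class SimpleJobWaiter
  attr_reader :key, :jobs_remaining

  def initialize(jobs_remaining, key)
    @jobs_remaining = jobs_remaining
    @key = key
  end

  # Each per-object worker calls notify with its job id once it finishes.
  def notify(_jid)
    @jobs_remaining -= 1
  end

  # The enqueuing side polls (or, in the real class, blocks) on this.
  def finished?
    @jobs_remaining <= 0
  end
end

# Enqueue 3 object jobs, then wait for 3 notifications.
waiter = SimpleJobWaiter.new(3, 'bulk_imports:labels:123')
3.times { |i| waiter.notify("jid-#{i}") }
waiter.finished? # => true
```

The pipeline only needs to store the waiter key and the expected count, which is why individual objects' import states never have to be written to the database.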