# GitLab Migration - investigate migration slowness & max RSS
## Problem
While using GitLab Migration I decided to run a simple test: perform a migration of the same group & projects structure 10 times concurrently and see what happens.
The following group structure was used:
Root group with 2 projects and 1 subgroup; the subgroup has another 2 subgroups. As you can see, there are 2 epics, 2 issues, and 3 MRs. Pretty small setup. I started the import 10 times using the Rails console:
```ruby
Feature.enable(:bulk_import)
Feature.enable(:bulk_import_projects)

10.times { BulkImports::CreateService.new(User.first, [{ source_type: 'group_entity', source_full_path: 'mygroup', destination_name: "helloworld#{rand(1..10000)}", destination_namespace: ''}], { url: 'http://gdk.test:3000', access_token: '<TOKEN>' }).execute }
```
Each import finished without any failures; however, for such a small source group structure it took a surprisingly long time.
- As the `created_at` and `updated_at` timestamps show, it took almost an hour to perform those 10 imports locally on GDK. Why does it take so long?
- Inspecting the logs closer, I can see individual `BulkImports::PipelineWorker` jobs taking ~30 seconds to complete. That seems like a long time. Why would a worker take so long?
- After the run, the Sidekiq admin view shows RSS at ~8 GB, while Mac Activity Monitor shows 17 GB of acquired memory (😧). Are we doing too many allocations within individual jobs that make memory consumption grow?
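One way to start answering the allocation question is to count object allocations around a block, e.g. around a single pipeline run. `GC.stat(:total_allocated_objects)` is a standard Ruby API; the helper name below is just illustrative:

```ruby
# Rough allocation counter: returns how many Ruby objects a block allocated.
# Useful for comparing pipeline runs; the helper name is illustrative.
def allocations_during
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

# Example: allocating 1,000 strings registers at least 1,000 allocations.
count = allocations_during { 1_000.times { String.new } }
```

Wrapping a suspect pipeline's `#run` in this helper would give a concrete per-job allocation number to compare before and after any fix.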
## Possible reasons for migration slowness
- All 10 imports use the same group structure for the migration. Because of that, we execute `BulkImports::ExportRequestWorker` 10 times. The pipeline worker waits for the export to finish, but since the export gets overridden 10 times, can that cause delays? I wonder if we should skip re-exporting relations when the existing export is not older than X amount of time, e.g. 1-3 minutes.
- All 10 imports use the same GitLab instance as both source and destination: a single GDK instance. Can rails-web be slow to respond to Sidekiq, creating the overall delays in execution?
- At the individual bulk import level, we throttle concurrent entity imports to 5 at a time. Can this cause additional delays to the overall migration? Should we leave it up to Sidekiq to manage maximum concurrency? Can this be a potential noisy-neighbour situation for other jobs on the Sidekiq shard?
- At the pipeline worker level, there is a 1-minute re-enqueue delay if the relation export NDJSON file is not yet ready to be downloaded from the source. Can we make this delay smaller?
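Two of the ideas above can be sketched in a few lines. Everything here is hypothetical: `Export`, the freshness window, and the backoff numbers are stand-ins, not names or values from the codebase:

```ruby
# Sketch 1: skip re-requesting an export when a recent one already exists.
# `Export` and EXPORT_FRESHNESS_WINDOW are illustrative stand-ins.
Export = Struct.new(:relation, :updated_at)
EXPORT_FRESHNESS_WINDOW = 180 # seconds, i.e. the suggested 1-3 minutes

def export_stale?(export, now: Time.now)
  export.nil? || (now - export.updated_at) > EXPORT_FRESHNESS_WINDOW
end

def request_export(existing_export, now: Time.now)
  return :reused unless export_stale?(existing_export, now: now)

  :requested # the real worker would enqueue a fresh export here
end

# Sketch 2: replace the fixed 1-minute re-enqueue delay with a capped
# exponential backoff, so early retries happen much sooner.
def reenqueue_delay(attempt, base: 5, cap: 60)
  [base * (2**attempt), cap].min # 5s, 10s, 20s, 40s, then capped at 60s
end
```

With a schedule like this, a pipeline whose export is ready within seconds would re-check after 5 seconds instead of idling for a full minute.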
## Possible reasons for memory consumption
- The transformation elements of our ETL pipelines use the `.then` syntax, which produces object copies and can lead to memory growth. We should update all our transformers to minimize use of this method.
- In general, our pipeline workers operate on a collection of data within a single job (e.g. one job processes 10k issues, another 10k MRs). We should aim to convert this approach and use a '1 object per job' approach, similar to what is done in the GitHub Importer. A potential solution is described in #343444 (comment 717516457)
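To make the copying concern concrete: the `.then` call itself only passes the value along, but each step in the chain typically returns a brand-new hash, leaving the intermediates for the GC. A toy illustration (not actual transformer code), assuming string-keyed hashes:

```ruby
data = { 'title' => 'Issue', 'author_id' => 1 }

# Chained style: every step allocates a new hash.
copied = data
  .then { |d| d.merge('imported' => true) }          # new hash #1
  .then { |d| d.reject { |k, _| k == 'author_id' } } # new hash #2

# In-place style: one working copy, mutated step by step.
in_place = data.dup
in_place['imported'] = true
in_place.delete('author_id')
```

Both produce the same result, but the in-place version allocates a single hash instead of one per transformation step, which matters when a job transforms thousands of records.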
All in all, migration speed should be faster than this: 10 concurrent imports of this size shouldn't take an hour, even allowing for local development performance. The outcome of this issue should be actionable items/issues that can improve GitLab Migration's performance.

