Use ndjson to migrate complex relations
## Context

~"group::import" is working on a better GitLab-to-GitLab migration experience (https://gitlab.com/groups/gitlab-org/-/epics/2771). The main goal is to provide a way to migrate between GitLab instances *with one click*, where the user doesn't have to handle files. To achieve that, the backend engineers decided to use GitLab GraphQL as the way to extract data from the source GitLab instance.

## The problem: Deeply nested associations

_This problem was originally discussed in https://gitlab.com/gitlab-org/gitlab/-/issues/326757_

GitLab's data structure is a tree; for a Group it looks something like:

```
- Group # Level 0
  - Group Labels # Level 1
  - Epics # Level 1
    - Labels # Level 2
    - AwardEmoji # Level 2
    - Events # Level 2
    - Notes # Level 2
      - AwardEmoji # Level 3
```

*We usually refer to Level 2+ entities as subrelations.*

#### **Level 0 and Level 1** (Example: Migrate Group or Epic)

Data from these levels has, so far, been straightforward to migrate without big challenges.

#### **Level 2+** (Example: Migrate Epic's subrelations)

Migrating data from this level starts to present two main challenges:

1. Dependency on the level above, Level N-1
1. Multi-level pagination. For instance, we have to paginate the epics and then each epic's subrelations

Our current approach is, for each Epic subrelation type (Notes, Events, etc.), to iterate over the Epics and then fetch all the subrelation data, which might itself be paginated. Given that we have 5000 Epics, this means we'll make **at least** 5000 network requests to the source GitLab instance for each subrelation type, and more if some Epics have more subrelations than a page size. This generates a lot of web requests, which is a performance problem, and will probably run into rate limiting as well.

During the evaluation of the problem it was suggested to use a GraphQL union, but even if we fetch more than one subrelation type in the same query, we might still end up with a lot of web requests.
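The request math can be sketched as a quick back-of-the-envelope calculation; `PAGE_SIZE` and `count_requests` are illustrative names, not real GitLab code:

```python
# Illustrative sketch of why per-epic subrelation fetching costs at least
# one web request per epic: each epic needs ceil(items / PAGE_SIZE) pages,
# with a minimum of one request even when it has no subrelations.
PAGE_SIZE = 100  # hypothetical GraphQL page size

def count_requests(epic_count, items_per_epic):
    """Requests needed to fetch one subrelation type for every epic."""
    total = 0
    for _ in range(epic_count):
        pages = max(1, -(-items_per_epic // PAGE_SIZE))  # ceiling division
        total += pages
    return total

print(count_requests(5000, 10))   # 5000: one page of notes per epic
print(count_requests(5000, 250))  # 15000: three pages of notes per epic
```

So the floor is one request per epic per subrelation type, and it only grows as subrelations spill over page boundaries.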
To illustrate that, I created, locally, the following extreme scenario ([internal spreadsheet link](https://docs.google.com/spreadsheets/d/1UUEdXjQIUr5NUM8Cdi_GvsFQrdgJ2o18b4fbmTJ1Mlo/edit?usp=sharing)):

<img alt="image" src="/uploads/4679e3c0628a092609827ba20940c57a/image.png" width="1000"/>

## Proposed Solution

Use ndjson files for complex relations, i.e. relations that have deeply nested associations.

### Why not GraphQL?

When we decided to use GraphQL for this project, the idea was to extract data from GitLab through a mechanism that other users rely on as well, which would give us a well-maintained point of extraction. We knew then that GraphQL didn't have all the APIs required for a whole group migration, but the idea was to contribute to GraphQL's evolution with the migration's necessities.

One thing we didn't know, though, was how hard large, deeply nested relations can be to query with GraphQL. Group migration gets complex with Epics and Boards, but Project migration would have even more complex queries, with Merge Requests, Issues and CI pipelines.

All that said, it's clear now that GraphQL is very useful for fetching flat-ish information. But for some large and deeply nested data, the number of web requests and the complexity of the GraphQL queries would be hard to maintain and would degrade migration performance.

### Why ndjson?

It requires far fewer web requests than GraphQL and doesn't have the GraphQL query complexity problem.

From the example in the [image above](#level-2-example-migrate-epics-subrelations), I seeded my local database with those numbers of objects and created a Group export; it generated quite a large file. Extracting only the `epics.ndjson` and re-compressing it:

```
$ du -h epics.ndjson*
4.1G epics.ndjson
256M epics.ndjson.tar.gz
```

### Mixed approach

Introducing ndjson doesn't mean we'll move to using only ndjson. The goal is, instead, to *"use the right tool for the right problem"*.
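For context on the format itself: ndjson stores one JSON document per line, so an epic and its whole subrelation tree travel as a single record, and the file can be processed line by line. A minimal sketch (the file name and fields are illustrative, not the real export schema):

```python
import json

# Two hypothetical epics, each carrying its nested subrelations inline.
epics = [
    {"title": "Epic 1",
     "notes": [{"body": "a note", "award_emoji": [{"name": "thumbsup"}]}]},
    {"title": "Epic 2", "notes": []},
]

# Write one JSON document per line: that's all ndjson is.
with open("epics.ndjson", "w") as f:
    for epic in epics:
        f.write(json.dumps(epic) + "\n")

# The importer can stream one epic (with all of its subrelations) at a
# time, without loading the whole file into memory.
with open("epics.ndjson") as f:
    titles = [json.loads(line)["title"] for line in f]

print(titles)  # ['Epic 1', 'Epic 2']
```

The line-per-record layout is what makes a 4.1G file like the one above tractable: the importer never needs more than one record in memory.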
As mentioned, GraphQL is great and will still be part of our "utility belt", but re-introducing ndjson as a way to import large amounts of complex data will help us deliver the best value with the best performance to the user.

#### Approach

```mermaid
sequenceDiagram
    User ->>+ Destination: Import From Source <br> (credentials)
    Destination -->>- User: OK
    par
        Destination ->>+ Source: Import _Flat_ Relations
    and
        Destination ->>+ Source: Generate epics.ndjson
    end
    Source -->>- Destination: OK
    Source -->>- Destination: OK
    note over Destination, Source: Destination orders the epics.ndjson file <br> and passes the callback information <br> so Source can send the file <br> when it's done. <br>---<br> The file can be generated while <br> the _flat_ relations are imported.
    Destination ->>+ Source: Download epics.ndjson
    Source ->>- Destination: epics.ndjson
    activate Destination
    Destination -->> Destination: process epics.ndjson
    Destination -->> User: Done
    deactivate Destination
```

#### Pros

* ~"group::import" developers already have expertise in ndjson file creation
* Most of the POC code already existed in the Project/Group Export/Import
* Minimal number of web requests
* Since we generate files with the subrelations (and the subrelations' own subrelations) included, no per-epic pagination against the source is needed

#### Concerns

The current file-based Export/Import has some known problems:

* File implications: currently we have problems with file size limits, packing/unpacking files, etc. To avoid that, we'll use individual files for the complex relations, which can be paginated, increasing the resilience of the migration;
* Deeply nested validations generate *cryptic* error messages. This was discussed in the [POC](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/58404#note_547452706) and can be addressed as an implementation detail.

### Next steps :footprints:

1. Keep the current *flat* relations that are independently imported, like `Group`, `GroupLabels`, `Members`, `Milestones`, `Iterations` and `Badges`.
   Besides, with the introduction of the concurrent model (https://gitlab.com/gitlab-org/gitlab/-/merge_requests/57153) most of them will be able to run in parallel, which will avoid any possible performance issues.
1. Change the Epics and the Epic-dependent import process to the ndjson file approach.
1. Add an endpoint to export *one* relation file. The existing `POST /groups/:id/export` endpoint (https://docs.gitlab.com/ee/api/group_import_export.html#group-importexport-api) exports the whole group. We need something like `POST /groups/:id/export/:relation`, which would generate only the `:relation` file.
   * `:relation` - would be a key from `tree/group` or `ee/tree/group` in https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/import_export/group/import_export.yml. So, for groups, this would be used for `epics` and `boards`, which are the group subrelations that contain many deeply nested subrelations.
   * The file is generated and its URL (ObjectStorage) is sent back in the callback.
1. When the Destination receives the request with the file URL:
   * If the previous stages are finished (stages are being introduced in the concurrent model https://gitlab.com/gitlab-org/gitlab/-/merge_requests/57153)
   * Process the file with the right `BulkImports::Pipeline`
   * The extraction phase would be downloading the file from the source
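The extraction phase of the last step can be sketched as follows; `extract` and the fake `download` callable are hypothetical stand-ins for illustration, not the real `BulkImports::Pipeline` API:

```python
import json

# Hypothetical destination-side extraction: fetch the relation's ndjson
# file from the source (`download` is any callable mapping a file name to
# its raw contents, standing in for an ObjectStorage fetch) and yield one
# parsed record per line.
def extract(download, relation):
    raw = download(f"{relation}.ndjson")
    for line in raw.splitlines():
        if line.strip():  # skip blank lines defensively
            yield json.loads(line)

# Fake "source" for illustration.
fake_storage = {"epics.ndjson": '{"iid": 1}\n{"iid": 2}\n'}
records = list(extract(fake_storage.get, "epics"))
print([r["iid"] for r in records])  # [1, 2]
```

Because `extract` is a generator, the rest of the pipeline (transform, load) can consume records one at a time, which keeps memory flat no matter how large the relation file is.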