# Introduce `.ndjson` as a way to process imports and exports
## Problem to solve
We've previously identified the need to reduce overall memory consumption for both imports and exports. The problems as of now are:
- Peak memory use is a function of project size (specifically the metadata tree encoding it)
- There are known inefficiencies in the current encoding, such as creating duplicate entries
The current solution therefore doesn't scale: memory use rises with project size, to the extent that in some cases we were unable to process a project at all.
To address these concerns, we propose to introduce a new data-interchange format (DIF) based on `.ndjson`, which would allow us to process imports and exports with approximately constant memory use, i.e. regardless of project size. An early proof of concept has shown very promising results. However, this is a complex undertaking, as a number of things need to happen before we can switch over:
- We need to introduce versioning of import/export to allow us to introduce breaking changes: #35861
- We need to implement
- We need to implement
What follows is an overview of:
- Why we think `ndjson` is a good candidate for solving the problem outlined above
- How a project export would be represented in `ndjson`
- An estimate of the memory savings this would afford us
- General risks and impediments for reaching this goal
ndjson (newline-delimited JSON) is a JSON-based DIF optimized for streaming use cases. Since plain JSON encodes an entity in a single monolithic tree, it needs to either be interpreted in its entirety or tokenized and streamed in parts. Both approaches are problematic for different reasons: the former because the data needs to be loaded into memory in its entirety (which is inefficient for large data sets; it is the approach we're currently taking), the latter because it operates at a very low level of the data structure, making it cumbersome to deal with from a development point of view.
ndjson instead splits a data structure that can be encoded in JSON into smaller JSON values, where each such value is written and read as a single line of text. An `ndjson`-formatted stream is therefore not a valid JSON document, but each line in such a stream is.
The format itself is more formally specified here.
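To make the format concrete, here is a minimal sketch in Python (the `issues` records are made up for illustration) showing that each line is an independent JSON document while the stream as a whole is not:

```python
import json

# Hypothetical records; each entity is serialized as one line of JSON.
issues = [
    {"id": 1, "title": "First issue"},
    {"id": 2, "title": "Second issue"},
]

# Writing: one json.dumps per line -- no enclosing array or object.
ndjson = "\n".join(json.dumps(i) for i in issues) + "\n"

# Reading: each line parses on its own; the concatenated stream would
# not parse as a single JSON document.
parsed = [json.loads(line) for line in ndjson.splitlines()]
assert parsed == issues
```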
This approach has a number of benefits:
- **Constant memory use.** Data can be streamed from files or over the wire line by line, and already-processed lines can be discarded. This means memory use will never exceed the size of the largest single-line entity, which over a sufficiently large number of projects should, on average, mean roughly constant memory use.
- **Familiar concepts & tooling.** Each line is a valid JSON document (i.e. either an array or an object), so no additional tooling or knowledge is needed to process it, meaning there is much less friction compared to other wire formats such as protobuf.
- **Checkpointing.** The line-by-line transmission nature of ndjson gives us a natural way to checkpoint imports and exports, which can be used to implement abort/resume as well as progress indicators/ETAs for end users and frontends. We could, for instance, keep a simple file pointer around that we reset a file to when an import is paused and later resumed.
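The checkpointing idea can be sketched as follows, assuming an in-memory stream and made-up entities; the offset after the last fully processed line serves as the resume point:

```python
import io
import json

# Hypothetical stream of exported entities, one JSON object per line.
stream = io.StringIO('{"id": 1}\n{"id": 2}\n{"id": 3}\n')

processed = []
line = stream.readline()
while line:
    processed.append(json.loads(line))
    checkpoint = stream.tell()   # offset just after the last complete line
    if len(processed) == 2:      # simulate a pause after two entities
        break
    line = stream.readline()

# Resume: seek back to the saved checkpoint and continue where we left off.
stream.seek(checkpoint)
for line in stream:
    processed.append(json.loads(line))

assert [e["id"] for e in processed] == [1, 2, 3]
```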
## Representing GitLab project exports
In terms of encoding e.g. exported projects to ndjson, a natural split would be along the "top-level associations" that we currently define for project exports (e.g. `merge_requests`; see `import_export.yml` for a full list). In this model, each direct child node of `project` would be written to a separate file, which in turn contains one entity entry per line. This has the benefit that we can follow a familiar structure/schema. Moreover, since we would have one file per top-level association, entity-specific post-processing becomes easier: we could, for instance, limit the number of merge requests we export, or transform them, without ever running the application, purely by text-processing `merge_requests.ndjson`. An example break-down:
```
├── [   5]  auto_devops.ndjson
├── [  31]  ci_cd_settings.ndjson
├── [172K]  ci_pipelines.ndjson
├── [ 219]  container_expiration_policy.ndjson
├── [   5]  error_tracking_setting.ndjson
├── [139K]  issues.ndjson
├── [1.2K]  labels.ndjson
├── [2.3M]  merge_requests.ndjson
├── [   5]  metrics_setting.ndjson
├── [2.7K]  milestones.ndjson
├── [ 316]  project_feature.ndjson
├── [1.6K]  project_members.ndjson
├── [1.3K]  project.ndjson
├── [ 745]  protected_branches.ndjson
├── [   5]  service_desk_setting.ndjson
└── [ 589]  services.ndjson
```
A challenge with this approach is that the size distribution can be quite uneven, since some relations contain much more data than others.
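The split itself can be sketched as follows; the "fat JSON" input and the resulting file layout are purely illustrative, not the actual export schema:

```python
import json

# A hypothetical "fat JSON" project export: one monolithic tree.
fat = {
    "name": "my-project",
    "issues": [{"iid": 1}, {"iid": 2}],
    "merge_requests": [{"iid": 10}],
}

# Split along top-level associations: each list-valued child becomes its
# own <relation>.ndjson payload (a dict of filename -> text stands in for
# real files here); scalar project attributes go to project.ndjson.
files = {}
project_attrs = {}
for key, value in fat.items():
    if isinstance(value, list):
        files[f"{key}.ndjson"] = "".join(json.dumps(row) + "\n" for row in value)
    else:
        project_attrs[key] = value
files["project.ndjson"] = json.dumps(project_attrs) + "\n"

assert files["merge_requests.ndjson"] == '{"iid": 10}\n'
```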
## Expected memory savings
Some early measurements can be found in !23920 (closed)
We expect even large relations not to exceed a few dozen KB per line; since this memory can be released once a relation line has been processed, we expect dramatic improvements in memory use. These gains become larger as the project metadata grows. Even with very large individual relations (say 1 MB per line), here is how `ndjson` would compare to a DOM-based approach:
| Total JSON metadata | Max JSON loaded (before) | Max JSON loaded (after) | Reduction |
|---|---|---|---|
## Risks and impediments
A major challenge will be migrating to the new format without compromising user experience too much, since this constitutes a breaking change: older archives exported under the current logic would not be compatible with `ndjson`. We need to decide to what extent we want to keep supporting the older, less efficient format, or whether we prefer a clean break with instructions for users on how to migrate. One suggestion to ease this effort is a dedicated tool that customers can use to transform older project archives to the new format.
There might be a time during which we need to support both formats simultaneously, which will temporarily increase complexity in the code base.
Making sure that we do not compromise existing import/export functionality and correctness is another risk. We have already worked toward better automated monitoring & testing of import/export-related functionality, but large imports especially can break in subtle ways that are difficult to detect.
Finally, since this is a larger effort, we do not expect group::memory to implement every aspect of it, but rather to prepare everything as much as possible and eventually hand the project over to group::import. However, we have already done a high-level issue breakdown, which is summarized below.
Below is a proposed breakdown of the work that needs to get done and how it could be split into increments we can deliver.
## Track 0: Preparatory work
### Project logic (important, in progress)
- move (almost) all code from
- #207846 (closed)
- move (almost) all code from
### Drop or rework `RelationRenameService` (important, #207960 (closed))
- we need to figure out if or how `RelationRenameService` will carry over to `ndjson`, since it duplicates relations it renames, which is not compatible with streaming data
- we might have to drop this -- check with PM
- Similarly, we have recently dropped support for legacy merge request formats in !25616 (merged)
- next step: open MR to remove it
### Introduce a rake task for synchronous exports (optional, in progress)
- an `export.rake` task similar to
- #207847 (closed)
### Collect exporter metrics at regular intervals (optional)
- extend our metrics gathering to also include exporting projects (similar to importing)
- this could be similar to what we did for imports where results are regularly published via a CI pipeline (TODO: create issue for this)
## Track 1: Moving toward ndjson
We can split this track up further to work in parallel on the export and import side of things.
### MR1: Export via streaming serializer, introduce "Writer" abstraction
- Introduce streaming serializer as a drop-in replacement for
- a `Writer` can persist relations in different ways
- it would still produce a "fat JSON", so no ndjson here yet
- there would be no structural changes here yet; it's mostly a refactor
### MR2: Introduce ndjson writer
- this implements a `Writer` that writes ndjson
- based on a feature flag, it can switch between fat JSON and ndjson, or write both outputs
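MR1/MR2 could look roughly like this sketch; all class names and the feature-flag switch are assumptions for illustration, not the actual GitLab code:

```python
import json
from abc import ABC, abstractmethod

# Hypothetical Writer abstraction: the streaming serializer hands each
# relation to a Writer, which decides how to persist it.
class Writer(ABC):
    @abstractmethod
    def write_relation(self, name, rows): ...

class LegacyJsonWriter(Writer):
    """Accumulates everything and emits one 'fat JSON' tree at the end."""
    def __init__(self):
        self.tree = {}
    def write_relation(self, name, rows):
        self.tree[name] = rows
    def result(self):
        return json.dumps(self.tree)

class NdjsonWriter(Writer):
    """Emits one <relation>.ndjson payload per relation, one row per line."""
    def __init__(self):
        self.files = {}
    def write_relation(self, name, rows):
        self.files[f"{name}.ndjson"] = "".join(json.dumps(r) + "\n" for r in rows)

# Feature-flag-style switch between implementations (flag name is made up).
def build_writer(ndjson_enabled):
    return NdjsonWriter() if ndjson_enabled else LegacyJsonWriter()

writer = build_writer(ndjson_enabled=True)
writer.write_relation("issues", [{"iid": 1}])
assert writer.files["issues.ndjson"] == '{"iid": 1}\n'
```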
### MR3: (nice to have) Allow exporting via either `ndjson` or the legacy format (but not both)
- See also Track 2 (need to clarify with product how to achieve that, since it requires user input)
### MR4: Introduce "Reader" abstraction
- a `Reader` can read JSON files for further processing
- its only implementation would be to read fat JSON files
### MR5: Introduce ndjson reader
- this implements the `Reader` that can parse ndjson
- based on feature flag and/or file format, it decides which reader to choose
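The reader side could mirror the writer sketch above; again, all names and the file-format detection are illustrative assumptions:

```python
import io
import json

# Hypothetical Reader abstraction: chosen per archive based on which file
# format is present.
class FatJsonReader:
    def relations(self, text):
        # The whole tree must be parsed up front -- memory grows with archive size.
        for name, rows in json.loads(text).items():
            yield name, rows

class NdjsonReader:
    def rows(self, stream):
        # One line at a time -- memory bounded by the largest single entity.
        for line in stream:
            yield json.loads(line)

def reader_for(filename):
    return NdjsonReader() if filename.endswith(".ndjson") else FatJsonReader()

r = reader_for("merge_requests.ndjson")
rows = list(r.rows(io.StringIO('{"iid": 1}\n{"iid": 2}\n')))
assert [x["iid"] for x in rows] == [1, 2]
```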
## Track 2: Expose ndjson import/export to users
Tackle the smallest possible aspect of #35861.
I would like us to have a way to indicate, as part of the export request, what "format" of export we want. Maybe it should be two radio buttons under the existing text:
> **Export project**
> Export this project with all its related data in order to move your project to a new GitLab instance. Once the export is finished, you can import the file from the "New Project" page.
- "Use legacy version, compatible with GitLab < 12.9" (internally it would create big JSON)
- "Export using new version, compatible with GitLab >= 12.9" (internally it would create ndjson) (default)
We would continue importing legacy/big JSON until 13.0; with 13.1 we would remove support for exporting/importing legacy JSON.
## Links / references
POC MR: !23920 (closed)