Refactor Project Exports to use Newline delimited JSON
Summary
Our project export process currently produces a single JSON file, which involves building the entire project tree in memory before writing it out. For larger projects this can cause high memory usage and frequent export failures. See https://gitlab.com/gitlab-org/gitlab-ce/issues/35389
We should consider switching over to newline delimited JSON (NDJSON) files for project exports.
This is the format that the "export your data" functionality on many social networks uses.
Basically, the export is a zip (or tar) file with a bunch of files, for example: issues.ndjson, merge_requests.ndjson, etc.
Why is this better than a single JSON file?
- It is possible to easily stream database entities into newline delimited JSON format without consuming lots of memory.
- Each entity can be streamed from the database, serialised to a single JSON line and emitted to the file.
- The entire export does not need to be built up in memory
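The streaming idea in the points above can be sketched roughly as follows. This is a minimal illustration, not the actual export code: `fetch_issues` is a hypothetical stand-in for a batched database cursor, and the in-memory buffer stands in for a file such as issues.ndjson.

```python
import io
import json
from typing import Any, Dict, IO, Iterable, Iterator


def fetch_issues(batch_size: int = 2) -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in for a batched database cursor."""
    rows = [
        {"iid": 1, "title": "First issue"},
        {"iid": 2, "title": "Second issue"},
        {"iid": 3, "title": "Third\nissue"},
    ]
    for start in range(0, len(rows), batch_size):
        # Only one batch is handed out at a time.
        yield from rows[start:start + batch_size]


def write_ndjson(entities: Iterable[Dict[str, Any]], out: IO[str]) -> int:
    """Serialise each entity to one JSON line; return the number written."""
    count = 0
    for entity in entities:
        # json.dumps escapes embedded newlines, so each record stays on one line.
        out.write(json.dumps(entity, separators=(",", ":")) + "\n")
        count += 1
    return count


# In a real export this would be an open file such as issues.ndjson.
buffer = io.StringIO()
written = write_ndjson(fetch_issues(), buffer)
```

Because each record is written as soon as it is produced, peak memory depends on the batch size rather than on the total size of the export.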
At the same time, this approach still holds the advantages of JSON:
- Easy to manipulate using common tools like jq
- An open format with support for the same data types as JSON
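As a rough illustration of how easy the format is to consume, each line can be parsed on its own with any JSON parser. The sketch below assumes hypothetical `iid` and `state` fields on issue records; the same filter could be expressed with jq as `jq -c 'select(.state == "opened")' issues.ndjson`.

```python
import io
import json
from typing import Any, Dict, IO, Iterator


def read_ndjson(lines: IO[str]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Sample data with assumed field names; a real import would open issues.ndjson.
sample = io.StringIO(
    '{"iid":1,"state":"opened"}\n'
    '{"iid":2,"state":"closed"}\n'
    '{"iid":3,"state":"opened"}\n'
)

# Records are filtered one at a time, so the whole file is never held in memory.
open_iids = [issue["iid"] for issue in read_ndjson(sample) if issue["state"] == "opened"]
```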
Improvements
- Reduced memory consumption
- Reduced failure rates, especially on large exports
Risks
- We will need to be able to convert old exports to the new format so as not to break import compatibility
- Do we need to update import functionality to accept this new format?
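One possible shape for converting an old export is sketched below. It rests on a loud assumption: that the legacy export is a single JSON object whose list-valued keys are relations (issues, merge_requests, ...) and whose remaining keys are project attributes. The function name, the project.ndjson file, and the sample data are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def convert_legacy_export(legacy: Dict[str, Any], out_dir: Path) -> List[str]:
    """Split a legacy single-document export into one .ndjson file per relation."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    project_attributes = {}
    for key, value in legacy.items():
        if isinstance(value, list):
            # Assumed: every list-valued key is a relation with one record per item.
            path = out_dir / f"{key}.ndjson"
            with path.open("w", encoding="utf-8") as fh:
                for record in value:
                    fh.write(json.dumps(record) + "\n")
            written.append(path.name)
        else:
            project_attributes[key] = value
    # Hypothetical choice: remaining attributes go into a one-line project.ndjson.
    with (out_dir / "project.ndjson").open("w", encoding="utf-8") as fh:
        fh.write(json.dumps(project_attributes) + "\n")
    written.append("project.ndjson")
    return sorted(written)


legacy_export = {
    "description": "Example project",
    "issues": [{"iid": 1}, {"iid": 2}],
    "merge_requests": [{"iid": 7}],
}
target = Path(tempfile.mkdtemp())
files = convert_legacy_export(legacy_export, target)
```

Note that this sketch takes the legacy document as an already-parsed object, so it has the same memory cost as the current format. That may be acceptable for a one-off conversion, but very large legacy exports would need an incremental JSON parser instead.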
Edited by 🤖 GitLab Bot 🤖