WIP: Implement `ndjson` support for `import/export`
What does this MR do?
Implement ndjson support for import/export
This implements ndjson and streaming JSON support to handle two cases:
- big project.json (the legacy way)
- the new .ndjson format, where each relation receives a separate file, and each item is stored on its own line
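To make the second case concrete, here is a minimal sketch of the per-relation layout, assuming a hypothetical helper `write_relation_ndjson` and an `issues` relation; neither name is taken from this MR's code.

```ruby
require "json"
require "tmpdir"

# Hypothetical helper (not from this MR): writes one relation to
# `<dir>/<relation>.ndjson`, one JSON document per line.
def write_relation_ndjson(dir, relation, items)
  path = File.join(dir, "#{relation}.ndjson")
  File.open(path, "w") do |file|
    # to_json never emits raw newlines, so each item stays on one line.
    items.each { |item| file.puts(item.to_json) }
  end
  path
end

# Illustrative usage: write two issues, then read the raw lines back.
def demo_ndjson_lines
  Dir.mktmpdir do |dir|
    items = [
      { "iid" => 1, "title" => "First" },
      { "iid" => 2, "title" => "Second\nline" }
    ]
    path = write_relation_ndjson(dir, "issues", items)
    File.readlines(path, chomp: true)
  end
end
```

Because every item is serialized independently, the writer never needs to hold the whole relation in memory.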
This can properly detect both old and new file contents, without any changes to the files, while maintaining backward compatibility.
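One way such detection could work is sketched below. This is an assumption for illustration only: the MR text does not describe the actual rule, and `detect_export_format` is a hypothetical name. The sketch treats the presence of any `.ndjson` file as the new format and a top-level `project.json` as the legacy one.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical detection rule (assumed, not taken from this MR):
# - any *.ndjson file in the extracted archive => new format
# - a top-level project.json                    => legacy format
def detect_export_format(dir)
  return :ndjson if Dir.glob(File.join(dir, "**", "*.ndjson")).any?
  return :legacy_json if File.exist?(File.join(dir, "project.json"))

  :unknown
end

# Illustrative usage against two throwaway directories.
def demo_detect_formats
  Dir.mktmpdir do |legacy_dir|
    Dir.mktmpdir do |ndjson_dir|
      File.write(File.join(legacy_dir, "project.json"), "{}")
      FileUtils.mkdir_p(File.join(ndjson_dir, "project"))
      File.write(File.join(ndjson_dir, "project", "issues.ndjson"), "{}\n")
      [detect_export_format(legacy_dir), detect_export_format(ndjson_dir)]
    end
  end
end
```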
This implements a trick that lets the streaming JSON writer append data incrementally.
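The exact trick is not spelled out here, so the following is only one possible approach, with a hypothetical class name `StreamingJsonWriter`: keep the top-level object and the current array open, and emit the closing `]` and `}` lazily when the key changes or the writer is closed.

```ruby
require "json"
require "stringio"

# Hypothetical sketch (not this MR's implementation): builds
# {"key":[item,item,...],...} without holding the items in memory.
class StreamingJsonWriter
  def initialize(io)
    @io = io
    @current_key = nil
    @first_entry = true
    @first_item = true
    @io.write("{")
  end

  # Appends one item to the array stored under `key`.
  # Items for the same key must be appended consecutively; returning to an
  # earlier key would produce a duplicate key in the output.
  def append(key, item)
    if key != @current_key
      close_array
      @io.write(",") unless @first_entry
      @io.write("#{key.to_s.to_json}:[")
      @current_key = key
      @first_entry = false
      @first_item = true
    end
    @io.write(",") unless @first_item
    @io.write(item.to_json)
    @first_item = false
  end

  # Closes any open array and the top-level object.
  def close
    close_array
    @io.write("}")
  end

  private

  def close_array
    @io.write("]") if @current_key
    @current_key = nil
  end
end
```

A caller would append items as they are loaded from the database and call `close` once at the end, so only one item is serialized at a time.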
Overall, when exporting legacy/ndjson or importing ndjson, this lets the process run with constant memory usage, and it also significantly reduces the latency of data processing due to not escaping to native code.
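The constant-memory property on the import side follows from reading one line at a time. A minimal sketch, assuming a hypothetical reader `each_ndjson_item` that is not part of this MR:

```ruby
require "json"
require "tempfile"

# Hypothetical reader (not from this MR): yields one parsed item per line.
# File.foreach reads lazily, so only the current line is held in memory.
def each_ndjson_item(path)
  return enum_for(:each_ndjson_item, path) unless block_given?

  File.foreach(path) do |line|
    line = line.strip
    next if line.empty? # tolerate blank lines

    yield JSON.parse(line)
  end
end

# Illustrative usage: write a small file and collect the titles.
def demo_read_titles
  Tempfile.create(["issues", ".ndjson"]) do |file|
    file.write(%({"iid":1,"title":"First"}\n\n{"iid":2,"title":"Second"}\n))
    file.flush
    each_ndjson_item(file.path).map { |item| item["title"] }
  end
end
```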
This does remove the usage of RelationFactory on the exporting side.
I believe it is an OK trade-off to make.
Performance
Keep in mind that the idle memory usage of GitLab is around 500MB.
1. The master branch
git checkout b213471f
1.1. Import on master
IMPORT_DEBUG=1 bin/rake gitlab:import_export:import[root,root,gitlabhq-with-issues-4,tmp/exports/gitlabhq_with_issues_export_ndjson_v2.tar.gz]
Time to finish: 1260.872610532002
Number of SQL calls: 147407
Memory usage: 890.62109375 MiB
GC calls: 2718
GC major calls: 55
Label: process_345
1.2. Export on master
IMPORT_DEBUG=1 bin/rake gitlab:import_export:export[root,root,gitlabhq-with-issues-3,tmp/exports/gitlabhq_with_issues_export_legacy_v2.tar.gz]
Time to finish: 97.66875144900041
Number of SQL calls: 4006
Memory usage: 761.77734375 MiB
GC calls: 199
GC major calls: 26
Label: process_309
2. The implement-ndjson branch
git checkout dbcec49a
2.1. Import on implement-ndjson
IMPORT_DEBUG=1 bin/rake gitlab:import_export:import[root,root,gitlabhq-with-issues-5,tmp/exports/gitlabhq_with_issues_export_ndjson_v2.tar.gz]
Time to finish: 1207.3776693220025
Number of SQL calls: 147418
Memory usage: 671.0703125 MiB
GC calls: 2737
GC major calls: 42
Label: process_378
2.2. Export on implement-ndjson
IMPORT_DEBUG=1 bin/rake gitlab:import_export:export[root,root,gitlabhq-with-issues-3,tmp/exports/gitlabhq_with_issues_export_ndjson_v2.tar.gz]
Time to finish: 102.0661853370002
Number of SQL calls: 4006
Memory usage: 564.35546875 MiB
GC calls: 199
GC major calls: 23
Label: process_280
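To compare the runs, the relative change of each metric can be computed from the figures reported above. The sketch below only restates those figures; the helper `percent_change` and the constant `RUNS` are hypothetical names introduced for illustration.

```ruby
# Figures copied from the runs above (memory in MiB, time as reported).
RUNS = {
  import: {
    master: { memory: 890.62109375, time: 1260.872610532002 },
    ndjson: { memory: 671.0703125, time: 1207.3776693220025 }
  },
  export: {
    master: { memory: 761.77734375, time: 97.66875144900041 },
    ndjson: { memory: 564.35546875, time: 102.0661853370002 }
  }
}.freeze

# Hypothetical helper: relative change from `before` to `after`, in percent.
# Negative values mean the implement-ndjson branch used less.
def percent_change(before, after)
  (after - before) / before * 100.0
end

def summarize_runs
  RUNS.map do |operation, branches|
    changes = %i[memory time].map do |metric|
      [metric, percent_change(branches[:master][metric], branches[:ndjson][metric])]
    end
    [operation, changes.to_h]
  end.to_h
end

summarize_runs.each do |operation, changes|
  puts format("%s: memory %+.1f%%, time %+.1f%%", operation, changes[:memory], changes[:time])
end
```

On these numbers, memory usage drops by roughly a quarter for both import and export, while the export on implement-ndjson takes slightly longer than on master.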
Does this MR meet the acceptance criteria?
Conformity
- Changelog entry
- Documentation (if required)
- Code review guidelines
- Merge request performance guidelines
- Style guides
- Database guides
- Separation of EE specific content
Availability and Testing
- Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
- Tested in all supported browsers
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
- Label as security and @ mention @gitlab-com/gl-security/appsec
- The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- Security reports checked/validated by a reviewer from the AppSec team