WIP: Implement `ndjson` support for `import/export`

Kamil Trzciński requested to merge implement-ndjson into master

What does this MR do?

Implement ndjson support for import/export

This implements ndjson and streaming JSON support to handle two cases:

  • a single big project.json (the legacy way)
  • the new .ndjson format, where each relation gets a separate file and each item is stored on its own line
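To illustrate the per-line layout, here is a minimal sketch of writing and reading one relation as ndjson. The helper names `write_ndjson` and `each_ndjson_item` are hypothetical and are not taken from this MR; the sample issue attributes are invented for the example.

```ruby
require 'json'
require 'stringio'

# Hypothetical helper: write each item of a relation as one JSON document per line.
def write_ndjson(io, items)
  items.each { |item| io.puts(JSON.generate(item)) }
end

# Hypothetical helper: yield one parsed item per line, so only a single item
# needs to be held in memory at a time.
def each_ndjson_item(io)
  return enum_for(:each_ndjson_item, io) unless block_given?

  io.each_line do |line|
    line = line.strip
    next if line.empty?

    yield JSON.parse(line)
  end
end

issues = [{ 'iid' => 1, 'title' => 'First' }, { 'iid' => 2, 'title' => 'Second' }]

buffer = StringIO.new
write_ndjson(buffer, issues)

buffer.rewind
restored = each_ndjson_item(buffer).to_a
```

Because every item sits on its own line, a reader can process a relation file line by line instead of parsing the whole document at once.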

The importer detects whether it is reading the old or the new format from the file contents, so existing files need no changes and backward compatibility is maintained.
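As a rough sketch of what such detection could look like, the snippet below picks a format based on which files are present in an extracted archive. The method name `detect_export_format` and the directory layout (`project.json` at the top level, `*.ndjson` files somewhere below it) are assumptions made for illustration, not the layout or logic this MR actually uses.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical detection; the directory layout is an assumption.
def detect_export_format(dir)
  return :ndjson unless Dir.glob(File.join(dir, '**', '*.ndjson')).empty?
  return :legacy if File.exist?(File.join(dir, 'project.json'))

  raise ArgumentError, "no known export format found in #{dir}"
end

# A legacy-style archive: one big project.json.
legacy_format = Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'project.json'), '{}')
  detect_export_format(dir)
end

# An ndjson-style archive: one file per relation.
ndjson_format = Dir.mktmpdir do |dir|
  FileUtils.mkdir_p(File.join(dir, 'project'))
  File.write(File.join(dir, 'project', 'issues.ndjson'), "{}\n")
  detect_export_format(dir)
end
```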

It also implements a trick that lets the streaming JSON writer append data incrementally.
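One way an incremental writer could work is sketched below. This is not necessarily the trick used in this MR: the class `StreamingJsonWriter` and its methods are hypothetical. The idea is to keep the top-level object open, emit each key and each array element as soon as it is available, and only write the closing brace at the end.

```ruby
require 'json'
require 'stringio'

# Hypothetical writer that emits a single JSON object piece by piece.
class StreamingJsonWriter
  def initialize(io)
    @io = io
    @first_key = true
    @io.write('{')
  end

  # Write a plain attribute in one go.
  def write_attribute(key, value)
    write_key(key)
    @io.write(value.to_json)
  end

  # Stream a relation as a JSON array, serializing one element at a time.
  def write_relation_array(key, items)
    write_key(key)
    @io.write('[')
    items.each_with_index do |item, index|
      @io.write(',') unless index.zero?
      @io.write(item.to_json)
    end
    @io.write(']')
  end

  # Close the top-level object; nothing can be appended afterwards.
  def close
    @io.write('}')
  end

  private

  def write_key(key)
    @io.write(',') unless @first_key
    @first_key = false
    @io.write(key.to_s.to_json)
    @io.write(':')
  end
end

out = StringIO.new
writer = StreamingJsonWriter.new(out)
writer.write_attribute(:description, 'demo')
writer.write_relation_array(:issues, [{ 'iid' => 1 }, { 'iid' => 2 }].each)
writer.close

parsed = JSON.parse(out.string)
```

The output is still one valid legacy-style JSON document, but the writer never has to hold the whole tree in memory before serializing it.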

Overall, when exporting legacy/ndjson or importing ndjson, this lets the process run with constant memory. It also significantly reduces data-processing latency, because we no longer escape to native code.

This removes the usage of RelationFactory on the exporting side. I believe it is an OK trade-off to make.

Performance

Keep in mind that the idle memory usage of GitLab is around 500MB.

1. The master branch

git checkout b213471f

1.1. Import on master

IMPORT_DEBUG=1 bin/rake gitlab:import_export:import[root,root,gitlabhq-with-issues-4,tmp/exports/gitlabhq_with_issues_export_ndjson_v2.tar.gz]
Time to finish: 1260.872610532002
Number of SQL calls: 147407
Memory usage: 890.62109375 MiB
GC calls: 2718
GC major calls: 55
Label: process_345

1.2. Export on master

IMPORT_DEBUG=1 bin/rake gitlab:import_export:export[root,root,gitlabhq-with-issues-3,tmp/exports/gitlabhq_with_issues_export_legacy_v2.tar.gz]
Time to finish: 97.66875144900041
Number of SQL calls: 4006
Memory usage: 761.77734375 MiB
GC calls: 199
GC major calls: 26
Label: process_309


2. The implement-ndjson branch

git checkout dbcec49a

2.1. Import on implement-ndjson

IMPORT_DEBUG=1 bin/rake gitlab:import_export:import[root,root,gitlabhq-with-issues-5,tmp/exports/gitlabhq_with_issues_export_ndjson_v2.tar.gz]
Time to finish: 1207.3776693220025
Number of SQL calls: 147418
Memory usage: 671.0703125 MiB
GC calls: 2737
GC major calls: 42
Label: process_378

2.2. Export on implement-ndjson

IMPORT_DEBUG=1 bin/rake gitlab:import_export:export[root,root,gitlabhq-with-issues-3,tmp/exports/gitlabhq_with_issues_export_ndjson_v2.tar.gz]
Time to finish: 102.0661853370002
Number of SQL calls: 4006
Memory usage: 564.35546875 MiB
GC calls: 199
GC major calls: 23
Label: process_280
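To make the four runs easier to compare, the following sketch computes the differences between master and implement-ndjson. The figures are copied from the output above; the `RUNS` constant and the `compare` helper are only illustrative.

```ruby
# Figures copied from the runs above
# (master at b213471f, implement-ndjson at dbcec49a).
RUNS = {
  import: {
    master: { seconds: 1260.872610532002, memory_mib: 890.62109375 },
    ndjson: { seconds: 1207.3776693220025, memory_mib: 671.0703125 }
  },
  export: {
    master: { seconds: 97.66875144900041, memory_mib: 761.77734375 },
    ndjson: { seconds: 102.0661853370002, memory_mib: 564.35546875 }
  }
}.freeze

# Compare one task (import or export) across the two branches.
def compare(run)
  before = run[:master]
  after = run[:ndjson]
  saved = before[:memory_mib] - after[:memory_mib]

  {
    memory_saved_mib: saved,
    memory_saved_percent: (100.0 * saved / before[:memory_mib]).round(1),
    seconds_delta: (after[:seconds] - before[:seconds]).round(2)
  }
end

summary = RUNS.transform_values { |run| compare(run) }
summary.each { |task, result| puts "#{task}: #{result}" }
```

On these numbers, both import and export use roughly 25% less memory on implement-ndjson. The import finishes about 53 seconds sooner, while the export takes about 4 seconds longer.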

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team
Edited by 🤖 GitLab Bot 🤖
