Make project imports use constant memory
Need
We are still facing the problem that as projects grow in size, so do the memory requirements of the project importer. The relationship is roughly linear, since the project metadata (currently a tree data structure) needs to reside in memory in its entirety for the duration of the import; a larger project therefore means more memory used.
We would like to move away from this approach for several reasons, a major one being more predictable memory use during project imports, ideally roughly constant for the duration of the import, which is what this issue describes.
Approach
Project metadata is currently encoded in JSON and is front-loaded using a DOM-based JSON parser. We think we should instead break down one large problem into many smaller ones by processing the JSON file in discrete chunks. There are many ways to do this, such as:
- Chopping up the `project.json` file into several smaller files. These would then be processed individually and discarded when done, leading to lower peak memory use. This requires us to find clear cut-off points within the tree, such that any single file relies at most on project data that was ingested in a previous iteration.
- Moving to a streaming JSON parser. These kinds of parsers emit JSON nodes as they traverse a file, so that an application can act on them immediately via callbacks. This is a very common option for processing large JSON files (gigabytes rather than megabytes), and there are many efficient libraries implementing this technique (see the first sketch after this list).
- Moving to a different encoding. Yet another approach could be to change the encoding of the project tree from an actual tree to a row-based format. One such encoding is `ndjson`, and we already have a proposal for how to implement it (see the second sketch after this list).
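To make the streaming option concrete, here is a minimal sketch in Python using the `ijson` library. The language, the library, and the assumption that `project.json` holds a top-level `issues` array are all illustrative choices, not part of the proposal:

```python
import ijson

def persist(record):
    # Placeholder for the real work: writing the record to the database.
    print(record)

def import_issues(path):
    # Stream one element of the top-level "issues" array at a time, so
    # peak memory is bounded by the largest single element rather than
    # by the size of the whole file.
    with open(path, "rb") as f:
        for issue in ijson.items(f, "issues.item"):
            persist(issue)
```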
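Similarly, a sketch of the row-based option, assuming each line of the export is one self-contained JSON document (the record shape is again illustrative):

```python
import json

def persist(record):
    # Placeholder for the real work: writing the record to the database.
    print(record)

def import_ndjson(path):
    # Each line is parsed, processed, and discarded in turn, so memory
    # use is bounded by the largest single row.
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                persist(json.loads(line))
```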
Benefits
The main benefits of this proposal are increases in scalability, reliability and predictability: regardless of project size, the memory requirements will be roughly equal. This means that we can scale to massive projects without buying new hardware or risking running into memory limits that result in job failures, meaning higher reliability. With constant memory also comes better predictability, which helps in cluster sizing.
There is another benefit this would unlock, though it requires more work: resumability. If we had an importer that operates in discrete chunks, it becomes possible to identify checkpoints that we can persist and from which we can resume should an import fail, as sketched below.
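As an illustration only, a minimal checkpointing sketch on top of the ndjson variant above: the index of the last successfully imported row is persisted after each row and skipped on resume. The checkpoint format and file name are assumptions, not a settled design:

```python
import json
import os

CHECKPOINT_FILE = "import.checkpoint"  # hypothetical checkpoint location

def persist(record):
    # Placeholder for the real work: writing the record to the database.
    print(record)

def load_checkpoint():
    # Returns the index of the last successfully imported row, or -1.
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return int(f.read())
    return -1

def import_with_resume(path):
    last_done = load_checkpoint()
    with open(path, encoding="utf-8") as f:
        for index, line in enumerate(f):
            if index <= last_done:
                continue  # already imported in a previous, failed run
            persist(json.loads(line))
            with open(CHECKPOINT_FILE, "w") as cp:
                cp.write(str(index))  # record progress after each row
```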
Competition
TODO: Move any options we decide against from 'Approach' here with an explanation why they were ruled out.