
De-dup project tree entries

Matthias Käppler requested to merge 27070-dedup-import-json into master

What does this MR do?

Closes #27070 (closed)

Note: This change introduces a feature flag

  • name: dedup_project_import_metadata
  • default: off

The goal is to use heap memory more efficiently during imports by removing duplicate entries from the project metadata tree, thereby shrinking it.

Implementation-wise, I introduced a new collaborator, ProjectTreeProcessor, which ProjectTreeRestorer delegates to before handing the relation tree off to RelationTreeRestorer. This makes it easy to swap out implementations, e.g. for testing and comparison. To that end I introduced a no-op IdentityProjectTreeProcessor, which leaves the tree untouched.
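To illustrate the delegation described above, here is a minimal sketch. Only the class name IdentityProjectTreeProcessor comes from this MR; the `process` method name, the `RestorerSketch` class, and its `hand_off` helper are assumptions made for illustration and are not the actual GitLab code.

```ruby
# No-op processor: returns the tree untouched.
# (The `process` method name is an assumption.)
class IdentityProjectTreeProcessor
  def process(tree)
    tree
  end
end

# Hypothetical stand-in for ProjectTreeRestorer: it delegates to whichever
# processor it was given before handing the tree off for restoring.
class RestorerSketch
  def initialize(processor:)
    @processor = processor
  end

  def restore(tree)
    processed = @processor.process(tree)
    hand_off(processed)
  end

  private

  # Placeholder for the hand-off to RelationTreeRestorer.
  def hand_off(tree)
    tree
  end
end

tree = { 'issues' => [{ 'title' => 'a' }] }
result = RestorerSketch.new(processor: IdentityProjectTreeProcessor.new).restore(tree)
```

Because the processor is injected, a de-duplicating implementation and the no-op one can be compared against the same input without touching the restorer.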

Preliminary results are not good: overall memory usage appears to have gone up, contrary to the original hypothesis:

Before

Screenshot_from_2020-01-08_09-41-29

After

Screenshot_from_2020-01-08_09-41-47

It is evident from the "After" profile that the original tree has not shrunk at all; rather, an additional 84M is being allocated in dedup_hash.

I suspect this is because, in addition to the original tree, we are building up a new hash that takes care of the bookkeeping around which nodes have been visited before. More measurements are required to see whether this accounts for the discrepancy.
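The bookkeeping cost can be seen in a small sketch of the general technique. This is not the dedup_hash implementation from this MR; the `dedup` function and the sample tree are hypothetical, and the approach shown (a `seen` hash keyed by node content) is only one way such de-duplication might work.

```ruby
# Replaces structurally equal nodes with one shared instance.
# `seen` is the bookkeeping hash: it maps every distinct node to its
# canonical copy, so it grows with the number of distinct nodes.
def dedup(node, seen = {})
  case node
  when Hash
    node.transform_values! { |value| dedup(value, seen) }
  when Array
    node.map! { |value| dedup(value, seen) }
  end
  seen[node] ||= node
end

tree = {
  'issues' => [
    { 'author' => { 'name' => 'alice' }, 'title' => 'one' },
    { 'author' => { 'name' => 'alice' }, 'title' => 'two' }
  ]
}

bookkeeping = {}
dedup(tree, bookkeeping)
first, second = tree['issues']
shared = first['author'].equal?(second['author'])
```

After the call, both issues point at the same author hash, but `bookkeeping` now holds an entry for every distinct node in the tree, and the whole tree had to be traversed and hashed to get there. On a large tree that extra hash is a plausible source of additional allocations.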

The dedup_hash method also adds roughly 7-12 seconds of runtime (I've seen this swing quite wildly between runs), because it needs to traverse a large tree in its entirety.

UPDATE: the above results refer to the original proposal. I found a faster solution, but it still performs worse than not de-duping at all, as explained in this comment.

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

  • unit tests
  • run manual tests against the gitlabhq project.json (this takes 14s to run, so I didn't make it a unit test)
  • run full gitlabhq import locally and compare results
  • run gitlabhq import against branch-specific GL instance
  • add feature-toggle
  • only apply optimization if project tree >= 500MB
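
The last two checklist items could be combined into a single guard, sketched below. The `dedup_tree?` helper, its parameters, and the reading of 500MB as 500 * 1024 * 1024 bytes are all assumptions; how the tree size would actually be measured is not specified in this MR.

```ruby
# Assumption: "500MB" is interpreted as 500 * 1024 * 1024 bytes.
MIN_TREE_SIZE_BYTES = 500 * 1024 * 1024

# Hypothetical helper: de-dup only when the feature toggle is on
# and the project tree is at least as large as the threshold.
def dedup_tree?(flag_enabled:, tree_size_bytes:)
  flag_enabled && tree_size_bytes >= MIN_TREE_SIZE_BYTES
end

small_tree = dedup_tree?(flag_enabled: true, tree_size_bytes: 10 * 1024 * 1024)
large_tree = dedup_tree?(flag_enabled: true, tree_size_bytes: 600 * 1024 * 1024)
flag_off   = dedup_tree?(flag_enabled: false, tree_size_bytes: 600 * 1024 * 1024)
```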
