Remove merge request diffs from the import

Problem to solve

Merge request diffs are (usually) stored in the database, and can be very large. They are included in project exports, which increases both the size of the export archive and the time an import takes to run.

The diffs themselves are duplicate data - the bundle stored in the project repository already has everything they contain.

Target audience

Further details

We already store references to every merge request version, using the refs/keep-around system, so the git repository is certain to have everything we're interested in, right?

Proposal

Stop storing the merge_request_diff* tables in project exports, or importing them on project import. Instead, regenerate those tables - without data loss - from the git repository.

I think the easiest method would be to store the information about which versions are which in the refs/merge-requests reference hierarchy. For instance, we could have:

refs/merge-requests/1/versions/latest
refs/merge-requests/1/versions/1
refs/merge-requests/1/versions/2
# ...

(since we already have refs/merge-requests/1 as a file, this exact layout isn't possible, but you get the idea).
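As a rough sketch of how an importer might read such refs back, the snippet below groups "<sha> <refname>" lines (the shape `git show-ref` prints) by merge request and version number. The ref layout is the hypothetical one from this proposal, and `group_version_refs` is an illustrative name, not an existing GitLab helper.

```python
import re
from collections import defaultdict

# Hypothetical layout from this proposal; not an existing GitLab convention.
VERSION_REF = re.compile(r"^refs/merge-requests/(\d+)/versions/(\d+)$")


def group_version_refs(ref_lines):
    """Map MR iid -> list of (version number, head SHA), ordered by version.

    `ref_lines` are "<sha> <refname>" lines, as printed by `git show-ref`.
    """
    versions = defaultdict(list)
    for line in ref_lines:
        sha, _, refname = line.strip().partition(" ")
        match = VERSION_REF.match(refname)
        if match is None:
            continue  # skips versions/latest and any unrelated refs
        iid, number = int(match.group(1)), int(match.group(2))
        versions[iid].append((number, sha))
    return {iid: sorted(items) for iid, items in versions.items()}
```

Sorting by the numeric version rather than by ref name keeps version 10 after version 9.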

I believe this information is enough to reconstruct, at project import time, the entire merge_request_diffs table and its two children, merge_request_diff_commits and merge_request_diff_files.
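A minimal sketch of what that reconstruction could look like for a single version, assuming the importer already knows the version's base and head SHAs (the proposal does not say how the base would be recorded). The function names are hypothetical, and the dictionary keys only loosely mirror the table columns; this is an illustration rather than the actual import code.

```python
import subprocess


def git_lines(repo_path, *args):
    """Run a git command in `repo_path` and return its non-empty output lines."""
    result = subprocess.run(
        ["git", "-C", repo_path, *args],
        check=True, capture_output=True, text=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def commit_rows(diff_id, shas):
    """Build illustrative merge_request_diff_commits rows, in the order given."""
    return [
        {"merge_request_diff_id": diff_id, "relative_order": order, "sha": sha}
        for order, sha in enumerate(shas)
    ]


def file_rows(diff_id, name_status_lines):
    """Build illustrative merge_request_diff_files rows from `--name-status` lines."""
    rows = []
    for order, line in enumerate(name_status_lines):
        status, *paths = line.split("\t")
        # Renames and copies (R100, C75, ...) list the old path, then the new one.
        rows.append({
            "merge_request_diff_id": diff_id,
            "relative_order": order,
            "old_path": paths[0],
            "new_path": paths[-1],
            "status": status[0],
        })
    return rows


def rebuild_version(repo_path, diff_id, base_sha, head_sha):
    """Regenerate commit and file rows for one version straight from git."""
    shas = git_lines(repo_path, "rev-list", f"{base_sha}..{head_sha}")
    changes = git_lines(repo_path, "diff", "--name-status", "-M", base_sha, head_sha)
    return commit_rows(diff_id, shas), file_rows(diff_id, changes)
```

The diff content itself is left out here; it could be produced on demand with `git diff` between the same two SHAs. Paths containing unusual characters are quoted by git in this output mode, so a real implementation would probably want the `-z` form instead.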

This also benefits the project import itself: it becomes less vulnerable to security issues, conceptually simpler, and more resilient to churn in those tables as we push forward with more features, like external diffs.

We could also begin removing diffs from the database when an MR is closed or merged, without losing any of the information required to show that diff again in the future.

What does success look like, and how can we measure that?

Project export archives become smaller without any degradation of functionality.

Links / references

cc @fjsanpedro @DouweM @smcgivern
