Dedupe `merge_request_diff_files` (#19420) · Issues · GitLab.org / GitLab

`merge_request_diff_files` contains the entire diff history of every merge request, which means it grows very fast. Previously, we stored this in a serialised YAML column on `merge_request_diffs`, so we couldn't do much about that, but now we can.

The problem is this:

1. I create an MR adding files `a`, `b`, and `c`, each of which has 100 lines.
2. That creates three entries in `merge_request_diff_files`.
3. Someone points out that I was meant to add `d`, also with 100 lines.
4. I make that change and push, without changing `a`, `b`, or `c`.
5. We insert four more rows in `merge_request_diff_files`, with the first three only differing in their `merge_request_diff_id` and (potentially) `relative_order` columns.

Now that we have separate tables for this, we could normalise even further by taking a hash of each file's diff content, like this:

1. We create a new `merge_request_diff_file_contents` table with two columns:
   1. `diff` - the equivalent of `merge_request_diff_files.diff` now.
   2. `hash` - a hash (we can use whatever hash function makes most sense) of the `diff` column, which is indexed.
2. `merge_request_diff_files` loses the `diff` column, and gains a `merge_request_diff_file_contents_hash` foreign key instead.

This is basically reinventing part of git inside our database, but it's pretty simple. Migrating will be hard, though, and we just migrated to `merge_request_diff_files` in the first place :disappointed:
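To make the proposal concrete, here is a minimal sketch of the content-addressed layout, replaying the `a`/`b`/`c`/`d` scenario above. It is only an illustration under stated assumptions: it uses an in-memory SQLite database as a stand-in for the real one, picks SHA-256 as the hash (the proposal leaves the hash function open), guesses at column types, and the helper names `create_schema` and `store_diff_version` are hypothetical.

```python
import hashlib
import sqlite3


def create_schema(conn):
    # Assumed column types; only the table and column names come from the proposal.
    conn.executescript(
        """
        CREATE TABLE merge_request_diff_file_contents (
            hash TEXT PRIMARY KEY,   -- indexed via the primary key
            diff TEXT NOT NULL
        );
        CREATE TABLE merge_request_diff_files (
            merge_request_diff_id INTEGER NOT NULL,
            relative_order INTEGER NOT NULL,
            merge_request_diff_file_contents_hash TEXT NOT NULL
                REFERENCES merge_request_diff_file_contents (hash),
            PRIMARY KEY (merge_request_diff_id, relative_order)
        );
        """
    )


def store_diff_version(conn, merge_request_diff_id, diffs):
    """Store one pushed version of an MR; identical diffs are stored only once."""
    for relative_order, diff in enumerate(diffs):
        # SHA-256 is an assumption; any suitable hash function would do.
        digest = hashlib.sha256(diff.encode("utf-8")).hexdigest()
        # Insert the content only if this hash has not been seen before.
        conn.execute(
            "INSERT OR IGNORE INTO merge_request_diff_file_contents (hash, diff) "
            "VALUES (?, ?)",
            (digest, diff),
        )
        # Every version still gets its own small row pointing at the content.
        conn.execute(
            "INSERT INTO merge_request_diff_files "
            "(merge_request_diff_id, relative_order, "
            "merge_request_diff_file_contents_hash) VALUES (?, ?, ?)",
            (merge_request_diff_id, relative_order, digest),
        )


conn = sqlite3.connect(":memory:")
create_schema(conn)

# Four 100-line diffs standing in for files a, b, c and d.
a, b, c, d = (f"+line in {name}\n" * 100 for name in "abcd")

store_diff_version(conn, 1, [a, b, c])     # first push: a, b, c
store_diff_version(conn, 2, [a, b, c, d])  # second push: adds d, others unchanged

file_rows = conn.execute(
    "SELECT COUNT(*) FROM merge_request_diff_files"
).fetchone()[0]
content_rows = conn.execute(
    "SELECT COUNT(*) FROM merge_request_diff_file_contents"
).fetchone()[0]
print(file_rows, content_rows)  # 7 file rows, but only 4 stored diffs
```

The row count in `merge_request_diff_files` is unchanged (seven rows across the two pushes), but each row now carries only a hash, and the bulky diff text for `a`, `b`, and `c` is stored once instead of twice.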
