Dedupe `merge_request_diff_files`
`merge_request_diff_files` contains the entire diff history of every merge request, which means it grows very fast. Previously, we stored this in a serialised YAML column on `merge_request_diffs`, so we couldn't do much about that, but now we can.
The problem is this:
- I create an MR adding files `a`, `b`, and `c`, each of which has 100 lines.
- That creates three entries in `merge_request_diff_files`.
- Someone points out that I was meant to add `d`, also with 100 lines.
- I make that change and push, without changing `a`, `b`, or `c`.
- We insert four more rows in `merge_request_diff_files`, with the first three only differing in their `merge_request_diff_id` and (potentially) `relative_order` columns.
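To make the duplication concrete, here is a minimal sketch that replays the two pushes above and counts the rows. The row layout is a simplified, hypothetical stand-in for `merge_request_diff_files` (the `path` field and the generated diff text are assumptions for illustration only):

```python
# Hypothetical stand-ins for the diffs of the four files (100 lines each).
diffs = {
    name: "\n".join(f"+{name} line {i}" for i in range(100))
    for name in "abcd"
}

# Each push creates a new merge request diff, and every file in that
# diff gets its own row, whether or not the file changed since last time.
pushes = [["a", "b", "c"], ["a", "b", "c", "d"]]

rows = []  # (merge_request_diff_id, relative_order, path, diff)
for diff_id, files in enumerate(pushes, start=1):
    for order, name in enumerate(files):
        rows.append((diff_id, order, name, diffs[name]))

total_rows = len(rows)
distinct_diffs = len({row[3] for row in rows})
duplicated_rows = total_rows - distinct_diffs

print(total_rows, distinct_diffs, duplicated_rows)  # 7 4 3
```

Three of the seven rows carry a `diff` value that is already stored elsewhere in the table, and that share grows with every further push that leaves most files untouched.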
Now that we have separate tables for this, we could normalise even further by taking a hash of each file's diff, like this:
- We create a new `merge_request_diff_file_contents` table with two columns:
  - `diff` - the equivalent of `merge_request_diff_files.diff` now.
  - `hash` - a hash (we can use whatever hash function makes most sense) of the `diff` column, which is indexed.
- `merge_request_diff_files` loses the `diff` column, and gains a `merge_request_diff_file_contents_hash` foreign key instead.
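A rough sketch of how that could look, using an in-memory SQLite database purely as a stand-in for the real one. SHA-256, the `insert_diff_file` helper, the composite primary key, and the trimmed-down column set are all assumptions for illustration, not a proposed migration:

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE merge_request_diff_file_contents (
        hash TEXT PRIMARY KEY,   -- indexed via the primary key
        diff TEXT NOT NULL
    );
    CREATE TABLE merge_request_diff_files (
        merge_request_diff_id INTEGER NOT NULL,
        relative_order INTEGER NOT NULL,
        merge_request_diff_file_contents_hash TEXT NOT NULL
            REFERENCES merge_request_diff_file_contents (hash),
        PRIMARY KEY (merge_request_diff_id, relative_order)
    );
""")


def insert_diff_file(conn, diff_id, relative_order, diff):
    """Store one diff file row, reusing stored contents when the hash matches."""
    # SHA-256 is an assumption; any suitable hash function would do.
    digest = hashlib.sha256(diff.encode("utf-8")).hexdigest()
    # INSERT OR IGNORE skips the write when this hash is already stored.
    conn.execute(
        "INSERT OR IGNORE INTO merge_request_diff_file_contents (hash, diff) "
        "VALUES (?, ?)",
        (digest, diff),
    )
    conn.execute(
        "INSERT INTO merge_request_diff_files VALUES (?, ?, ?)",
        (diff_id, relative_order, digest),
    )
    return digest


# Replay the example: first push adds a, b, c; second push adds d as well.
diffs = {name: f"+contents of {name}\n" * 100 for name in "abcd"}
for diff_id, files in enumerate([["a", "b", "c"], ["a", "b", "c", "d"]], start=1):
    for order, name in enumerate(files):
        insert_diff_file(conn, diff_id, order, diffs[name])

file_rows = conn.execute(
    "SELECT COUNT(*) FROM merge_request_diff_files"
).fetchone()[0]
content_rows = conn.execute(
    "SELECT COUNT(*) FROM merge_request_diff_file_contents"
).fetchone()[0]
print(file_rows, content_rows)  # 7 4
```

There are still seven rows in `merge_request_diff_files`, but each one now holds only a fixed-size hash, and the four distinct diffs are stored once each.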
This is basically reinventing part of git inside our database, but it's pretty simple. Migrating will be hard, though, and we only just migrated to `merge_request_diff_files` in the first place.