Dedupe `merge_request_diff_files`
`merge_request_diff_files` contains the entire diff history of every merge request, which means it grows very fast. Previously, we stored this in a serialised YAML column on `merge_request_diffs`, so we couldn't do much about that, but now we can.
The problem is this:
- I create an MR adding files `a`, `b`, and `c`, each of which has 100 lines.
- That creates three entries in `merge_request_diff_files`.
- Someone points out that I was meant to add `d`, also with 100 lines.
- I make that change and push, without changing `a`, `b`, or `c`.
- We insert four more rows in `merge_request_diff_files`, with the first three only differing in their `merge_request_diff_id` and (potentially) `relative_order` columns.
Now that we have separate tables for this, we could normalise even further by storing each distinct diff only once, keyed by a hash of its contents, like this:
- We create a new `merge_request_diff_file_contents` table with two columns:
  - `diff` - the equivalent of `merge_request_diff_files.diff` now.
  - `hash` - a hash (we can use whatever hash function makes most sense) of the `diff` column, which is indexed.
- `merge_request_diff_files` loses the `diff` column, and gains a `merge_request_diff_file_contents_hash` foreign key instead.
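
As a rough illustration of the proposal, here is a minimal sketch that replays the `a`, `b`, `c`, `d` scenario against the two-table layout. It makes several assumptions that are not part of the proposal: an in-memory SQLite database stands in for the real one, SHA-256 is picked arbitrarily as the hash function, the `new_path` column is only there to make the rows readable, and `store_diff_file` and `fake_diff` are hypothetical helpers.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE merge_request_diff_file_contents (
        hash TEXT PRIMARY KEY,  -- the primary key gives us the index on hash
        diff TEXT NOT NULL
    );
    CREATE TABLE merge_request_diff_files (
        merge_request_diff_id INTEGER NOT NULL,
        relative_order INTEGER NOT NULL,
        new_path TEXT NOT NULL,  -- illustrative column, an assumption
        merge_request_diff_file_contents_hash TEXT NOT NULL
            REFERENCES merge_request_diff_file_contents (hash),
        PRIMARY KEY (merge_request_diff_id, relative_order)
    );
""")


def store_diff_file(diff_id, order, path, diff):
    """Insert one file row; the diff text itself is stored only if it is new."""
    # SHA-256 is an arbitrary choice here; the proposal leaves the hash open.
    digest = hashlib.sha256(diff.encode("utf-8")).hexdigest()
    conn.execute(
        "INSERT OR IGNORE INTO merge_request_diff_file_contents (hash, diff) "
        "VALUES (?, ?)",
        (digest, diff),
    )
    conn.execute(
        "INSERT INTO merge_request_diff_files VALUES (?, ?, ?, ?)",
        (diff_id, order, path, digest),
    )


def fake_diff(path):
    """Build a made-up diff that adds 100 lines to the given file."""
    return "".join(f"+{path} line {i}\n" for i in range(100))


# First push: the MR adds a, b and c.
for order, path in enumerate(["a", "b", "c"]):
    store_diff_file(1, order, path, fake_diff(path))

# Second push: d is added; a, b and c are unchanged.
for order, path in enumerate(["a", "b", "c", "d"]):
    store_diff_file(2, order, path, fake_diff(path))

file_rows = conn.execute(
    "SELECT COUNT(*) FROM merge_request_diff_files"
).fetchone()[0]
content_rows = conn.execute(
    "SELECT COUNT(*) FROM merge_request_diff_file_contents"
).fetchone()[0]
print(file_rows, content_rows)  # prints "7 4"
```

There are still seven rows in `merge_request_diff_files`, but only four copies of the diff text are stored instead of seven. A real implementation would also need to decide how to treat hash collisions, which this sketch ignores.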
This is basically reinventing part of git inside our database, but it's pretty simple. Migrating will be hard, though, and we just migrated to `merge_request_diff_files` in the first place.