Dedupe `merge_request_diff_files`

merge_request_diff_files contains the entire diff history of every merge request, which means it grows very fast. Previously, we stored this in a serialised YAML column on merge_request_diffs, so we couldn't do much about that, but now we can.

The problem is this:

  1. I create an MR adding files a, b, and c, each of which has 100 lines.
  2. That creates three entries in merge_request_diff_files.
  3. Someone points out that I was meant to add d, also with 100 lines.
  4. I make that change and push, without changing a, b, or c.
  5. We insert four more rows in merge_request_diff_files, with the first three only differing in their merge_request_diff_id and (potentially) relative_order columns.
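
To make the duplication concrete, here is a small Python sketch of the scenario above. The tuple layout and the placeholder diff text are assumptions for illustration only, not the real schema or real diff output.

```python
from collections import Counter


def diff_for(path: str, lines: int = 100) -> str:
    # Hypothetical placeholder for the diff of a newly added file.
    return "".join(f"+{path} line {i}\n" for i in range(lines))


# Each row: (merge_request_diff_id, relative_order, path, diff)
rows = []
for diff_id, paths in [(1, ["a", "b", "c"]), (2, ["a", "b", "c", "d"])]:
    for order, path in enumerate(paths):
        rows.append((diff_id, order, path, diff_for(path)))

# Count how many rows carry each distinct diff text.
copies = Counter(diff for _, _, _, diff in rows)

total_rows = len(rows)
distinct_diffs = len(copies)
redundant_rows = total_rows - distinct_diffs

print(total_rows, distinct_diffs, redundant_rows)  # 7 4 3
```

Three of the seven rows store a diff that is already present in the table, and that share grows with every push that leaves most files untouched.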

Now that we have separate tables for this, we could normalise even further by storing each distinct diff only once, keyed by a hash of its contents, like this:

  1. We create a new merge_request_diff_file_contents table with two columns:
     - diff - the equivalent of merge_request_diff_files.diff now.
     - hash - an indexed hash (we can use whatever hash function makes most sense) of the diff column.
  2. merge_request_diff_files loses the diff column, and gains a merge_request_diff_file_contents_hash foreign key instead.
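
A minimal sketch of the proposed layout, using Python with an in-memory SQLite database. It is an illustration under assumptions, not a migration plan: SHA-256 is just one possible hash function, the new_path column and the add_diff_file helper are hypothetical, and the real tables have more columns.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE merge_request_diff_file_contents (
    hash TEXT PRIMARY KEY,   -- the primary key gives us the index on hash
    diff TEXT NOT NULL
);
CREATE TABLE merge_request_diff_files (
    merge_request_diff_id INTEGER NOT NULL,
    relative_order INTEGER NOT NULL,
    new_path TEXT NOT NULL,  -- assumed column, for illustration
    merge_request_diff_file_contents_hash TEXT NOT NULL
        REFERENCES merge_request_diff_file_contents (hash),
    PRIMARY KEY (merge_request_diff_id, relative_order)
);
""")


def add_diff_file(conn, diff_id, order, path, diff):
    """Hypothetical helper: store the diff once, then reference it by hash."""
    digest = hashlib.sha256(diff.encode("utf-8")).hexdigest()
    # Skip the insert if identical contents are already stored.
    conn.execute(
        "INSERT OR IGNORE INTO merge_request_diff_file_contents (hash, diff) "
        "VALUES (?, ?)",
        (digest, diff),
    )
    conn.execute(
        "INSERT INTO merge_request_diff_files VALUES (?, ?, ?, ?)",
        (diff_id, order, path, digest),
    )


# Replay the scenario: first push adds a, b, c; second push adds d.
for diff_id, paths in [(1, ["a", "b", "c"]), (2, ["a", "b", "c", "d"])]:
    for order, path in enumerate(paths):
        add_diff_file(conn, diff_id, order, path, f"+contents of {path}\n" * 100)

file_rows = conn.execute(
    "SELECT COUNT(*) FROM merge_request_diff_files").fetchone()[0]
content_rows = conn.execute(
    "SELECT COUNT(*) FROM merge_request_diff_file_contents").fetchone()[0]
print(file_rows, content_rows)  # 7 4
```

merge_request_diff_files still gets seven small rows, but the diff text itself is stored only four times. A real implementation would also need to decide how to handle hash collisions and when to delete contents that no row references any more.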

This is basically reinventing part of git inside our database, but it's pretty simple. Migrating will be hard, though, and we just migrated to merge_request_diff_files in the first place 😞