Skip to content

Normalise and de-duplicate committer and author details in merge_request_diff_commits

In #331523 (comment 583654940), as part of our sharding effort, we found that a lot of data in merge_request_diff_commits is used for storing committer and author details (= name + email). Most of this data is duplicate data.

To reduce the amount of space needed to store this data, we need to normalise and de-duplicate this data. To do so, we need two tables with these structures:

  • merge_request_diff_commit_names
    • id (primary key)
    • name (text, unique)
  • merge_request_diff_commit_emails
    • id (primary key)
    • email (text, unique)

We then need to add the following columns to merge_request_diff_commits:

  • committer_name_id
  • committer_email_id
  • author_name_id
  • author_email_id

These columns would have foreign keys to the appropriate tables (e.g. committer_name_id points to merge_request_diff_commit_names.id), using the nullify rule. The columns can be nullable, because I believe that commit author/committer details are technically not required (at least they aren't in our current database scheme).

The rule doesn't really matter though, as we'd never delete rows from the commit names/emails tables. Indexes on these columns are also not needed for that reason.

When we create a new row in merge_request_diff_commits, we take the following steps for the committer name, email, the author name, and the author email:

  1. Run a SELECT to see if an existing row is present in the appropriate target table (e.g. for the committer name we'd look in merge_request_diff_commit_names)
  2. If a row exists, use the ID of that row to populate the appropriate column in merge_request_diff_commits
  3. If no row exists, insert one and catch any unique record errors we may run into. If we run into such an error, run a SELECT to get the row
  4. Populate the column the same way as step 2

Migration Steps

Migrating the existing data is done as follows:

  1. A regular migration adds the new tables and columns. Initially these are all empty
  2. We add a column name_and_email_migrated to merge_request_diff_commits. This column is a boolean that defaults to false
  3. The code is changed to support reading from both. That is, if the ID columns have a value we read from them; otherwise we read from the old text based columns
  4. A post-deployment migration schedules the migrating of the existing rows, using the same logic for populating tables as outlines above. For every row migrated we set name_and_email_migrated to true
  5. In the next release, we run a regular migration that gets all rows where name_and_email_migrated is false and migrates them in place. This number should be close to zero (or zero itself), and mostly servers as a backup in case some background migrations didn't finish
  6. In this same release we change the code to only read from the new columns. We also add the necessary rules to ignore the old columns
  7. In this same release we add a post-deployment migration that removes the old text based columns
Edited by Craig Gomes