Normalise and de-duplicate committer and author details in merge_request_diff_commits
In #331523 (comment 583654940), as part of our sharding effort, we found that a lot of data in merge_request_diff_commits is used for storing committer and author details (= name + email). Most of this data is duplicate data.
To reduce the amount of space needed to store this data, we need to normalise and de-duplicate this data. To do so, we need two tables with these structures:
-
merge_request_diff_commit_names-
id(primary key) -
name(text, unique)
-
-
merge_request_diff_commit_emails-
id(primary key) -
email(text, unique)
-
We then need to add the following columns to merge_request_diff_commits:
committer_name_idcommitter_email_idauthor_name_idauthor_email_id
These columns would have foreign keys to the appropriate tables (e.g. committer_name_id points to merge_request_diff_commit_names.id), using the nullify rule. The columns can be nullable, because I believe that commit author/committer details are technically not required (at least they aren't in our current database scheme).
The rule doesn't really matter though, as we'd never delete rows from the commit names/emails tables. Indexes on these columns are also not needed for that reason.
When we create a new row in merge_request_diff_commits, we take the following steps for the committer name, email, the author name, and the author email:
- Run a
SELECTto see if an existing row is present in the appropriate target table (e.g. for the committer name we'd look inmerge_request_diff_commit_names) - If a row exists, use the ID of that row to populate the appropriate column in
merge_request_diff_commits - If no row exists, insert one and catch any unique record errors we may run into. If we run into such an error, run a
SELECTto get the row - Populate the column the same way as step 2
Migration Steps
Migrating the existing data is done as follows:
- A regular migration adds the new tables and columns. Initially these are all empty
- We add a column
name_and_email_migratedtomerge_request_diff_commits. This column is a boolean that defaults tofalse - The code is changed to support reading from both. That is, if the ID columns have a value we read from them; otherwise we read from the old text based columns
- A post-deployment migration schedules the migrating of the existing rows, using the same logic for populating tables as outlines above. For every row migrated we set
name_and_email_migratedtotrue - In the next release, we run a regular migration that gets all rows where
name_and_email_migratedisfalseand migrates them in place. This number should be close to zero (or zero itself), and mostly servers as a backup in case some background migrations didn't finish - In this same release we change the code to only read from the new columns. We also add the necessary rules to ignore the old columns
- In this same release we add a post-deployment migration that removes the old text based columns