Change repository assignment data model (!2848) · Merge requests · GitLab.org / gitaly

Sami Hiltunen requested to merge smh-change-assignment-model into master Nov 26, 2020

In 243dc385, assigned column was added to the storage_repositories table. Added ahead of any implementation around it, it turned out to be worse choice to model the assignment information as an additional column instead of a separate table. The original idea behind adding a new column instead of a new table was mostly the following:

Most entries in the storage_repositories table refer to assigned nodes.
It's a smaller schema change to add the column.
It may be more performant and the queries may be easier to understand when there's less joining.

In practice this has proven to be more complex than necessary though.

Assignments and generations are logically completely separate. Storing the information in the same table forces both of the components to be aware of each other. We have to be careful not to delete the row if it is still assigned when we removed the repository from the storage. Likewise, when we remove the assignement, we have to be careful not to affect the generation value stored. Additionally, if there is not generation value nor an assignment, we'd have to delete the record. This complexity stems purely from having the data in the same record.

This forces both component's queries to be aware of each other which makes implementing and understanding them that much more difficult. This also leads to a lot more code work as we have to update every existing query to be aware of this.

With the records in a separate table, working with the data becomes easier. When we want to delete an assignment, we can just delete the row without considering any of the other data on there, like the generation.
As it is, it's not possible to model the current behavior of Praefect with the assignments to provide a graceful fallback. Currently the column defaults to true. That locks every repository in their current replication factor when assignments are finally beginning to be used. This is distinct from Praefect's current behavior where every repository is replicated to every node.

Having assignments in a separate table allows us to model unset replication factor as having no assignments for the repository. This allows us to enable variable replication factor on a per repository basis and later start enforcing it for every repository. We could model this also by making the assigned column nullable but that's where it begins to look like we should have separate tables when all of the relevant values are nullable.
We can use foreign key constraints to enforce data correctness. For example, we can have a foreign key constraint from the assignments to the repositories table to ensure the assignments always refer to repositories which should exist in the virtual storage. This is not possible with the storage_repositories table as some of the records may point to copies of deleted repositories still present on the storage nodes but logically not present on the virtual storage. Another constraint that could be put in place to enforce the repository's primary node is always assigned.

To better separate the concenrs between the different components, keep the code and queries more understandable and to better leverage the database feature, let's fix the schema before it becomes a problem after deploying the variable replication factor.

Related to #2971 (closed)

Change repository assignment data model

Merge request reports