Generate repository paths in Gitaly (#8132) · Epics · GitLab.org

Generate repository paths in Gitaly

# Background

Currently, Rails addresses each repository by the `(storage, relative_path)` tuple. Rails is responsible for generating the `relative_path` and does so following the [hashed storage schema](https://docs.gitlab.com/ee/administration/repository_storage_types.html#hashed-storage).

For Gitaly and Gitaly Cluster, this means that the client controls the storage paths of the repositories. This feels backwards and complicates a number of things. The storage service should be responsible for making storage-related decisions such as where repositories are stored on disk. Some of the complications include:

- Each of the create, rename and delete repository operations has to be applied both on the disk and in the database. An operation can end up only partially applied, where it has already been performed on the disk but the database updates have not been performed. This necessitates conflict handling logic, such as checking whether a repository already exists on the disk, whether it has already been deleted, and so forth. Renames in particular can be dangerous: the repository is moved, so out-of-sync database records can cause the repository to be lost.
  - While this conflict resolution logic may work reasonably well with a single Gitaly, it's still error prone and one needs to make sure it is properly handled everywhere (example: https://gitlab.com/gitlab-org/gitaly/-/issues/2034).
  - The conflict handling doesn't work in Gitaly Cluster, as inconsistencies can arise between the internal storages, which Rails has no access to.
  - Gitaly Cluster fixed these inconsistencies with https://gitlab.com/gitlab-org/gitaly/-/issues/3485 by handling repository creations, deletions and renames in a manner that can't cause conflicts. The consistency fixes have been applied behind the current interface by virtualizing the relative paths. The fixes only keep the Praefect database in sync with the internal storages of the cluster.
These consistency problems still exist in the interface between Rails and the storages (Gitaly / Gitaly Cluster). Giving repositories a unique, permanent ID is what made it possible to handle the problematic operations in an atomic manner. For details, see the [documentation](https://docs.gitlab.com/ee/administration/gitaly/#atomicity-of-operations).
- Any sort of storage schema change is also difficult with Rails generating the relative paths. If the repository's location on disk changes, the repository's records in the Rails database have to be updated as well.
- The storage details are leaking to Rails and not abstracted away.
- All of the storage details that leak from the interface are amplified by Gitaly Cluster. If Rails handles something rather than relying on Gitaly, that logic has to be duplicated in Praefect for it to be correctly applied across the internal storages. Good examples of this are the relative path generation logic that now exists in Praefect and the repository maintenance logic that had to be replicated.

It would be beneficial to contain as much of the storage details in Gitaly as possible. Moving the repository path handling to Gitaly would allow us to simplify the architecture, simplify the interface by removing the path handling logic from Rails and Praefect, and apply the same consistency fixes in Rails that we now have in Praefect. Instead of the client providing a relative path on repository creation, it simply creates a repository; Gitaly assigns it a unique ID and returns it on completion. The clients would always refer to the repository by this permanent ID.

# Relevance with current work

In https://gitlab.com/gitlab-org/gitaly/-/issues/3965, Praefect needs unique IDs for each of the replicas. This would allow us to implement replica deletions as an atomic operation as well. Because deletions are not atomic, we currently have the `delete_replica` job that is never dropped.
Reducing the number of different replication jobs helps with https://gitlab.com/gitlab-org/gitaly/-/issues/4214, as we have fewer jobs to model in the database. There are also other locations, like the leader election, that could be simplified if we don't have to check for this pending delete.

While this could be implemented behind the current interface as well, that would require pushing more path handling logic to Praefect. The virtual repository IDs added in https://gitlab.com/gitlab-org/gitaly/-/issues/3485 are still necessary with the proposed architecture where Gitaly generates the repository IDs, but the replica IDs generated by Praefect are an internal detail and would not need to be virtualized. The replica IDs in the proposed architecture would be the physical repository IDs generated by Gitaly. Generating them in Praefect would thus be stepping on Gitaly's toes in the planned architecture. This change also serves to remove complexity in the current system by simplifying the interfaces.

# Implementation

A repository needs a unique, permanent ID in the context of a single Gitaly storage. This is also sufficient to derive a unique path for the repository from the repository ID in the same manner as [Praefect currently does](https://docs.gitlab.com/ee/administration/gitaly/#praefect-generated-replica-paths-gitlab-150-and-later).

As this is local to a Gitaly, it suffices to store the repository ID sequence in a file in each storage. For example, we could store the sequence in a file at `<storage-root>/+gitaly/repository-id.sequence`. The content of the file is the last successfully acquired ID, for example `5`. To acquire a new unique ID, Gitaly would increment the number in the file in an atomic manner. This ID could then be used to derive a unique path in the storage, for example `@repositories/e7/f6/6`.

Gitaly needs to distinguish object pools from other repositories.
That could be done by placing them under a prefix directory distinct from that of the other repositories, for example `@pools`, as is done currently by Praefect and Rails.

The generated repository ID would be returned after successful completion of [one of the repository-creating RPCs](https://gitlab.com/gitlab-org/gitaly/-/blob/ab0e402a9e910d99b13ef3f1631844d5fa386c97/internal/praefect/coordinator.go#L155-160). The clients would then access the repository by its `(storage, repository_id)`. The relative path would eventually be removed from the interface.

This scheme allows the path handling to remain local to Gitaly, thus simplifying Rails and Praefect and leaving Gitaly in better control of the implementation details of the storage.

## Praefect

Praefect already internally generates a virtual repository ID for every repository on the cluster. Praefect would return the Praefect-generated virtual repository ID to the client when a repository is created. It would internally use the Gitaly-generated physical repository IDs as the replica IDs.

Praefect currently does not support the replicas having different paths. Some work would be necessary to ensure each code path properly handles each replica having a unique ID. The benefit is that we can begin deleting replicas atomically as well, as described in https://gitlab.com/gitlab-org/gitaly/-/issues/3965.

On proxying a request, Praefect does a lookup from a `(virtual_storage, relative_path)` to a virtual repository ID. The virtual repository ID is used for all internal operations and would remain so. When sending requests to Gitaly, Praefect currently rewrites the `(virtual_storage, relative_path)` to `(storage, replica_path)`. The Gitaly-generated repository IDs are used as the unique replica IDs in Praefect, so the proxied messages would be rewritten to `(storage, replica_id)` instead.

## Migration

Migration is likely the biggest aspect of this work.
All of the repositories ultimately have to be assigned an ID by Gitaly, moved on the disk to the new location, and have their records in Praefect and Rails updated. There are some steps that we can take to make this more iterative.

We could implement the logic on the side in Gitaly in a manner that it can be opted into. If opted into, `CreateRepository*` calls assign a repository ID and return it as part of the call. If Gitaly's RPCs are called with the repository ID present in the repository message, they locate the repository based on the repository ID. This allows for initially developing the functionality in Gitaly and testing it separately before it is plugged into a live system.

Afterwards, we could adapt Praefect to work with the Gitaly-generated repository IDs and opt in to the behavior. Praefect already rewrites the messages, so rolling out the changes first behind Praefect should be easier, as we can still keep Rails out of the picture. The Gitaly-generated repository IDs could be used with new repositories to begin with. After the changes have been validated, we could migrate the existing repositories to Gitaly-generated IDs.

Once the changes have been validated with Gitaly and Praefect, we could begin looking at updating Rails to rely on the storage service to generate the repository IDs for new repositories. Eventually, we'd need to perform a migration that has Gitaly assign a repository ID to each existing repository and move it to the location where Gitaly expects it to be stored according to the new ID.
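
The opt-in lookup described in the migration steps could be sketched roughly as follows. This is only an illustration in Python, not Gitaly code (Gitaly itself is written in Go). The `Repository` dataclass, its `repository_id` field and the `locate_repository` helper are hypothetical names. The path layout assumes the `@repositories` prefix from the example above and the SHA-256 based two-level sharding that Praefect uses for replica paths.

```python
import hashlib
import posixpath
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Repository:
    """Hypothetical stand-in for the repository message sent with each RPC."""
    storage_name: str
    relative_path: str = ""
    repository_id: Optional[int] = None  # assumed new, optional field


def locate_repository(repo: Repository, storage_roots: Dict[str, str]) -> str:
    """Resolve the on-disk path of a repository.

    If the caller opted in by sending a repository ID, the path is derived
    from the ID. Otherwise the client-provided relative path is used, as today.
    """
    root = storage_roots[repo.storage_name]
    if repo.repository_id is not None:
        # Assumption: shard by the first four hex digits of the SHA-256
        # of the decimal ID, as Praefect does for replica paths.
        digest = hashlib.sha256(str(repo.repository_id).encode()).hexdigest()
        return posixpath.join(
            root, "@repositories", digest[:2], digest[2:4], str(repo.repository_id)
        )
    return posixpath.join(root, repo.relative_path)
```

Because the ID takes precedence only when it is present, callers that still send `(storage, relative_path)` keep working unchanged while Praefect, and later Rails, are migrated.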

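The sequence file described under Implementation could be handled along these lines. This is a minimal sketch under stated assumptions, not the proposed implementation: the lock file name, the temporary file name and both function names are hypothetical, an advisory `flock` is just one possible way to make the increment atomic, and the `fcntl` module limits the sketch to Unix-like systems. The path derivation assumes the same SHA-256 based sharding that Praefect uses for replica paths.

```python
import fcntl
import hashlib
import os
from pathlib import Path


def acquire_repository_id(storage_root: str) -> int:
    """Increment the per-storage sequence and return the newly acquired ID."""
    seq_dir = Path(storage_root) / "+gitaly"
    seq_dir.mkdir(parents=True, exist_ok=True)
    seq_path = seq_dir / "repository-id.sequence"
    tmp_path = seq_dir / "repository-id.sequence.tmp"  # hypothetical name
    lock_path = seq_dir / "repository-id.lock"  # hypothetical name

    with open(lock_path, "w") as lock:
        # Serialize concurrent allocations on this storage.
        fcntl.flock(lock, fcntl.LOCK_EX)
        # The file holds the last successfully acquired ID; start from 0.
        last_id = int(seq_path.read_text()) if seq_path.exists() else 0
        new_id = last_id + 1
        # Write to a temporary file and rename it over the sequence file so
        # a crash never leaves a partially written sequence behind.
        with open(tmp_path, "w") as tmp:
            tmp.write(str(new_id))
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, seq_path)
        return new_id  # the lock is released when the file is closed


def repository_path(repository_id: int) -> str:
    """Derive the storage-relative path from a repository ID."""
    digest = hashlib.sha256(str(repository_id).encode()).hexdigest()
    return f"@repositories/{digest[:2]}/{digest[2:4]}/{repository_id}"
```

With the sequence file containing `5`, `acquire_repository_id` returns `6`, and `repository_path(6)` yields `@repositories/e7/f6/6`, matching the example in the text.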