Transparently evict repositories to object storage
## Problem to solve

Repositories that are inactive, or accessed infrequently, can account for a significant proportion of repository storage on a large GitLab instance. If a repository is accessed infrequently, it may be more cost-effective to offload it to object storage during periods of inactivity.

## Further details

Gitaly supports moving repositories between different back-end storage solutions, both single-node storage and [Gitaly Cluster](https://docs.gitlab.com/ee/administration/gitaly/praefect.html). Ideally, this would enable a user to define a specific storage back end that is based on object storage. Using repository heuristics, it would then be possible to request that Gitaly move a specific repository to a storage location backed by object storage.

Implementing this inside Gitaly would make it transparent to users, except for a brief delay while the repository is transitioned to or from object storage.

**Note:** This proposal is not to have active, hot repositories stored on object storage. Read and write operations would occur on block storage; the repository would be evicted to object storage only after a period of inactivity.

## Proposal

As a system administrator, I should be able to enable a Gitaly Cluster feature, **Evict inactive repositories to object storage**.

As someone reading from or writing to a repository, I should not be aware that a repository has been evicted, besides an initial performance penalty while the repository is retrieved from object storage.

When enabled, a repository that has not been accessed recently should be evicted to object storage. This could probably use the same format as repository backups to object storage, and even share the same object storage bucket.

After eviction, when a request for the repository is received, the repository bundle should be downloaded from object storage transparently before the read or write operation is serviced.
### Technical notes/ideas

The MVC could be:

- an API that triggers the repository eviction
- any read/write operation automatically restores the repository
- it should be safe to evict an active repository (worst case for the user should be a timeout while the repository is immediately re-inflated after being evicted)

Note: Repositories cannot easily be accessed or modified while on object storage, so this is mostly applicable to "archived" type projects that are accessed rarely. Git also has no native object storage support, so this would need to be built entirely on top of Git within Gitaly, or with patches to the upstream Git project if that was deemed beneficial.

Future iterations:

- automatic eviction policies similar to automatic rebalancing

## Git vs Gitaly Implementation

The concept of object storage does not exist within Git itself. In fact, Git assumes block-level storage for its implementation. There are two paths forward here:

### Contribute object storage support to Upstream Git

In discussing this option with the team, it is questionable whether the Git project would directly benefit from supporting object storage, and it is therefore uncertain whether an implementation would be accepted upstream. This comes down to a few factors:

1. Git is both a client and a server, and this feature is really geared toward large Git server nodes, essentially those hosting a Git service such as GitLab. Therefore, the number of users seeking this feature would be quite low.
1. Given the way Git accesses data, adding a data model on top of file access to pull and put files from object storage is cumbersome and adds significant complexity to Git.
1. Git doesn't really understand a server with multiple repositories. It's instantiated repository by repository. Therefore, the idea of moving whole repositories between storages really belongs a layer higher.
### Implement this in Gitaly

The purpose of the Gitaly project is to extend standard Git storage to be more versatile, so implementing object storage support within Gitaly is a natural fit. Because Gitaly is designed to serve the storage needs of numerous repositories, it understands repository-level storage and already supports multiple storage back ends. It therefore makes the most sense to support repository-level object storage in Gitaly.