Pluggable object databases in Git
# Summary

Git currently stores objects in one of two formats:

- Loose objects store each object as an individual file.
- Packfiles store multiple objects together so that they can be deltified against each other, which saves storage space when there are many incremental changes.

These formats work reasonably well for small- and medium-sized repositories, but they have limitations:

- They cannot store binary files efficiently. Git does not create deltas for large blobs, so every large binary blob is stored as a full (zlib-compressed) copy.
- Repository housekeeping requires us to pack loose objects into packfiles and to merge small packfiles into bigger ones. In large repositories, this housekeeping can become prohibitively expensive, sometimes taking many hours.
- Object metadata, such as the information stored in commit-graphs, needs to be recomputed "manually" every once in a while. As a consequence, this information is often out of date.

In addition to these general limitations, there are limitations specific to Gitaly:

- In a replicated setup, we have to store the objects on every Gitaly node.
- Gitaly wants to ensure ACID properties via the write-ahead log, but doing so is costly with the current formats because Git does not support proper transactions for objects.
- We want to have direct access to objects, but that requires us to reimplement access to them unless we use Git.

Pluggable object databases are intended to solve these issues. The idea is that Git can support multiple different backends for storing objects. With this infrastructure, we can iterate on how Git stores objects in general to address the needs of large Git repositories on the client side. We can also implement backends that are specific to Gitaly to address our own needs more directly.
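
To make the idea of interchangeable backends more concrete, here is a minimal Python sketch. It is not Git's actual C interface: the names `ObjectBackend`, `InMemoryBackend`, `DirectoryBackend`, and `object_id` are hypothetical and exist only for illustration. The one Git detail it relies on is how a SHA-1 object ID is derived, namely by hashing the header `<type> <size>\0` followed by the object content.

```python
import hashlib
import os
import zlib
from abc import ABC, abstractmethod
from typing import Dict, Tuple


def object_id(kind: str, data: bytes) -> str:
    """Git-style SHA-1 object ID: hash of b'<kind> <size>\\0' + data."""
    header = f"{kind} {len(data)}".encode() + b"\0"
    return hashlib.sha1(header + data).hexdigest()


class ObjectBackend(ABC):
    """Hypothetical interface that every storage backend would implement."""

    @abstractmethod
    def write(self, kind: str, data: bytes) -> str: ...

    @abstractmethod
    def read(self, oid: str) -> Tuple[str, bytes]: ...

    @abstractmethod
    def exists(self, oid: str) -> bool: ...


class InMemoryBackend(ObjectBackend):
    """Keeps objects in a dict; stands in for any non-file-based store."""

    def __init__(self) -> None:
        self._objects: Dict[str, Tuple[str, bytes]] = {}

    def write(self, kind: str, data: bytes) -> str:
        oid = object_id(kind, data)
        self._objects[oid] = (kind, data)
        return oid

    def read(self, oid: str) -> Tuple[str, bytes]:
        return self._objects[oid]

    def exists(self, oid: str) -> bool:
        return oid in self._objects


class DirectoryBackend(ObjectBackend):
    """One zlib-compressed file per object, loosely modelled on loose objects."""

    def __init__(self, root: str) -> None:
        self._root = root

    def _path(self, oid: str) -> str:
        # Fan out by the first two hex digits, as loose objects do.
        return os.path.join(self._root, oid[:2], oid[2:])

    def write(self, kind: str, data: bytes) -> str:
        oid = object_id(kind, data)
        path = self._path(oid)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        header = f"{kind} {len(data)}".encode() + b"\0"
        with open(path, "wb") as f:
            f.write(zlib.compress(header + data))
        return oid

    def read(self, oid: str) -> Tuple[str, bytes]:
        with open(self._path(oid), "rb") as f:
            raw = zlib.decompress(f.read())
        # The header contains no NUL byte, so the first NUL ends it.
        header, _, data = raw.partition(b"\0")
        kind, _size = header.decode().split(" ")
        return kind, data

    def exists(self, oid: str) -> bool:
        return os.path.exists(self._path(oid))
```

Callers only ever talk to `ObjectBackend`, so the same object ID is produced and the same content is returned regardless of which backend sits underneath. That property is what would let a backend be swapped without touching the rest of the code.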

Eventually, pluggable object databases could unlock the ability for us to store objects in a distributed database and facilitate direct access to those objects. This opens up new opportunities going forward, such as instant horizontal scalability of Gitaly nodes.

## Business case

We face all kinds of bottlenecks today that derive from the format Git uses to store objects. With the introduction of pluggable object databases, we can use our own backend formats and start to address these bottlenecks more strategically:

- We can address performance issues in Gitaly's write-ahead log by introducing transactions into our custom object backends directly.
- We can address client-side needs to store large binary files more efficiently, for example by using content-defined chunking of files.
- We can achieve horizontal scalability in a Raft-based world, because we do not have to replicate a repository's objects before a new node can serve any traffic.

Furthermore, introducing the infrastructure for pluggable object databases allows us to think about major changes to the way objects are stored in the first place. This is in contrast to the current iterative changes, which only lead to small incremental performance improvements. We thus minimize the strategic risk that the current storage format for objects will eventually no longer scale well enough to serve our customers with ever-growing monorepos.

## Exit criteria

The exit criterion for this epic is to have the infrastructure ready for an alternative object backend. This requires us to refactor the code base to go through a well-abstracted interface and to introduce repository extensions. Designing the new object storage format is explicitly not a goal of this epic; that will be handled as a subsequent step.
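
The exit criteria pair a well-abstracted interface with a repository extension. One way to picture how the two could fit together is a registry of backends that is consulted when a repository is opened. The Python sketch below is illustrative only: the config key `extensions.objectstorage`, the backend names `files` and `example-db`, and all function and class names are assumptions, not Git's real configuration or API.

```python
from typing import Callable, Dict

# HYPOTHETICAL config key; the real extension name is not given in this document.
EXTENSION_KEY = "extensions.objectstorage"
# HYPOTHETICAL name for the existing loose-object/packfile storage.
DEFAULT_BACKEND = "files"


class UnknownBackendError(Exception):
    """Raised when a repository requests a backend this build does not know."""


# Registry mapping a backend name to a factory that opens it for a repository.
_BACKENDS: Dict[str, Callable[[str], object]] = {}


def register_backend(name: str, factory: Callable[[str], object]) -> None:
    _BACKENDS[name] = factory


def open_object_database(repo_path: str, config: Dict[str, str]) -> object:
    """Pick the backend named by the repository's extension, if any."""
    name = config.get(EXTENSION_KEY, DEFAULT_BACKEND)
    factory = _BACKENDS.get(name)
    if factory is None:
        # Refuse to guess: an unknown backend must not be silently ignored.
        raise UnknownBackendError(f"unknown object storage backend: {name!r}")
    return factory(repo_path)


class FilesBackend:
    """Placeholder for today's loose-object and packfile storage."""

    def __init__(self, repo_path: str) -> None:
        self.repo_path = repo_path


class ExampleDatabaseBackend:
    """Placeholder for some future alternative backend."""

    def __init__(self, repo_path: str) -> None:
        self.repo_path = repo_path


register_backend("files", FilesBackend)
register_backend("example-db", ExampleDatabaseBackend)
```

A repository without the extension keeps using the default backend, so existing repositories would be unaffected. A repository that names a backend the running build does not know is rejected outright, which avoids reading or writing objects in the wrong format.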

<!-- STATUS NOTE START -->

## Status 2026-06-18

:clock1: **total hours spent this week by all contributors**: 30

:tada: **achievements**:

- The "loose" backend has been merged upstream. This is the third backend we have upstreamed so far. The "packed" backend is the last one that we have to upstream before we can start abstracting more functionality.
- Refactorings for "setup.c" have been merged upstream to centralize creation of the object database. This is a prerequisite for introducing the object storage format extension.

:issue-blocked: **blockers**:

- Git is in feature freeze now due to the pending Git 2.55 release, so upstream progress is expected to slow down.

:arrow_forward: **next**:

- The "packed" backend is currently under review upstream. Once that patch series is merged, we will have turned all existing sources into proper pluggable backends, which will then allow us to abstract more functionality. This work will then also be a lot easier to parallelize.

_Copied from https://gitlab.com/groups/gitlab-org/-/epics/15061#note_3467365753_

<!-- STATUS NOTE END -->