Implement a way to figure out where quarantined objects connect to the repository
Write-ahead logging in Gitaly (&8911) requires that all writes to repositories go through the log. This also means that no object should end up in the repository without being committed through the log. While a pack file sits in the log, the objects it depends on could be pruned, because only objects that are reachable from references are guaranteed to stay in place. If the objects the pack file depends on are pruned, the pack file no longer applies to the repository and applying the log entry fails.
The first iteration of pack file logging includes all objects that become newly reachable from the new reference tips being written in the transaction. This ensures that the unreachable objects won't be pruned while the pack file is in the log and the reference updates haven't been applied yet. However, this is very expensive, as it requires walking the object graph of the repository to figure out which objects need to be included in the pack. The cost of this operation scales with the size of the object graph and the number of references in the repository.
We need a more efficient way to do this. At a minimum, we'd have to walk the new objects in the quarantine directory to ensure they are valid and connected. However, we could also assume that the objects in the repository are already valid. If they were not, the repository would already be corrupted. Given this, we'd ideally have a way to walk the new objects and stop as soon as we've verified that they are valid and connect to the repository's object database. The connection points are the objects that a pack file containing the new objects would depend on. We can then hold on to these objects while the pack file is in the log by creating internal references prior to committing the log entry with the pack. This approach would scale with the number of new objects and new references in the write, not with the size of the repository, and is thus a lot more efficient.
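The internal references mentioned above could be created in one batch by feeding `create` commands to `git update-ref --stdin`. The following is a minimal sketch only: the `refs/internal/wal-keep/` namespace, the `KeepAliveRefCommands` function, and the idea of keying the references by a log entry number are all hypothetical and are not taken from Gitaly.

```go
package main

import (
	"fmt"
	"strings"
)

// keepAliveRefPrefix is a hypothetical namespace for the internal references.
// The issue does not name one; any namespace that pruning respects would do.
const keepAliveRefPrefix = "refs/internal/wal-keep/"

// KeepAliveRefCommands builds input for `git update-ref --stdin` that creates
// one reference per dependency of the logged pack file. Each line has the form
// "create <ref> <new-oid>". The logEntry parameter is an assumed identifier
// for the log entry, used here only to group the references.
func KeepAliveRefCommands(logEntry uint64, dependencies []string) string {
	var sb strings.Builder
	for _, oid := range dependencies {
		fmt.Fprintf(&sb, "create %s%d/%s %s\n", keepAliveRefPrefix, logEntry, oid, oid)
	}
	return sb.String()
}
```

The returned string would be written to the standard input of `git update-ref --stdin` before the log entry is committed. Presumably the references could be deleted again once the log entry has been applied, since the new reference tips keep the objects reachable from that point on.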
To summarize:
- We'd need a way to rev walk the new reference tips in the transaction, and print the objects in the repository's object database where the new objects connect. This means the objects in `GIT_ALTERNATE_OBJECT_DIRECTORIES` that objects from `GIT_OBJECT_DIRECTORY` connect to. These are the dependencies of the logged pack file.
- That walk should also print out the new objects that were actually reachable from the new tips, so we only include the objects needed by the new tips in the logged pack file.
- We'd ideally have a way to disable the connectivity checks in `git-receive-pack`, and instead rely on this approach for pushes as well, both for efficiency and for centralizing the checks for all writes. This is covered in Option to disable connectivity checks in receiv... (#163).
`git-receive-pack` performs a full walk when doing connectivity checks. This is needlessly heavy, as the walk described above would also suffice for checking whether pushes are valid. The approach relaxes the connectivity checks performed by Git by assuming that the objects in the repository are valid. In our case, all writes will go through the log and all writes are checked to be valid, so this transitively gives the property that all objects in the repository are valid. This may not hold for repositories where invalid objects were written before these checks were in place, so we may need some migration to remove invalid objects. This may also be rare enough that we don't have to care, since such a repository was already corrupted in any case.
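The walk summarized above can be sketched on an in-memory model of the object graph. This is an illustration under stated assumptions, not Gitaly's or Git's implementation: `ObjectID`, `Quarantine`, `WalkResult`, and `WalkQuarantine` are hypothetical names, the quarantine is modeled as a map from each new object to the objects it references directly, and the `inRepository` callback stands in for a lookup in the repository's existing object database.

```go
package main

import (
	"fmt"
	"sort"
)

// ObjectID is a hypothetical stand-in for a Git object ID.
type ObjectID string

// Quarantine models the new objects (GIT_OBJECT_DIRECTORY). Each new object
// maps to the objects it references directly, for example a commit's tree
// and parents, or a tree's entries.
type Quarantine map[ObjectID][]ObjectID

// WalkResult holds the two outputs the issue asks for.
type WalkResult struct {
	Packed       []ObjectID // new objects reachable from the tips
	Dependencies []ObjectID // existing objects where the new objects connect
}

// WalkQuarantine walks from the new reference tips through quarantined
// objects only. It stops at the first object found in the repository, on the
// assumption that everything already in the repository is valid.
func WalkQuarantine(tips []ObjectID, quarantine Quarantine, inRepository func(ObjectID) bool) (WalkResult, error) {
	var result WalkResult
	seen := make(map[ObjectID]bool)
	stack := append([]ObjectID(nil), tips...)

	for len(stack) > 0 {
		oid := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if seen[oid] {
			continue
		}
		seen[oid] = true

		if referenced, isNew := quarantine[oid]; isNew {
			// New object: include it in the pack and keep walking.
			result.Packed = append(result.Packed, oid)
			stack = append(stack, referenced...)
			continue
		}
		if inRepository(oid) {
			// Connection point: record it and do not walk any further.
			result.Dependencies = append(result.Dependencies, oid)
			continue
		}
		return WalkResult{}, fmt.Errorf("object %s is in neither the quarantine nor the repository", oid)
	}

	sortIDs(result.Packed)
	sortIDs(result.Dependencies)
	return result, nil
}

// sortIDs orders the IDs so the output is deterministic.
func sortIDs(ids []ObjectID) {
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
}
```

In this model the work is bounded by the number of new objects plus their direct references into the repository, which matches the scaling argument made above. Quarantined objects that are not reachable from the new tips never enter `Packed`, and a reference to an object that exists in neither place is reported as an error, which plays the role of the connectivity check.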