Missing fsyncs may lead to data loss and repository corruption
Gitaly is missing fsyncs in multiple operations. This may lead to data loss or repository corruption if a host crashes.
- Object quarantine migration has multiple issues:
  - Objects are moved from the quarantine directory into the repository's object database. Their directory entries are not fsynced, so the objects may disappear after a crash. If Git has already updated references to point to these objects, the repository may be corrupted after the crash.
  - Unrelated to fsyncs: the objects are not migrated in dependency order. For example, a commit may be migrated into the repository before its tree. If the migration is interrupted partway, that commit could later be referenced, because Gitaly doesn't verify full connectivity before creating references, leading to a corrupted repository.
- Repository creations are not fsynced:
  - The repository's directory entry is not fsynced after the move: https://gitlab.com/gitlab-org/gitaly/-/blob/master/internal/gitaly/repoutil/create.go#L222-226
- Repository removals are not fsynced: https://gitlab.com/gitlab-org/gitaly/-/blob/816ee7cc1b07ee36aa9c3f5ed93c2715db5c4673/internal/gitaly/service/repository/remove.go#L80-89
- Repository moves are not fsynced: https://gitlab.com/gitlab-org/gitaly/-/blob/816ee7cc1b07ee36aa9c3f5ed93c2715db5c4673/internal/gitaly/service/repository/rename.go#L93-95
- Pools:
  - Object pool removal is not fsynced: https://gitlab.com/gitlab-org/gitaly/-/blob/cfb79565c3c4099551583fd656b3f676f89299c0/internal/git/objectpool/pool.go#L134
There are likely more, and we should vet the code base to find them. These are the writes that came to mind quickly.
Git should fsync its own writes, so these are mostly cases where Gitaly itself is performing the writes. If Git doesn't fsync in some cases, those would have to be fixed in Git.
The writes in the WAL (&8911) are fsynced, so this would be resolved once that's in place. We may want to address these before the WAL given the data loss potential.
Edited by Sami Hiltunen