WIP: Prune unreferenced git LFS objects
James Edwards-Jones authored
app/workers/lfs_cleanup_worker.rb (new file, mode 0 → 100644, +14 −0)
Tracks LFS pointer blobs and uses these to remove LFS objects which are no longer referenced.
Once all references to a large file have been removed, the file itself should not be kept around.
On push:
- `newrev` of the updated ref (000->abc `master`)
- Update LfsPointer worker:
  - `rev-list` to identify blobs in the branch which are not included in already processed refs.
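
The `rev-list` step can be sketched roughly as follows. This is a minimal illustration, not the code in this merge request: `new_objects_command` and `parse_object_lines` are hypothetical helper names, and the sketch assumes the processed revisions are available as a plain array of SHAs. Note that `git rev-list --objects` lists commits, trees and blobs alike, so a real worker would still need to filter the result down to blobs.

```ruby
# Hypothetical helper (not part of this MR): builds the argv for listing
# objects reachable from `newrev` but not from any already processed revision.
def new_objects_command(newrev, processed_revs)
  cmd = ['git', 'rev-list', '--objects', newrev]
  # Only append --not when there is something to exclude.
  cmd += ['--not', *processed_revs] unless processed_revs.empty?
  cmd
end

# Hypothetical helper: parses `git rev-list --objects` output into
# [oid, path] pairs. Commit lines carry no path, so their path is nil.
def parse_object_lines(output)
  output.each_line.map do |line|
    oid, path = line.chomp.split(' ', 2)
    [oid, path]
  end
end
```

Building the command as an array rather than a string avoids shell quoting issues when ref names contain unusual characters.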
On Gc:
The main trade-off is memory and database space versus extra blob lookups on the NFS disk.
Storing the list of processed refs allows us to eliminate blobs in commits which have already been checked for LFS pointers. This also works for objects introduced by similar commits, as any objects introduced by both `C` and `C'` can be eliminated by `rev-list C' --not C --objects`. When a new branch is pushed, only new objects are checked, for the same reason.
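
To illustrate why the `--not` exclusion removes shared objects, here is a toy model in plain Ruby. It does not call git: the `reachable` hash and the object names are invented for the example, and `objects_to_check` is a hypothetical function that only mimics the set difference that `rev-list C' --not C --objects` performs.

```ruby
require 'set'

# Mimics `rev-list <include_rev> --not <exclude_revs...> --objects` on a toy
# model where `reachable` maps each commit to the set of objects it reaches.
def objects_to_check(reachable, include_rev, exclude_revs)
  excluded = exclude_revs.reduce(Set.new) { |acc, rev| acc | reachable.fetch(rev) }
  reachable.fetch(include_rev) - excluded
end

# Invented example data: C' is a similar commit that adds one new blob.
reachable = {
  'C'  => Set['tree1', 'blob_a', 'blob_b'],
  "C'" => Set['tree2', 'blob_a', 'blob_b', 'blob_c']
}

# blob_a and blob_b are shared with C, so only tree2 and blob_c remain.
remaining = objects_to_check(reachable, "C'", ['C'])
```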
This approach guarantees both that blob lookups are kept to a minimum and that all pointers have been found by default. If all pushes / `ReferenceChange`s have been processed, then all LFS pointers have been found, which makes it safe to delete those which are in the database but no longer on disk. Without this guarantee there could be LfsObjects in a project for which we have found one pointer but not another, and we could end up deleting LfsObjects which are still referenced by unfound pointers.
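
The safety condition above can be expressed as a small guard. This is a sketch under stated assumptions: `prunable_lfs_oids` and its keyword arguments are hypothetical names, OIDs are modelled as plain strings, and the count of unprocessed pushes is assumed to be known to the caller.

```ruby
# Hypothetical sketch: returns the OIDs of LFS objects that are safe to delete.
def prunable_lfs_oids(stored_oids:, referenced_oids:, unprocessed_pushes:)
  # While any push is unprocessed, pointers may exist that have not been
  # found yet, so no object can be considered unreferenced.
  return [] unless unprocessed_pushes.zero?

  # Everything stored but not referenced by a known pointer can go.
  stored_oids - referenced_oids
end
```

Returning an empty list while pushes are pending is the conservative choice: deletion is only delayed, never applied to an object that might still be referenced.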
- Prune `RecentLfsPush` entries past the 100 most recent to avoid the table becoming too large
- `UpdateLfsPointersWorker` per project?
- `Gitlab::Git::Blob.batch_lfs_metadata` should bypass Gitaly

TODO
Ideas:
TODO
For added migrations:
- `db/schema.rb`
- `down` method so the migration can be reverted
- `spec/migrations` if necessary (e.g. when migrating data)

For added tables:

For potentially slow queries:
- `EXPLAIN ANALYZE` and execution timings of the relevant queries