WIP: Prune unreferenced git LFS objects
What
Tracks LFS pointer blobs and uses them to remove LFS objects that are no longer referenced.
Why
Once all references to a large file have been removed, the LFS object shouldn't be kept around.
ReferenceChange approach
On push:
- Store the name and newrev of the updated ref (000->abc master)
- Schedule worker to find new LFS pointers from that change
Update LfsPointer worker:
- Find new blobs by using rev-list to identify blobs in the branch which are not included in already processed refs.
  - Limit list of processed refs to the latest for each ref name and to N latest overall to avoid handling thousands of refs.
- Clean up entries past the 100 most recent to avoid the table becoming too large
- Find new LFS pointers from those blobs and store them in the database
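The last step depends on telling LFS pointer blobs apart from ordinary blobs. A minimal sketch, assuming the standard Git LFS v1 pointer layout (a version line, an oid sha256 line and a size line). The names parse_lfs_pointer, LfsPointerData and LFS_POINTER_MAX_BYTES are hypothetical and not taken from this MR:

```ruby
# Hypothetical helper, not part of this MR: recognise a Git LFS v1 pointer blob.
# Assumed cutoff: pointer files are tiny text blobs, so larger blobs are skipped
# without inspecting their content.
LFS_POINTER_MAX_BYTES = 1024

# byte_size is the size of the large file the pointer refers to.
LfsPointerData = Struct.new(:oid, :byte_size)

def parse_lfs_pointer(blob_data)
  return nil if blob_data.nil? || blob_data.bytesize > LFS_POINTER_MAX_BYTES
  return nil unless blob_data.start_with?("version https://git-lfs.github.com/spec/v1\n")

  # ^ and $ match at line boundaries in Ruby, so each key is matched on its own line.
  oid = blob_data[/^oid sha256:([0-9a-f]{64})$/, 1]
  size = blob_data[/^size (\d+)$/, 1]
  return nil unless oid && size

  LfsPointerData.new(oid, size.to_i)
end
```

Checking the blob size first means most blobs can be rejected without reading their content, which matters when blob lookups are the expensive part.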
On Gc:
- Skip projects which still have reference changes to process
- Skip projects which haven't had existing pointers processed
- Remove LfsPointers which no longer exist in the project
- Remove LfsObjectProjects/LfsObjects which are no longer referenced by pointers
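The GC steps above can be sketched as a set difference guarded by the two skip conditions. This is an in-memory illustration only: prunable_lfs_oids, its keyword arguments and the hash keys are hypothetical stand-ins for the database records, not the code in this MR:

```ruby
require "set"

# Hypothetical in-memory model of the GC step; real code would query the database.
def prunable_lfs_oids(project_lfs_oids:, stored_pointers:, existing_blob_ids:,
                      pending_reference_changes:, initial_scan_done:)
  # Guards: if pushes are still unprocessed, or the existing pointers were never
  # scanned, the pointer list may be incomplete, so nothing is safe to delete.
  return Set.new if pending_reference_changes || !initial_scan_done

  # Drop pointers whose blob no longer exists in the project.
  live_pointers = stored_pointers.select do |pointer|
    existing_blob_ids.include?(pointer[:blob_id])
  end

  # Any LFS object not referenced by a remaining pointer can be pruned.
  referenced = live_pointers.map { |pointer| pointer[:lfs_oid] }.to_set
  project_lfs_oids.to_set - referenced
end
```

An LFS object referenced by two pointers is kept as long as either pointer's blob still exists, which is why the pointers are filtered before the oids are compared.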
Rationale
The main trade-off is memory and database space versus extra blob lookups on the NFS disk.
Storing the list of processed refs allows us to eliminate blobs in commits which have already been checked for LFS pointers. This also works for objects introduced by similar commits: any objects introduced by both C and C' can be eliminated by rev-list C' --not C --objects, which lists only the objects reachable from C' but not from C. When a new branch is pushed, only new objects are checked, for the same reason.
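As a rough illustration of how the stored refs could feed that rev-list call, the sketch below keeps the latest entry per ref name, caps the list at N, and builds the argument list. ProcessedRef, latest_processed_revs and rev_list_args are hypothetical names, and the processed_at field is an assumption about how "latest" would be determined:

```ruby
# Hypothetical record of a processed push: the ref name, its newrev, and an
# assumed ordering field used to decide which entries are the most recent.
ProcessedRef = Struct.new(:ref_name, :newrev, :processed_at)

# Keep only the latest entry per ref name, then the `limit` most recent overall.
def latest_processed_revs(processed_refs, limit:)
  processed_refs
    .group_by(&:ref_name)
    .map { |_name, entries| entries.max_by(&:processed_at) }
    .sort_by(&:processed_at)
    .last(limit)
    .map(&:newrev)
end

# Arguments for: git rev-list --objects <newrev> --not <processed revs...>
# With no processed revs this lists every object reachable from newrev.
def rev_list_args(newrev, processed_revs)
  ["rev-list", "--objects", newrev, "--not", *processed_revs]
end
```

Note that rev-list --objects lists commits and trees as well as blobs, so its output still has to be filtered by object type before checking for pointers.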
This approach guarantees both that blob lookups are kept to a minimum and that all pointers have been found by default. If all pushes / ReferenceChanges have been processed, then all LFS pointers have been found, making it safe to delete those which are in the database but no longer on disk. Without this there could be LfsObjects in a project for which we have found one pointer but not another, and we could end up deleting LfsObjects which are still referenced by unfound pointers.
OldRev NewRev approach
Are there points in the code the reviewer needs to double check?
Todo
- Find all pointers of first push
- Clean up RecentLfsPush entries past the 100 most recent to avoid the table becoming too large
- MySQL for finding 100 most recent refs
- Add indices for columns used in queries
- Add database checklist
- Add performance testing plan to description (towards making a case why this won't perform badly on production)
- Ping someone for Gitaly review
- Ping someone for database review
- Performance review
Things I'll open MR discussions for
- What happens if multiple pushes occur before/during first run? Would we have multiple workers scanning the whole project? Could we benefit from a lock around UpdateLfsPointersWorker per project?
- Gitlab::Git::Blob.batch_lfs_metadata should bypass Gitaly
- Better way to get all blobs? Possible to get all blobs within size range?
Performance
TODO
Ideas:
- Generate tens of thousands of LfsObjectProject records, etc., and test
- Set up test instance and find memory characteristics
- Add one LFS object to the Linux project and test it
- Look up current count of LfsObjectProject items to find current scale
Benchmarks
TODO
Database checklist
For added migrations:
- Updated db/schema.rb
- Added a down method so the migration can be reverted
- Added the output of the migration(s) to the MR body
- Added the execution time of the migration(s) to the MR body
- Added tests for the migration in spec/migrations if necessary (e.g. when migrating data)
- Made sure the migration won't interfere with a running GitLab cluster, for example by disabling transactions for long running migrations
For added tables:
- Ordered columns based on their type sizes in descending order
- Added foreign keys if necessary
- Added indexes if necessary
  - Described the need for these indexes in the MR body
  - Made sure existing indexes can not be reused instead
For potentially slow queries:
- Included the raw SQL queries of the relevant queries
- Included the output of EXPLAIN ANALYZE and execution timings of the relevant queries
Acceptance criteria
- Changelog entry added, if necessary
- Documentation created/updated
- Tests added for this feature/bug
- Review
  - Has been reviewed by UX
  - Has been reviewed by Frontend
  - Has been reviewed by Backend
  - Has been reviewed by Database
- Conforms to the merge request performance guides