LFS objects are kept even if all of the commits that reference them are pruned, due to branch deletion or history rewriting.
Proposal
Delete LFS objects which are no longer referenced by any commits in any project.
Currently, LFS objects are deleted based on project reference counts, which aren't updated when commits are abandoned. If branches are deleted or history is rewritten, commits containing LFS pointers can later be pruned without any change to the project's reference count on the objects. A project can end up referencing the objects in no commits while still maintaining a hold on them.
I propose using some method of reference counting by commit instead of by project, and handling the case where commits are pruned.
!5901 (merged) addressed deleting these on project deletion, but in many cases you want to keep the project while abandoning a branch or rewriting history to undo the mistaken addition of bad data, without having to abandon the entire project.
That was added 6 months ago and only removes objects unreferenced by any project, not by commit, so it is unrelated to the requested feature. I specifically referenced that merge request in the description and explained why it is insufficient.
@DouweM and I spoke and came up with an outline to solve this:
When uploading an LFS pointer, store its blob OID in the database along with the project_id and existing lfs_object_id, perhaps naming this LfsBlob
During GC, check for blob OIDs which are no longer referenced and, when removing them, look up the related lfs_object_id
Remove those LfsObjects from disk/storage followed by removing the LfsObject and LfsBlob from the database
We might have to process existing LFS pointers when adding this feature for the database to be accurate. An alternate first iteration might instead use LfsBlob.where(project: project, oid: oids_to_remove_in_gc) to delete only files which use the new system.
It is also possible that a git lfs push --all results in LFS objects on the server which are not referenced by any LFS pointers on the remote. In this case we could do the reverse and lookup all LfsObject rows which do not have matching LfsBlob records for that project.
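The three-way relationship the outline above implies can be sketched with plain Ruby structs in place of ActiveRecord models. All names here, including LfsBlob, follow the outline but are otherwise hypothetical:

```ruby
# lfs_objects:         one row per stored LFS file (oid = content hash)
# lfs_object_projects: existing join table linking projects to LFS objects
# lfs_blobs:           NEW - links a Git blob OID (the pointer file) to an
#                      lfs_object_id within a project
LfsObject        = Struct.new(:id, :oid, :size)
LfsObjectProject = Struct.new(:lfs_object_id, :project_id)
LfsBlob          = Struct.new(:project_id, :blob_oid, :lfs_object_id)

# Given blob OIDs that GC is about to prune, find the LFS objects to consider
# for deletion (the GC lookup step described above).
def lfs_object_ids_for_pruned_blobs(lfs_blobs, project_id, pruned_oids)
  lfs_blobs
    .select { |b| b.project_id == project_id && pruned_oids.include?(b.blob_oid) }
    .map(&:lfs_object_id)
end
```

With a row linking project 1's pointer blob to LFS object 10, pruning that blob OID would surface LFS object 10 as a removal candidate.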
When uploading an LFS pointer, store its blob OID in the database along with the project_id and existing lfs_object_id, perhaps naming this LfsBlob
This would look roughly like iterating over all incoming refs, their commits, their diffs, and collecting all new blobs. This is similar to https://gitlab.com/gitlab-org/gitlab-ce/blob/master/lib/gitlab/checks/change_access.rb#L221 and http://gitlab.com/gitlab-org/gitlab-ce/blob/master/app/services/git_push_service.rb#L78.

Commit#raw_deltas returns an array of Gitlab::Git::Diff objects, each wrapping a Rugged::Diff::Delta. You can get the old and new blob OIDs of a diff delta from delta.old_file[:oid] and delta.new_file[:oid]; if this value changed, you know that a new blob OID is coming in.

Once we know which blobs are coming in, we have to check each of them to see if it is an LFS pointer. For this we have a convenient Gitlab::Git::Blob#lfs_pointer? method, which is also available on our Blob model wrapping Gitlab::Git::Blob. If a blob looks like an LFS pointer, we can look up the LfsObject in the DB by Blob#lfs_oid, and create a new LfsObjectBlob record linking the blob OID to the LFS OID, also saving the project ID if appropriate.
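Concretely, the detection step might look like the sketch below. It is self-contained: the deltas are plain hashes shaped like Rugged::Diff::Delta's old_file/new_file fields, and instead of calling GitLab's Gitlab::Git::Blob#lfs_pointer? it parses the documented git-lfs pointer file format directly, so the helper names are illustrative rather than the real GitLab API:

```ruby
# The git-lfs pointer file format: a version line, the content's SHA-256,
# and the file size, each on its own line.
LFS_POINTER_RE = %r{\Aversion https://git-lfs\.github\.com/spec/v1\noid sha256:(\h{64})\nsize (\d+)\n\z}

# Returns the LFS OID if the blob content is a pointer file, else nil.
def lfs_oid(blob_content)
  blob_content[LFS_POINTER_RE, 1]
end

# Collect blob OIDs introduced by a push: every delta whose new_file OID
# differs from its old_file OID brings in a new blob to inspect.
def new_blob_oids(deltas)
  deltas
    .reject { |d| d[:new_file][:oid] == d[:old_file][:oid] }
    .map { |d| d[:new_file][:oid] }
end
```

Each OID returned by `new_blob_oids` would then have its content fetched and run through `lfs_oid` to decide whether an LfsObjectBlob record should be created.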
During GC, check for blob OIDs which are no longer referenced and, when removing them, look up the related lfs_object_id
Remove those LfsObjects from disk/storage followed by removing the LfsObject and LfsBlob from the database
I don't know if we can hook into GC to find the blob OIDs it's about to clean up. What we can do is extend the periodically run RemoveUnreferencedLfsObjectsWorker to iterate over all LfsObjectBlobs and check whether each blob OID still exists inside that project. If not, we can delete the LfsObjectBlob. If an LfsObjectProject has 0 LfsObjectBlobs left for that project_id, we can remove the LfsObjectProject. If an LfsObject has 0 LfsObjectProjects left, we can remove the LfsObject record and file (this is already what LfsObject.destroy_unreferenced does).
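A minimal in-memory sketch of that three-step cascade, assuming the model names from this thread and using a blob_exists lambda to stand in for the actual repository lookup:

```ruby
LfsBlob          = Struct.new(:project_id, :blob_oid, :lfs_object_id)
LfsObjectProject = Struct.new(:lfs_object_id, :project_id)

def cleanup!(lfs_blobs, lfs_object_projects, lfs_object_ids, blob_exists)
  # 1. Delete blob links whose Git blob no longer exists in the project.
  lfs_blobs.select! { |b| blob_exists.call(b.project_id, b.blob_oid) }

  # 2. Delete project links with no blob links left for that project/object pair.
  lfs_object_projects.select! do |link|
    lfs_blobs.any? do |b|
      b.project_id == link.project_id && b.lfs_object_id == link.lfs_object_id
    end
  end

  # 3. Delete LfsObjects (record + file) with no project links left --
  #    the part LfsObject.destroy_unreferenced already handles.
  lfs_object_ids.select! do |id|
    lfs_object_projects.any? { |l| l.lfs_object_id == id }
  end
end
```

Running this with one surviving pointer blob and one pruned one would cascade the pruned blob's LfsObjectProject and LfsObject away while leaving the surviving chain intact.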
We might have to process existing LFS pointers when adding this feature for the database to be accurate. An alternate first iteration might instead use LfsBlob.where(project: project, oid: oids_to_remove_in_gc) to delete only files which use the new system.
We need some way to know that an old LfsObjectProject without LfsObjectBlobs shouldn't be treated as unreferenced and automatically cleaned up. We could add a boolean flag to these old LfsObjectProjects that don't track_blobs yet. The LfsObject attached to such an LfsObjectProject would only be cleaned up once it loses all LfsObjectProjects, i.e. when all connected projects are actually deleted.
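That backwards-compatibility rule could be as simple as the sketch below, where the track_blobs flag and the helper name are hypothetical:

```ruby
# A project link created before the migration has track_blobs = false.
LegacyAwareLink = Struct.new(:lfs_object_id, :project_id, :track_blobs)

# A link is only removable as "unreferenced" if it opted in to blob tracking
# AND has no LfsObjectBlobs left; legacy links wait for project deletion.
def removable_as_unreferenced?(link, lfs_blob_count)
  link.track_blobs && lfs_blob_count.zero?
end
```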
It is also possible that a git lfs push --all results in LFS objects on the server which are not referenced by any LFS pointers on the remote. In this case we could do the reverse and lookup all LfsObject rows which do not have matching LfsBlob records for that project.
Is that something that could happen? I'd imagine git-lfs does some local tracking of which files it needs, and wouldn't push a file that's not referenced by any LFS pointer on the remote, because it would already be deleted. If not, that's no big deal either, since those LfsObjects could be cleaned up in that regular RemoveUnreferencedLfsObjectsWorker.
Is that something that could happen? I'd imagine git-lfs does some local tracking of which files it needs, and wouldn't push a file that's not referenced by any LFS pointer on the remote, because it would already be deleted.
It usually does, but it can be forced to push all locally referenced objects with --all.
This would look roughly like iterating over all incoming refs, their commits, their diffs, and collecting all new blobs.
We'll have to do this in a background worker as this could be quite slow. We might also be duplicating work, e.g. a new branch would contain lots of new commits we might already have checked for LFS pointers. Using rev-list --not PROCESSED_REFS --objects might be a more efficient way to search for new LFS pointers.
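For illustration, building that rev-list invocation might look like the sketch below (argument construction only, since actually running it requires a real repository; PROCESSED_REFS stands for the refs we have already scanned for LFS pointers):

```ruby
# List objects reachable from the newly pushed refs but NOT from refs we've
# already processed, so each blob is inspected at most once.
def rev_list_args(new_refs, processed_refs)
  ["git", "rev-list", "--objects", *new_refs, "--not", *processed_refs]
end
```

The resulting array could be handed to a spawned process, and each listed blob OID checked against the pointer format as above.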
We need to somehow know that an old LfsObjectProject that doesn't have LfsObjectBlobs, shouldn't be treated as unreferenced and automatically be cleaned up. We could add a boolean flag to these old LfsObjectProjects that don't track_blobs yet.
If we find and store all LfsPointer (aka LfsObjectBlob) objects when processing the first push, we can use this to guarantee a project can have unconnected LfsObjectProjects removed. At the moment I'm using the presence of a ReferenceChange that has been processed for a given project to determine this.
What could be enough to begin with, and at least useful for debugging, is a view of a table for a given project with:
LFS object ID
LFS object size
number of references
and a button to delete unreferenced LFS objects.
@kforner Once we're able to detect unreferenced LFS objects I think it makes sense to have this run automatically. Would you still be looking for a dashboard and ability to delete individual objects?
@jamedjo Sure, but it would be useful to check, and to have an overview of the size taken by the LFS objects. Moreover, it is not always obvious in GitLab when background jobs have run.
I don't think deleting individual objects would be useful.
I would love to see this functionality implemented as well. When dealing with media content, people regularly modify many-gigabyte binary files. This fills up disk space pretty quickly, so it would be nice to have the ability to physically remove files from the disk. A combination of BFG repo cleaner and 'git lfs prune' would be a good start for solving this issue.
Later it could be automated even further if GitLab provided a simple interface for the 'BFG repo cleaner' part of the operation as well.
Thank you for working on this task!
Yes, I'd like to see this as well. Our project began with a lot of placeholder 3D models that are taking up quite some space now and are no longer used. There is no reason for us to keep history that old, so it would be very nice to just delete all of that old history and the associated LFS files.
Not sure if this is the right place to mention this, but what would be SUPER awesome is an option in the GUI to "delete all history before a given date" that takes a snapshot at that date, makes it the initial commit, and removes everything, including LFS files, before then.
I've searched around for good ways to do this; GitHub says "the only way is to delete your repository and reinitialize it, and this will also delete your wiki and bugs etc." Surely GitLab has a more elegant solution?
@youreperfect If the pruning is implemented generally, I think you will be able to do what you want reasonably straightforwardly. First you would use git rebase to squash all the early commits into the snapshot you want. You would then do a forced push to overwrite the repository state on GitLab.
At the moment this won't clean up the LFS objects that are no longer referenced, but once this issue is sorted out, presumably it will.
@jramsay I think it was stalled on the potential NFS slowdown from scanning all objects to initially detect LFS pointers, and on other uncertainties around the speed of rev-list. Since then we've used the rev-list approach for the LFS integrity checks, so we have more confidence, but that applied a 2000-object limit, which isn't conclusive on its own.
It would likely also need updating to work with Gitaly. One way to reduce scope would be to explicitly rule out GitLab.com, so we don't have to think about things at that scale until we are more certain of the performance characteristics on smaller non-NFS servers. We could then put it behind a feature flag to experiment and improve. I haven't looked at this in a while though, so that may not be necessary.