LFS objects are kept even if all of the commits that reference them are pruned due to branch deletion or history rewrite.
Proposal
Delete LFS objects which are no longer referenced by any commits in any project.
Currently, LFS objects are deleted based on project reference counts. Project reference counts don't get updated when commits are abandoned. If branches are deleted or history is rewritten, any commits with LFS pointers that later get pruned won't result in any change to the project's reference count on the objects. A project can end up not referencing an object in any commit while still maintaining a hold on it.
I propose using reference counting by commit instead of by project, and handling the case where commits are pruned.
!5901 (merged) addressed deleting these on project deletion, but in many cases you want to keep the project while abandoning a branch or rewriting history to fix the mistaken addition of bad data, without having to abandon the entire project.
That was added 6 months ago and only removes objects unreferenced by a project, not by commit. It is unrelated to the requested feature. I specifically referenced that merge request and described how it is insufficient in the description.
@DouweM and I spoke and came up with an outline to solve this (a rough model sketch follows the outline):
When uploading an LFS pointer, store its blob OID in the database along with the project_id and existing lfs_object_id, perhaps naming this LfsBlob
During GC, check for blob OIDs which are no longer referenced and, when removing them, look up the related lfs_object_id
Remove those LfsObjects from disk/storage, then remove the LfsObject and LfsBlob rows from the database
We might have to process existing LFS pointers when adding this feature for the database to be accurate. An alternate first iteration might instead use LfsBlob.where(project: project, oid: oids_to_remove_in_gc) to delete only files which use the new system.
It is also possible that a git lfs push --all results in LFS objects on the server which are not referenced by any LFS pointers on the remote. In this case we could do the reverse and look up all LfsObject rows which do not have matching LfsBlob records for that project.
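To make the outline concrete, here is a rough sketch of what the proposed LfsBlob model (also referred to as LfsObjectBlob later in this thread) and its table might look like. Only the names come from the outline above; the column types and indexes are assumptions, not a settled design.

```ruby
# Hypothetical migration for the proposed join table linking git pointer
# blobs to LFS objects within a project.
class CreateLfsBlobs < ActiveRecord::Migration
  def change
    create_table :lfs_blobs do |t|
      t.references :project, null: false, index: true
      t.references :lfs_object, null: false, index: true
      t.string :blob_oid, null: false # OID of the git blob holding the pointer

      t.timestamps null: false
    end

    add_index :lfs_blobs, [:project_id, :blob_oid], unique: true
  end
end

# Hypothetical model: one row per (project, pointer blob, LFS object).
class LfsBlob < ActiveRecord::Base
  belongs_to :project
  belongs_to :lfs_object
end
```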
When uploading an LFS pointer, store its blob OID in the database along with the project_id and existing lfs_object_id, perhaps naming this LfsBlob
This would look roughly like iterating over all incoming refs, their commits, their diffs, and collecting all new blobs. This is similar to https://gitlab.com/gitlab-org/gitlab-ce/blob/master/lib/gitlab/checks/change_access.rb#L221 and http://gitlab.com/gitlab-org/gitlab-ce/blob/master/app/services/git_push_service.rb#L78. Commit#raw_deltas returns an array of Gitlab::Git::Diff objects that each wrap a Rugged::Diff::Delta. You can get the old and new blob OIDs of this diff delta from delta.old_file[:oid] and delta.new_file[:oid]. If this value has changed, you know that a new blob OID is coming in. Once we know what blobs are coming in, we have to check each of them to see if it is an LFS pointer. For this, we have a convenient Gitlab::Git::Blob#lfs_pointer? method, which is also available on our Blob model which wraps this Gitlab::Git::Blob. If it looks like an LFS pointer, we can look up the LfsObject in the DB by Blob#lfs_oid, and create a new LfsObjectBlob record linking blob OID to LFS OID, also saving the project ID if appropriate.
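As a rough sketch of that flow, using the hypothetical LfsObjectBlob model and assuming the diff/blob methods behave as described above (details are untested):

```ruby
# Sketch: collect new LFS pointers introduced by a pushed commit and
# record them. LfsObjectBlob is the hypothetical join model from above.
def record_lfs_pointers(project, commit)
  commit.raw_deltas.each do |delta|
    new_oid = delta.new_file[:oid]
    next if new_oid == delta.old_file[:oid] # blob unchanged by this commit

    blob = project.repository.blob_at(commit.sha, delta.new_path)
    next unless blob && blob.lfs_pointer?

    lfs_object = LfsObject.find_by(oid: blob.lfs_oid)
    next unless lfs_object # pointer pushed but object never uploaded

    LfsObjectBlob.find_or_create_by!(
      project_id: project.id,
      lfs_object_id: lfs_object.id,
      blob_oid: new_oid
    )
  end
end
```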
During GC, check for blob OIDs which are no longer referenced and, when removing them, look up the related lfs_object_id
Remove those LfsObjects from disk/storage, then remove the LfsObject and LfsBlob rows from the database
I don't know if we can hook into GC to find the blob OIDs it's about to clean up. What we can do is extend the periodically run RemoveUnreferencedLfsObjectsWorker to iterate over all LfsObjectBlobs and see if each blob OID still exists inside that project. If not, we can delete the LfsObjectBlob. If an LfsObjectProject has 0 LfsObjectBlobs left for that project_id, we can remove the LfsObjectProject. If an LfsObject has 0 LfsObjectProjects left, we can remove the LfsObject record and file (this is already what LfsObject.destroy_unreferenced does).
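Sketched out, that worker might look something like this. LfsObjectBlob is still hypothetical, and checking blob existence through Rugged is an assumption about how reachability would be tested:

```ruby
# Sketch of the extended periodic cleanup described above.
class RemoveUnreferencedLfsObjectsWorker
  include Sidekiq::Worker

  def perform
    # 1. Drop blob links whose git blob no longer exists in the project repo.
    LfsObjectBlob.find_each do |link|
      rugged = link.project.repository.rugged
      link.destroy unless rugged.exists?(link.blob_oid)
    end

    # 2. Drop project links that have no blob links left.
    #    (Pre-existing links that never tracked blobs need special care;
    #    see the discussion below.)
    LfsObjectProject.find_each do |project_link|
      next if LfsObjectBlob.exists?(project_id: project_link.project_id,
                                    lfs_object_id: project_link.lfs_object_id)

      project_link.destroy
    end

    # 3. Remove LFS objects (rows and files) that lost all project links;
    #    LfsObject.destroy_unreferenced already does this today.
    LfsObject.destroy_unreferenced
  end
end
```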
We might have to process existing LFS pointers when adding this feature for the database to be accurate. An alternate first iteration might instead use LfsBlob.where(project: project, oid: oids_to_remove_in_gc) to delete only files which use the new system.
We need to somehow know that an old LfsObjectProject that doesn't have LfsObjectBlobs shouldn't be treated as unreferenced and automatically cleaned up. We could add a boolean flag to these old LfsObjectProjects that don't track_blobs yet. The LfsObject attached to this LfsObjectProject would only be cleaned up once it loses all LfsObjectProjects, i.e. when all connected projects are actually deleted.
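For example, something like this hypothetical migration on the existing lfs_objects_projects join table (the flag name comes from the comment above; the wiring is an assumption):

```ruby
# Hypothetical migration: rows created before blob tracking default to
# false and must never be treated as unreferenced by the blob-based
# cleanup. Rows created by the new push-time tracking would set it true.
class AddTrackBlobsToLfsObjectsProjects < ActiveRecord::Migration
  def change
    add_column :lfs_objects_projects, :track_blobs, :boolean,
               default: false, null: false
  end
end
```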
It is also possible that a git lfs push --all results in LFS objects on the server which are not referenced by any LFS pointers on the remote. In this case we could do the reverse and look up all LfsObject rows which do not have matching LfsBlob records for that project.
Is that something that could happen? I'd imagine git-lfs does some local tracking of files it needs or doesn't, and wouldn't push a file that's not referenced by any LFS pointer on the remote, because it would already be deleted. If not, that's no big deal either, since those LfsObjects could be cleaned up in the regular RemoveUnreferencedLfsObjectsWorker.
Is that something that could happen? I'd imagine git-lfs does some local tracking of files it needs or doesn't, and wouldn't push a file that's not referenced by any LFS pointer on the remote, because it would already be deleted.
It usually does, but it can be forced to push all locally referenced objects with --all.
This would look roughly like iterating over all incoming refs, their commits, their diffs, and collecting all new blobs.
We'll have to do this in a background worker, as this could be quite slow. We might also be duplicating work, e.g. a newly pushed branch might consist largely of commits we have already checked for LFS pointers. Using rev-list --not PROCESSED_REFS --objects might be a more efficient way to search for new LFS pointers.
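As an illustration of the rev-list idea (the shelling-out style and ref lists are illustrative, not necessarily how GitLab would invoke git):

```ruby
require 'open3'

# List OIDs of objects reachable from the newly pushed refs but not from
# refs we've already processed, so previously scanned commits are skipped.
def candidate_object_oids(repo_path, new_refs, processed_refs)
  cmd = ['git', '-C', repo_path, 'rev-list', '--objects',
         *new_refs, '--not', *processed_refs]
  output, status = Open3.capture2(*cmd)
  raise 'git rev-list failed' unless status.success?

  # Lines look like "<oid>" for commits or "<oid> <path>" for trees and
  # blobs; each candidate blob would then be checked with Blob#lfs_pointer?.
  output.each_line.map { |line| line.split(' ', 2).first }
end
```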
We need to somehow know that an old LfsObjectProject that doesn't have LfsObjectBlobs shouldn't be treated as unreferenced and automatically cleaned up. We could add a boolean flag to these old LfsObjectProjects that don't track_blobs yet.
If we find and store all LfsPointer (aka LfsObjectBlob) records when processing the first push, we can use this to guarantee that a project's unconnected LfsObjectProjects can safely be removed. At the moment I'm using the presence of a ReferenceChange which has been processed for a given project to determine this.
What could be enough to begin with, and would at least be useful for debugging, is a view of a table for a given project with:
LFS object ID
LFS object size
number of references
and a button to delete unreferenced LFS objects.
@kforner Once we're able to detect unreferenced LFS objects I think it makes sense to have this run automatically. Would you still be looking for a dashboard and ability to delete individual objects?
@jamedjo Sure, but it would be useful to check, and to have an overview of the size taken by the LFS objects. Moreover, it is not always trivial in GitLab to know when background jobs have run.
I don't think deleting individual objects would be useful.
I would love to see this functionality implemented as well. When dealing with media content, people regularly modify many-gigabyte binary files. This fills up disk space pretty quickly, so it would be nice to have the ability to physically remove files from the disk. A combination of the BFG repo cleaner and 'git lfs prune' would be a good start for solving this issue.
Later it could be automated even further if GitLab provided a simple interface for the 'BFG repo cleaner' part of the operation as well.
Thank you for working on this task!
Yes, I'd like to see this as well; our project began with a lot of placeholder 3D models that are taking up quite some space now and are no longer used. There is no reason for us to keep history that old, so it would be very nice to just delete all of that old history and the associated LFS files.
Not sure if this is the right place to mention this, but what would be SUPER awesome is an option in the GUI to "delete all history before <date>" that takes a snapshot of the repository at that date, makes it the initial commit, and removes everything from before then, including LFS files.
I've searched around for good ways to do this; GitHub says "the only way is to delete your repository and reinitialize it, and this will also delete your wiki and bugs etc." Surely gitlab.com has a more elegant solution?
@youreperfect — If the pruning is implemented generally, I think you will be able to do what you want reasonably straightforwardly. First of all you would use git rebase to squash all the early commits into the snapshot you want. You would then do a forced push to overwrite the repository state on GitLab.
At the moment this won't clean up the LFS objects that are no longer referenced, but once this issue is sorted out, presumably it will.
@jramsay I think it was stalled on the potential NFS slowdown from scanning all objects to initially detect LFS pointers, and other uncertainties around the speed of rev-list. Since then we've used the rev-list approach for the LFS integrity checks, so we have more confidence, but that applied a 2000-object limit, so it's not quite enough on its own.
It would likely also need updating to work with Gitaly. One way to reduce scope might be to explicitly rule out GitLab.com, so we don't have to think about things at that scale until we are more certain of the performance characteristics on smaller non-NFS servers. We could then put it behind a feature flag to experiment and improve. I haven't looked at this in a while though, so that may not be necessary.
@jamedjo rather than ship a feature that doesn't work at all at GitLab.com scale, can we implement an iteration with a feature flag to control the object limit and only trigger it manually with the housekeeping task? This should work everywhere. /cc @nick.thomas
I'm not convinced we even need to do reference counting at the commit level. We can just remove the project link for an LFS object if we can prove that the project's repositories no longer use the LFS object at scan time.
I don't have a good handle on the potential data races - it's possible that a commit-reference version is more resilient, but I think either would be subject to fundamentally the same races.
can we implement an iteration with a feature flag to control the object limit
To delete an LFS file we need to be sure it isn't used anywhere in the repo. This means we need to be sure that newly pushed objects don't point to it, and that old LFS pointers referencing it no longer exist. That will mean scanning all blobs (limited partly by file size) to initially detect existing LFS pointers the first time this is run. When doing that we can't have an object limit, because we would then miss objects that might later end up being deleted. Alternatively we could do a full scan like this every time, but we also wouldn't be able to limit the object count on that scan.
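For illustration, the initial full scan might look roughly like this sketch using Rugged directly. The 200-byte cutoff reflects that valid LFS pointer files are tiny; the exact limit and method of iteration are assumptions:

```ruby
LFS_POINTER_PREFIX = 'version https://git-lfs'.freeze
LFS_POINTER_MAX_SIZE = 200 # real pointer files are well under this

# Walk every object in the object database and collect blob OIDs that look
# like LFS pointers. This is the expensive pass discussed above.
def scan_for_lfs_pointers(rugged_repo)
  pointer_oids = []

  rugged_repo.each_id do |oid|
    object = rugged_repo.lookup(oid)
    next unless object.is_a?(Rugged::Blob)
    next if object.size > LFS_POINTER_MAX_SIZE

    pointer_oids << oid if object.content.start_with?(LFS_POINTER_PREFIX)
  end

  pointer_oids
end
```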
If we know the hash of the LFS pointer I think we can just check if it still exists, unless a new LFS pointer might point to the same file, but that would probably need a change in LFS version or some other modification to the pointer itself. I can't remember the details of what we store during pushes, but we might still need an initial slow pass to store information for prior commits.
This should work everywhere
I can't remember for sure, but I think the concern was that checking every object in a full scan could put too much strain on the shared NFS drive and impact other repositories. I'd see the feature flag more as a way to prove/disprove that without blocking the feature, and to allow us to implement a more carefully throttled scan if that turned out to be the case.
I'm not convinced we even need to do reference counting at the commit level.
I'm not sure I follow what is meant by "commit level" or "commit-reference version" here. The approach I attempted 9 months ago involved using rev-list to track the relationship between LFS pointer/blob SHAs and the LFS object/large-file OIDs, which does sound similar. If doing a full repo scan during a housekeeping run is acceptable, we might be able to skip that though.
We use the GitLab LFS repository alone, with multiple legacy repositories that can't be hosted on the same GitLab instance sharing a single LFS repository. Would it be possible to make our situation work with this feature?
I think it would also be valuable to understand what competitors are offering at the moment. This may motivate developers to implement this feature in the first place. As an example, my company considers this feature critical when deciding which product to go with.
How about we start asking questions here and there:
https://community.atlassian.com/t5/Bitbucket-questions/LFS-objects-pruning/qaq-p/997341
Just nuked a lot of history of files that never should have gone into LFS, using the BFG repo cleaner. Pushed the changes, but no cigar. A rake gitlab:lfs:prune task would be nice. Or have the gitlab:lfs:check task also find orphaned files.
@manhnt is it, though? This is not an integration with $random_cloud_product_of_the_week, just a core part of GitLab that's not working properly, so obviously it gets deprioritized.
Just stumbled into this after I experienced some failed LFS pushes that nevertheless grew my repo by several gigabytes, which are now unrecoverable. This issue makes Git LFS pretty messy and barely usable on GitLab. If GitHub's bottom line on this is "delete and recreate the repo", fixing this properly on GitLab would give it a huge plus...
GitLab is moving all development for both GitLab Community Edition and Enterprise Edition into a single codebase. The current gitlab-ce repository will become a read-only mirror, without any proprietary code. All development is moved to the current gitlab-ee repository, which we will rename to just gitlab in the coming weeks. As part of this migration, issues will be moved to the current gitlab-ee project.

If you have any questions about all of this, please ask them in our dedicated FAQ issue.

Using "gitlab" and "gitlab-ce" would be confusing, so we decided to rename gitlab-ce to gitlab-foss to make the purpose of this FOSS repository more clear.

I created a merge request for CE, and this got closed. What do I need to do?

Everything in the ee/ directory is proprietary. Everything else is free and open source software. If your merge request does not change anything in the ee/ directory, the process of contributing changes is the same as when using the gitlab-ce repository.

Will you accept merge requests on the gitlab-ce/gitlab-foss project after it has been renamed?

No. Merge requests submitted to this project will be closed automatically.

Will I still be able to view old issues and merge requests in gitlab-ce/gitlab-foss?

Yes.

How will this affect users of GitLab CE using Omnibus?

No changes will be necessary, as the packages built remain the same.

How will this affect users of GitLab CE that build from source?

Once the project has been renamed, you will need to change your Git remotes to use this new URL. GitLab will take care of redirecting Git operations so there is no hard deadline, but we recommend doing this as soon as the projects have been renamed.

Where can I see a timeline of the remaining steps?