LFS objects are kept even if all of the commits that reference them are pruned due to branch deletion or history rewrite.
Proposal
Delete LFS objects which are no longer referenced by any commits in any project.
Currently, LFS objects are deleted based on project reference counts. Project reference counts don't get updated when commits are abandoned. If branches are deleted or history is rewritten, any commits with LFS pointers that later get pruned won't result in any change to the project's reference count on the objects. A project can end up not referencing an object in any commit while still maintaining a hold on it.
I propose using reference counting by commit instead of by project, and handling the case where commits are pruned.
!5901 (merged) addressed deleting these on project deletion, but in many cases you want to keep the project while abandoning a branch or rewriting history to fix the mistaken addition of bad data, without having to abandon the entire project.
That was added 6 months ago and only removes objects unreferenced by a project, not by commit. It is unrelated to the requested feature. I specifically referenced that merge request and described how it is insufficient in the description.
@DouweM and I spoke and came up with an outline to solve this (a rough model sketch follows the outline):
When uploading an LFS pointer, store its blob OID in the database along with the project_id and existing lfs_object_id, perhaps naming this LfsBlob
During GC, check for blob OIDs which are no longer referenced and, when removing them, look up the related lfs_object_id
Remove those LfsObjects from disk/storage, then remove the LfsObject and LfsBlob rows from the database
We might have to process existing LFS pointers when adding this feature for the database to be accurate. An alternate first iteration might instead use LfsBlob.where(project: project, oid: oids_to_remove_in_gc) to delete only files which use the new system.
It is also possible that a git lfs push --all results in LFS objects on the server which are not referenced by any LFS pointers on the remote. In this case we could do the reverse and look up all LfsObject rows which do not have matching LfsBlob records for that project.
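To make the outline concrete, here is a rough sketch of what the proposed LfsBlob model (also referred to as LfsObjectBlob later in this thread) and its table might look like. Only the names come from the outline above; the column types and indexes are assumptions, not a settled design.

```ruby
# Hypothetical migration for the proposed join table linking git pointer
# blobs to LFS objects within a project.
class CreateLfsBlobs < ActiveRecord::Migration
  def change
    create_table :lfs_blobs do |t|
      t.references :project, null: false, index: true
      t.references :lfs_object, null: false, index: true
      t.string :blob_oid, null: false # OID of the git blob holding the pointer

      t.timestamps null: false
    end

    add_index :lfs_blobs, [:project_id, :blob_oid], unique: true
  end
end

# Hypothetical model: one row per (project, pointer blob, LFS object).
class LfsBlob < ActiveRecord::Base
  belongs_to :project
  belongs_to :lfs_object
end
```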
When uploading an LFS pointer, store its blob OID in the database along with the project_id and existing lfs_object_id, perhaps naming this LfsBlob
This would look roughly like iterating over all incoming refs, their commits, their diffs, and collecting all new blobs. This is similar to https://gitlab.com/gitlab-org/gitlab-ce/blob/master/lib/gitlab/checks/change_access.rb#L221 and http://gitlab.com/gitlab-org/gitlab-ce/blob/master/app/services/git_push_service.rb#L78. Commit#raw_deltas returns an array of Gitlab::Git::Diff objects that each wrap a Rugged::Diff::Delta. You can get the old and new blob OIDs of this diff delta from delta.old_file[:oid] and delta.new_file[:oid]. If this value has changed, you know that a new blob OID is coming in. Once we know what blobs are coming in, we have to check each of them to see if it is an LFS pointer. For this, we have a convenient Gitlab::Git::Blob#lfs_pointer? method, which is also available on our Blob model which wraps this Gitlab::Git::Blob. If it looks like an LFS pointer, we can look up the LfsObject in the DB by Blob#lfs_oid, and create a new LfsObjectBlob record linking blob OID to LFS OID, also saving the project ID if appropriate.
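As a rough sketch of that flow, using the hypothetical LfsObjectBlob model and assuming the diff/blob methods behave as described above (details are untested):

```ruby
# Sketch: collect new LFS pointers introduced by a pushed commit and
# record them. LfsObjectBlob is the hypothetical join model from above.
def record_lfs_pointers(project, commit)
  commit.raw_deltas.each do |delta|
    new_oid = delta.new_file[:oid]
    next if new_oid == delta.old_file[:oid] # blob unchanged by this commit

    blob = project.repository.blob_at(commit.sha, delta.new_path)
    next unless blob && blob.lfs_pointer?

    lfs_object = LfsObject.find_by(oid: blob.lfs_oid)
    next unless lfs_object # pointer pushed but object never uploaded

    LfsObjectBlob.find_or_create_by!(
      project_id: project.id,
      lfs_object_id: lfs_object.id,
      blob_oid: new_oid
    )
  end
end
```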
During GC, check for blob OIDs which are no longer referenced and, when removing them, look up the related lfs_object_id
Remove those LfsObjects from disk/storage, then remove the LfsObject and LfsBlob rows from the database
I don't know if we can hook into GC to find the blob OIDs it's about to clean up. What we can do is extend the periodically run RemoveUnreferencedLfsObjectsWorker to iterate over all LfsObjectBlobs and see if each blob OID still exists inside that project. If not, we can delete the LfsObjectBlob. If an LfsObjectProject has 0 LfsObjectBlobs left for that project_id, we can remove the LfsObjectProject. If an LfsObject has 0 LfsObjectProjects left, we can remove the LfsObject record and file (this is already what LfsObject.destroy_unreferenced does).
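Sketched out, that worker might look something like this. LfsObjectBlob is still hypothetical, and checking blob existence through Rugged is an assumption about how reachability would be tested:

```ruby
# Sketch of the extended periodic cleanup described above.
class RemoveUnreferencedLfsObjectsWorker
  include Sidekiq::Worker

  def perform
    # 1. Drop blob links whose git blob no longer exists in the project repo.
    LfsObjectBlob.find_each do |link|
      rugged = link.project.repository.rugged
      link.destroy unless rugged.exists?(link.blob_oid)
    end

    # 2. Drop project links that have no blob links left.
    #    (Pre-existing links that never tracked blobs need special care;
    #    see the discussion below.)
    LfsObjectProject.find_each do |project_link|
      next if LfsObjectBlob.exists?(project_id: project_link.project_id,
                                    lfs_object_id: project_link.lfs_object_id)

      project_link.destroy
    end

    # 3. Remove LFS objects (rows and files) that lost all project links;
    #    LfsObject.destroy_unreferenced already does this today.
    LfsObject.destroy_unreferenced
  end
end
```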
We might have to process existing LFS pointers when adding this feature for the database to be accurate. An alternate first iteration might instead use LfsBlob.where(project: project, oid: oids_to_remove_in_gc) to delete only files which use the new system.
We need to somehow know that an old LfsObjectProject that doesn't have LfsObjectBlobs shouldn't be treated as unreferenced and automatically cleaned up. We could add a boolean flag to these old LfsObjectProjects that don't track_blobs yet. The LfsObject attached to this LfsObjectProject would only be cleaned up once it loses all LfsObjectProjects, i.e. when all connected projects are actually deleted.
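For example, something like this hypothetical migration on the existing lfs_objects_projects join table (the flag name comes from the comment above; the wiring is an assumption):

```ruby
# Hypothetical migration: rows created before blob tracking default to
# false and must never be treated as unreferenced by the blob-based
# cleanup. Rows created by the new push-time tracking would set it true.
class AddTrackBlobsToLfsObjectsProjects < ActiveRecord::Migration
  def change
    add_column :lfs_objects_projects, :track_blobs, :boolean,
               default: false, null: false
  end
end
```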
It is also possible that a git lfs push --all results in LFS objects on the server which are not referenced by any LFS pointers on the remote. In this case we could do the reverse and look up all LfsObject rows which do not have matching LfsBlob records for that project.
Is that something that could happen? I'd imagine git-lfs does some local tracking of files it needs or doesn't, and wouldn't push a file that's not referenced by any LFS pointer on the remote, because it would already be deleted. If not, that's no big deal either, since those LfsObjects could be cleaned up in the regular RemoveUnreferencedLfsObjectsWorker.
Is that something that could happen? I'd imagine git-lfs does some local tracking of files it needs or doesn't, and wouldn't push a file that's not referenced by any LFS pointer on the remote, because it would already be deleted.
It usually does, but it can be forced to push all locally referenced objects with --all.
This would look roughly like iterating over all incoming refs, their commits, their diffs, and collecting all new blobs.
We'll have to do this in a background worker, as this could be quite slow. We might also be duplicating work, e.g. a newly pushed branch might consist largely of commits we have already checked for LFS pointers. Using rev-list --not PROCESSED_REFS --objects might be a more efficient way to search for new LFS pointers.
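As an illustration of the rev-list idea (the shelling-out style and ref lists are illustrative, not necessarily how GitLab would invoke git):

```ruby
require 'open3'

# List OIDs of objects reachable from the newly pushed refs but not from
# refs we've already processed, so previously scanned commits are skipped.
def candidate_object_oids(repo_path, new_refs, processed_refs)
  cmd = ['git', '-C', repo_path, 'rev-list', '--objects',
         *new_refs, '--not', *processed_refs]
  output, status = Open3.capture2(*cmd)
  raise 'git rev-list failed' unless status.success?

  # Lines look like "<oid>" for commits or "<oid> <path>" for trees and
  # blobs; each candidate blob would then be checked with Blob#lfs_pointer?.
  output.each_line.map { |line| line.split(' ', 2).first }
end
```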
We need to somehow know that an old LfsObjectProject that doesn't have LfsObjectBlobs shouldn't be treated as unreferenced and automatically cleaned up. We could add a boolean flag to these old LfsObjectProjects that don't track_blobs yet.
If we find and store all LfsPointer (aka LfsObjectBlob) records when processing the first push, we can use this to guarantee that a project's unconnected LfsObjectProjects can safely be removed. At the moment I'm using the presence of a ReferenceChange which has been processed for a given project to determine this.
What could be enough to begin with, and would at least be useful for debugging, is a view of a table for a given project with:
LFS object ID
LFS object size
number of references
and a button to delete unreferenced LFS objects.
@kforner Once we're able to detect unreferenced LFS objects I think it makes sense to have this run automatically. Would you still be looking for a dashboard and ability to delete individual objects?
@jamedjo Sure, but it would be useful to check, and to have an overview of the size taken by the LFS objects. Moreover, it is not always trivial in GitLab to know when background jobs have run.
I don't think deleting individual objects would be useful.
I would love to see this functionality implemented as well. When dealing with media content, people regularly modify many-gigabyte binary files. This fills up disk space pretty quickly, so it would be nice to have the ability to physically remove files from the disk. A combination of the BFG repo cleaner and 'git lfs prune' would be a good start for solving this issue.
Later it could be automated even further if GitLab provided a simple interface for the 'BFG repo cleaner' part of the operation as well.
Thank you for working on this task!
Yes, I'd like to see this as well; our project began with a lot of placeholder 3D models that are taking up quite some space now and are no longer used. There is no reason for us to keep history that old, so it would be very nice to just delete all of that old history and the associated LFS files.
Not sure if this is the right place to mention this, but what would be SUPER awesome is an option in the GUI to "delete all history before <date>" that takes a snapshot of the repository at that date, makes it the initial commit, and removes everything from before then, including LFS files.
I've searched around for good ways to do this; GitHub says "the only way is to delete your repository and reinitialize it, and this will also delete your wiki and bugs etc." Surely gitlab.com has a more elegant solution?
@youreperfect — If the pruning is implemented generally, I think you will be able to do what you want reasonably straightforwardly. First of all you would use git rebase to squash all the early commits into the snapshot you want. You would then do a forced push to overwrite the repository state on GitLab.
At the moment this won't clean up the LFS objects that are no longer referenced, but once this issue is sorted out, presumably it will.
@jramsay I think it was stalled on the potential NFS slowdown from scanning all objects to initially detect LFS pointers, and other uncertainties around the speed of rev-list. Since then we've used the rev-list approach for the LFS integrity checks, so we have more confidence, but that applied a 2000-object limit, so it's not quite enough on its own.
It would likely also need updating to work with Gitaly. One way to reduce scope might be to explicitly rule out GitLab.com, so we don't have to think about things at that scale until we are more certain of the performance characteristics on smaller non-NFS servers. We could then put it behind a feature flag to experiment and improve. I haven't looked at this in a while though, so that may not be necessary.
@jamedjo rather than ship a feature that doesn't work at all at GitLab.com scale, can we implement an iteration with a feature flag to control the object limit and only trigger it manually with the housekeeping task? This should work everywhere. /cc @nick.thomas
I'm not convinced we even need to do reference counting at the commit level. We can just remove the project link for an LFS object if we can prove that the project's repositories no longer use the LFS object at scan time.
I don't have a good handle on the potential data races - it's possible that a commit-reference version is more resilient, but I think either would be subject to fundamentally the same races.
can we implement an iteration with a feature flag to control the object limit
To delete an LFS file we need to be sure it isn't used anywhere in the repo. This means we need to be sure that newly pushed objects don't point to it, and that old LFS pointers referencing it no longer exist. That will mean scanning all blobs (limited partly by file size) to initially detect existing LFS pointers the first time this is run. When doing that we can't have an object limit, because we would then miss objects that might later end up being deleted. Alternatively we could do a full scan like this every time, but we also wouldn't be able to limit the object count on that scan.
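For illustration, the initial full scan might look roughly like this sketch using Rugged directly. The 200-byte cutoff reflects that valid LFS pointer files are tiny; the exact limit and method of iteration are assumptions:

```ruby
LFS_POINTER_PREFIX = 'version https://git-lfs'.freeze
LFS_POINTER_MAX_SIZE = 200 # real pointer files are well under this

# Walk every object in the object database and collect blob OIDs that look
# like LFS pointers. This is the expensive pass discussed above.
def scan_for_lfs_pointers(rugged_repo)
  pointer_oids = []

  rugged_repo.each_id do |oid|
    object = rugged_repo.lookup(oid)
    next unless object.is_a?(Rugged::Blob)
    next if object.size > LFS_POINTER_MAX_SIZE

    pointer_oids << oid if object.content.start_with?(LFS_POINTER_PREFIX)
  end

  pointer_oids
end
```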
If we know the hash of the LFS pointer I think we can just check if it still exists, unless a new LFS pointer might point to the same file, but that would probably need a change in LFS version or some other modification to the pointer itself. I can't remember the details of what we store during pushes, but we might still need an initial slow pass to store information for prior commits.
This should work everywhere
I can't remember for sure, but I think the concern was that checking every object in a full scan could put too much strain on the shared NFS drive and impact other repositories. I'd see the feature flag more as a way to prove/disprove that without blocking the feature, and to allow us to implement a more carefully throttled scan if that turned out to be the case.
I'm not convinced we even need to do reference counting at the commit level.
I'm not sure I follow what is meant by "commit level" or "commit-reference version" here. The approach I attempted 9 months ago involved using rev-list to track the relationship between LFS pointer/blob SHAs and the LFS object/large-file OIDs, which does sound similar. If doing a full repo scan during a housekeeping run is acceptable, we might be able to skip that though.
We use the GitLab LFS repository alone, with multiple legacy repositories that can't be hosted on the same GitLab instance sharing a single LFS repository. Would it be possible to make our situation work with this feature?
I think it would also be valuable to understand what competitors are offering at the moment. This may motivate developers to implement this feature in the first place. As an example, my company considers this feature critical when deciding which product to go with.
How about we start asking questions here and there:
https://community.atlassian.com/t5/Bitbucket-questions/LFS-objects-pruning/qaq-p/997341
Just nuked a lot of history of files that never should have gone into LFS, using the BFG repo cleaner. Pushed the changes, but no cigar. A rake gitlab:lfs:prune task would be nice. Or have the gitlab:lfs:check task also find orphaned files.
@manhnt is it, though? This is not an integration with $random_cloud_product_of_the_week, just a core part of GitLab that's not working properly, so obviously it gets deprioritized.
Just stumbled into this after I experienced some failed LFS pushes that nevertheless grew my repo by several gigabytes, which are now unrecoverable. This issue makes Git LFS pretty messy and barely usable on GitLab. If GitHub's bottom line on this is "delete and recreate the repo", fixing this properly on GitLab would give it a huge plus...
GitLab is moving all development for both GitLab Community Edition and Enterprise Edition into a single codebase. The current gitlab-ce repository will become a read-only mirror, without any proprietary code. All development is moved to the current gitlab-ee repository, which we will rename to just gitlab in the coming weeks. As part of this migration, issues will be moved to the current gitlab-ee project.

If you have any questions about all of this, please ask them in our dedicated FAQ issue.

Using "gitlab" and "gitlab-ce" would be confusing, so we decided to rename gitlab-ce to gitlab-foss to make the purpose of this FOSS repository more clear.

I created a merge request for CE, and this got closed. What do I need to do?

Everything in the ee/ directory is proprietary. Everything else is free and open source software. If your merge request does not change anything in the ee/ directory, the process of contributing changes is the same as when using the gitlab-ce repository.

Will you accept merge requests on the gitlab-ce/gitlab-foss project after it has been renamed?

No. Merge requests submitted to this project will be closed automatically.

Will I still be able to view old issues and merge requests in gitlab-ce/gitlab-foss?

Yes.

How will this affect users of GitLab CE using Omnibus?

No changes will be necessary, as the packages built remain the same.

How will this affect users of GitLab CE that build from source?

Once the project has been renamed, you will need to change your Git remotes to use this new URL. GitLab will take care of redirecting Git operations so there is no hard deadline, but we recommend doing this as soon as the projects have been renamed.

Where can I see a timeline of the remaining steps?