Users or admins run up against storage limits, and realize that they are using a lot of storage on LFS objects they no longer want or need. Then they find there is no way to remove them from the GitLab-managed LFS storage without deleting the project.
The Git repo is the SSOT (single source of truth) for which LFS objects are still referenced, but the data itself lives elsewhere and is only tracked per project. So to know whether you can delete the data, you either have to scan the repo or have been tracking the pointers the whole time.
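For context, a minimal sketch of how that per-project tracking looks from the Rails side (illustrative only; the project path is made up): the database knows which LFS objects a project has and how much they count against storage, but only the Git history knows which OIDs are still referenced.

```ruby
# Illustrative only: the database tracks which OIDs a project *has*,
# not which OIDs its Git history still *references*.
project = Project.find_by_full_path('group/project') # hypothetical path

project.lfs_objects.count      # LFS objects attached to this project
project.lfs_objects.sum(:size) # what counts against LFS storage usage
```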
Ideal solution
Track pointers the whole time and automatically delete objects when unreferenced. It is technically possible, and it is the most user-friendly solution. We started down this road in gitlab-foss!14479 (closed). But this is a large, complex MR, and from my limited understanding, more than weight 5, especially once performance validation etc. is included. There are apparently a lot of performance risks and pitfalls, and it looks like existing repos weren't handled yet.
Proposal
Rather than focusing on the ideal solution first, we can iterate using a boring solution:
Create a rake task to clean a single project in a non-performant way. Note that this would also allow us to validate our performance concerns. Maybe from there we could implement a way for project Maintainers to queue a cleanup, and run those in single file (one at a time). It's not ideal, but users would at least have some recourse, and it is similar to how we handle other cleanup tasks.
Needs to be well tested, because otherwise it may remove data that is still needed.
What does success look like, and how can we measure that?
A systems administrator can schedule sudo gitlab-rake gitlab:cleanup:lfs_files on a per-project basis, e.g. over the weekend, to remove LFS files that are no longer required.
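To make the shape of that concrete, here is a rough sketch of how such a per-project task might be structured. It is not the actual implementation: referenced_lfs_oids is a hypothetical helper standing in for whatever mechanism ends up listing every LFS pointer reachable from any ref.

```ruby
# Rough sketch only; `referenced_lfs_oids` is a placeholder, not real GitLab code.
namespace :gitlab do
  namespace :cleanup do
    desc 'Remove LFS objects that are no longer referenced in a project'
    task :lfs_files, [:project_path] => :environment do |_t, args|
      project = Project.find_by_full_path(args[:project_path])
      abort 'Project not found' unless project

      referenced_oids = referenced_lfs_oids(project) # slow: scans the full history

      project.lfs_objects.where.not(oid: referenced_oids).find_each do |lfs_object|
        # Detach the object from this project only; whether the file itself can be
        # deleted also depends on other projects (e.g. forks) sharing the same OID.
        project.lfs_objects.delete(lfs_object)
      end
    end
  end
end
```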
@fzimmer This needs 4 MRs, so unfortunately a decent amount of overhead, but the two "install git-lfs" MRs should be small, and the overall concept seems straightforward. So the time from start to completion may be long, but I estimate the total time spent around weight 4 or 5.
Thanks @mkozono! Are you happy to still leave it as one issue? I think having four MRs against it and utilising a checklist is a bit more straightforward than creating an epic. Less overhead.
Got one more suggestion for this task... I've found that Geo sometimes thinks that a file is replicated but it doesn't exist on disk. I think this is a hangover from the cutover from the old Geo replication methodology to the current one, but I'm finding a few LFS files that Geo::FileRegistry thinks are replicated but which are not on disk, e.g.
I've found that Geo sometimes thinks that a file is replicated but it doesn't exist on disk.
@pherlihy This sounds like a distinctly different problem. This issue's proposal makes sense to run only on a primary. I think your comment here is worthy of a separate issue and investigation by group::geo since it is a bug for DR.
For transparency, while this issue is marked as group::geo (as the Geo Team will perform the work), this work is being done in the domain of group::source code. (/cc @m_gill)
Since this rake task will be deleting files, we need to ensure that a robust set of test cases is in place around this work. @jennielouie Making sure you are aware of this issue, as it would be good to have your opinion on the tests here too.
Not a silly question, but @fzimmer is correct, we can't rely on git lfs prune because we also need to handle non-local files. Also I think it would orphan lfs_objects records.
Sorry @rnienaber, I remembered wrong. The main reason I think we can't use git lfs prune is that it's designed to be run by an individual user, with the assumption that all the LFS files are available in the remote.
Example:
I'm using a repo which has lots of LFS files in its history. I run out of local disk space, so I run git lfs prune, which deletes all LFS files that are not referenced in the current state of the codebase.
But if I check out an old commit that references an old LFS file, I can pull all those old LFS files down from the remote.
So, if I am the remote, then I must keep all LFS files that are referenced at any point in the repo's history; otherwise we have dead/invalid references.
To add more context so you don't have to go back too far: I ask because my understanding is that Gitaly is supposed to protect itself from harm, and I assume this command may be slower on some projects than some Gitaly timeout.
However, we accept in cases like this (see other cleanup rake tasks that find all files of a kind) that an administrator may need to do something slow to correct a situation. This rake task would initially be written to operate on a single project at a time. If a sysadmin chooses to write a script to run it on all projects, that is up to them, and whether they think their infrastructure can take it. This rake task could also be used as a way to iterate on improving performance or reducing load, while providing immediate value for many sysadmins.
We currently don't install git-lfs on Gitaly servers at all. So if you want to use git lfs ls-files that is one thing you need to deal with (make sure it's installed in omnibus etc.).
@mkozono There is no reason we can't have slow RPC calls. The crucial questions are: when do we make those calls, and how do we avoid running the same slow call 1000 times (in a row or in parallel) if somebody e.g. pushes 1000 branches? That is: do it in Sidekiq, and manage how often you do it per repo.
Some of the existing LFS tracking stuff runs during Git push hooks, which is just a plain bad idea: Git hook validations are implemented as HTTP API calls, which have stricter timeouts than Sidekiq jobs, and these calls run synchronously during the push, so the user has to sit and wait for these calls to finish before their local git push command is done.
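For illustration, that pattern (queue the slow call in Sidekiq and throttle it per repository) might look roughly like the sketch below; the worker name, lease key, and scan_lfs_pointers helper are placeholders, not real GitLab code.

```ruby
# Hedged sketch of "do it in Sidekiq, and manage how often you do it per repo".
class LfsPointerScanWorker
  include ApplicationWorker

  LEASE_TIMEOUT = 1.hour

  def perform(project_id)
    # A per-project exclusive lease means 1000 pushed branches still trigger
    # at most one expensive scan per lease window.
    lease = Gitlab::ExclusiveLease.new("lfs_pointer_scan:#{project_id}",
                                       timeout: LEASE_TIMEOUT.to_i)
    return unless lease.try_obtain

    project = Project.find_by_id(project_id)
    return unless project

    scan_lfs_pointers(project) # placeholder: the slow RPC call lives here
  end
end
```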
We currently don't install git-lfs on Gitaly servers at all. So if you want to use git lfs ls-files that is one thing you need to deal with (make sure it's installed in omnibus etc.).
Regarding git lfs, I would install the binary so that it's in PATH, but nothing more. The recommended way to install it integrates it into your local git installation, and we don't want that on the Gitaly server. E.g. when we do a git rebase with a worktree, we don't want LFS to start downloading LFS blobs to the Gitaly server.
Update: I was playing with the GetLFSPointers endpoint, but it does not work for us because it only works for specific blob IDs. I also took a look at the GetAllLFSPointers endpoint, but it does not work for us either, because it needs a specific revision while we need all the refs at once.
Also, from what I see, we don't necessarily need git lfs to be installed. As @jacobvosmaer-gitlab suggested in another issue, we can use rev-list and cat-file to do this job; this is also what we do in GetAllLFSPointers. We may want to make revision a non-required parameter for GetAllLFSPointers, and then it should work for us. WDYT @jacobvosmaer-gitlab?
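To make that concrete, the rev-list + cat-file technique amounts to something like the rough Ruby sketch below (not Gitaly's actual implementation). LFS pointers are tiny text blobs, so you list every object reachable from any ref, keep only the small blobs, and then check their content for the LFS pointer signature; the 1024-byte cutoff and the method name are assumptions for illustration.

```ruby
require 'open3'

# Illustrative only: list blobs reachable from any ref that are small enough
# to be LFS pointer candidates. Real code would also inspect the blob content.
def candidate_lfs_pointer_oids(repo_path, size_cutoff: 1024)
  rev_list    = ['git', '-C', repo_path, 'rev-list', '--objects', '--all']
  batch_check = ['git', '-C', repo_path, 'cat-file',
                 '--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)']

  oids = []
  Open3.pipeline_r(rev_list, batch_check) do |out, _wait_threads|
    out.each_line do |line|
      type, oid, size, _path = line.split(' ', 4)
      oids << oid if type == 'blob' && size.to_i < size_cutoff
    end
  end
  oids
end
```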
Thanks for noticing that @vsizov ; if all we need to do is to tweak the git rev-list arguments in GetAllLFSPointers then it makes more sense to augment that RPC than to install a new dependency (git lfs) on the Gitaly servers.
We may want to make revision a non-required parameter for GetAllLFSPointers
I think it would be better to use a more explicit mechanism. For example we could add a field all_refs bool to GetAllLFSPointersRequest.
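Purely as an illustration of that proposal from the caller's side, assuming the field were added (it does not exist at this point, and gitaly_repository_message is a placeholder):

```ruby
# Hypothetical: assumes the proposed `all_refs` field has been added to
# GetAllLFSPointersRequest in the Gitaly proto definitions.
request = Gitaly::GetAllLFSPointersRequest.new(
  repository: gitaly_repository_message, # placeholder for the Gitaly::Repository message
  all_refs: true                          # scan every ref instead of a single revision
)
```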
@jacobvosmaer-gitlab This one gitaly!1414 (comment 258929179) is slightly related. I think I found a flaw in the new Go implementation that is behind the feature flag. But I think I need to go with the classic implementation anyway and extend it first.
Update
This issue is full of surprises. Today I found that Gitaly's GetAllLFSPointers endpoint has a bug: it ignores the revision parameter. And this is exactly the behavior we need. All the code that uses this method already expects this behavior, so we can call it with a placeholder revision of HEAD. I created an issue to fix that: gitaly#2247 (closed). That means we can implement this feature pretty easily. I should be able to send it to review tomorrow.
When this task is run, it may reveal another problem with fork networks. We need to make this task opt-in rather than running automatically, and we need to make the user aware of the potential problem they may be exposing themselves to. Where should this warning be added?
Another option is to wait until we handle #20042 (closed) properly. I don't think we even need to ask for confirmation from the admin because making data inconsistent is something we should never do.
Same here, I don't think we should alter the LFS removal code based on an unfixed bug that is being worked on. #20042 (closed) is also scheduled for %12.7 so in an ideal world, this should just be released at the same time.
We should maybe link those two explicitly though via MR dependencies?