Users or admins run up against storage limits, and realize that they are using a lot of storage on LFS objects they no longer want or need. Then they find there is no way to remove them from the GitLab-managed LFS storage without deleting the project.
The Git repo is the SSOT (single source of truth) for which LFS objects are still referenced, but the data itself lives elsewhere and is only tracked per project. So to know whether you can delete the data, you either have to scan the repo or have been tracking the pointers the whole time.
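For context, a minimal sketch of how that per-project tracking looks from the Rails side (illustrative only; the project path is made up): the database knows which LFS objects a project has and how much they count against storage, but only the Git history knows which OIDs are still referenced.

```ruby
# Illustrative only: the database tracks which OIDs a project *has*,
# not which OIDs its Git history still *references*.
project = Project.find_by_full_path('group/project') # hypothetical path

project.lfs_objects.count      # LFS objects attached to this project
project.lfs_objects.sum(:size) # what counts against LFS storage usage
```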
Ideal solution
Track pointers the whole time and automatically delete objects when unreferenced. It is technically possible, and it is the most user-friendly solution. We started down this road in gitlab-foss!14479 (closed). But this is a large, complex MR, and from my limited understanding, more than weight 5, especially once performance validation etc. is included. There are apparently a lot of performance risks and pitfalls, and it looks like existing repos weren't handled yet.
Proposal
Rather than focusing on the ideal solution first, we can iterate using a boring solution:
Create a rake task to clean a single project in a non-performant way. Note that this would also allow us to validate our performance concerns. Maybe from there we could implement a way for project Maintainers to queue a cleanup, and run those in single file (one at a time). It's not ideal, but users would at least have some recourse, and it is similar to how we handle other cleanup tasks.
Needs to be well tested, because otherwise it may remove data that is still needed.
What does success look like, and how can we measure that?
A systems administrator can schedule sudo gitlab-rake gitlab:cleanup:lfs_files on a per-project basis, e.g. over the weekend, to remove LFS files that are no longer required.
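To make the shape of that concrete, here is a rough sketch of how such a per-project task might be structured. It is not the actual implementation: referenced_lfs_oids is a hypothetical helper standing in for whatever mechanism ends up listing every LFS pointer reachable from any ref.

```ruby
# Rough sketch only; `referenced_lfs_oids` is a placeholder, not real GitLab code.
namespace :gitlab do
  namespace :cleanup do
    desc 'Remove LFS objects that are no longer referenced in a project'
    task :lfs_files, [:project_path] => :environment do |_t, args|
      project = Project.find_by_full_path(args[:project_path])
      abort 'Project not found' unless project

      referenced_oids = referenced_lfs_oids(project) # slow: scans the full history

      project.lfs_objects.where.not(oid: referenced_oids).find_each do |lfs_object|
        # Detach the object from this project only; whether the file itself can be
        # deleted also depends on other projects (e.g. forks) sharing the same OID.
        project.lfs_objects.delete(lfs_object)
      end
    end
  end
end
```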
@fzimmer This needs 4 MRs, so unfortunately a decent amount of overhead, but the two "install git-lfs" MRs should be small, and the overall concept seems straightforward. So the time from start to completion may be long, but I estimate the total time spent around weight 4 or 5.
Thanks @mkozono! Are you happy to still leave it as one issue? I think having four MRs against it and utilising a checklist is a bit more straightforward than creating an epic. Less overhead.
Got one more suggestion for this task... I've found that Geo sometimes thinks that a file is replicated but it doesn't exist on disk. I think this is a hangover from the cutover from the old Geo replication methodology to the current one, but I'm finding a few LFS files that Geo::FileRegistry thinks are replicated but which are not on disk, e.g.
I've found that Geo sometimes thinks that a file is replicated but it doesn't exist on disk.
@pherlihy This sounds like a distinctly different problem. This issue's proposal makes sense to run only on a primary. I think your comment here is worthy of a separate issue and investigation by group::geo since it is a bug for DR.
For transparency, while this issue is marked as group::geo (as the Geo Team will perform the work), this work is being done in the domain of group::source code. (/cc @m_gill)
Since this rake task will be deleting files, we need to ensure that a robust set of test cases is in place around this work. @jennielouie Making sure you are aware of this issue, as it would be good to have your opinion on the tests here too.
Not a silly question, but @fzimmer is correct, we can't rely on git lfs prune because we also need to handle non-local files. Also I think it would orphan lfs_objects records.
Sorry @rnienaber, I remembered wrong. The main reason I think we can't use git lfs prune is that it's designed to be run by an individual user, with the assumption that all the LFS files are available in the remote.
Example:
I'm using a repo which has lots of LFS files in its history. I run out of local disk space, so I run git lfs prune, which deletes all LFS files that are not referenced in the current state of the codebase.
But if I check out an old commit that references an old LFS file, I can pull all those old LFS files down from the remote.
So, if I am the remote, then I must keep all LFS files that are referenced at any point in the repo's history; otherwise we have dead/invalid references.
To add more context so you don't have to go back too far: I ask because my understanding is that Gitaly is supposed to protect itself from harm, and I assume this command may be slower on some projects than some Gitaly timeout.
However, we accept in cases like this (see other cleanup rake tasks that find all files of a kind) that an administrator may need to do something slow to correct a situation. This rake task would initially be written to operate on a single project at a time. If a sysadmin chooses to write a script to run it on all projects, that is up to them, and whether they think their infrastructure can take it. This rake task could also be used as a way to iterate on improving performance or reducing load, while providing immediate value for many sysadmins.
We currently don't install git-lfs on Gitaly servers at all. So if you want to use git lfs ls-files that is one thing you need to deal with (make sure it's installed in omnibus etc.).
@mkozono There is no reason we can't have slow RPC calls. The crucial questions are: when do we make those calls, and how do we avoid running the same slow call 1000 times (in a row or in parallel) if somebody e.g. pushes 1000 branches? That is: do it in Sidekiq, and manage how often you do it per repo.
Some of the existing LFS tracking stuff runs during Git push hooks, which is just a plain bad idea: Git hook validations are implemented as HTTP API calls, which have stricter timeouts than Sidekiq jobs, and these calls run synchronously during the push, so the user has to sit and wait for these calls to finish before their local git push command is done.
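For illustration, that pattern (queue the slow call in Sidekiq and throttle it per repository) might look roughly like the sketch below; the worker name, lease key, and scan_lfs_pointers helper are placeholders, not real GitLab code.

```ruby
# Hedged sketch of "do it in Sidekiq, and manage how often you do it per repo".
class LfsPointerScanWorker
  include ApplicationWorker

  LEASE_TIMEOUT = 1.hour

  def perform(project_id)
    # A per-project exclusive lease means 1000 pushed branches still trigger
    # at most one expensive scan per lease window.
    lease = Gitlab::ExclusiveLease.new("lfs_pointer_scan:#{project_id}",
                                       timeout: LEASE_TIMEOUT.to_i)
    return unless lease.try_obtain

    project = Project.find_by_id(project_id)
    return unless project

    scan_lfs_pointers(project) # placeholder: the slow RPC call lives here
  end
end
```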
We currently don't install git-lfs on Gitaly servers at all. So if you want to use git lfs ls-files that is one thing you need to deal with (make sure it's installed in omnibus etc.).
Regarding git lfs, I would install the binary so that it's in PATH, but nothing more. The recommended way to install it integrates it into your local git installation, and we don't want that on the Gitaly server. E.g. when we do a git rebase with a worktree, we don't want LFS to start downloading LFS blobs to the Gitaly server.
Update: I was playing with the GetLFSPointers endpoint, but it does not work for us because it only works for specific blob IDs. I also took a look at the GetAllLFSPointers endpoint, but it does not work for us either, because it needs a specific revision while we need all the refs at once.
Also, from what I see, we don't necessarily need git lfs to be installed. As @jacobvosmaer-gitlab suggested in another issue, we can use rev-list and cat-file to do this job; this is also what we do in GetAllLFSPointers. We may want to make revision a non-required parameter for GetAllLFSPointers, and then it should work for us. WDYT @jacobvosmaer-gitlab?
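To make that concrete, the rev-list + cat-file technique amounts to something like the rough Ruby sketch below (not Gitaly's actual implementation). LFS pointers are tiny text blobs, so you list every object reachable from any ref, keep only the small blobs, and then check their content for the LFS pointer signature; the 1024-byte cutoff and the method name are assumptions for illustration.

```ruby
require 'open3'

# Illustrative only: list blobs reachable from any ref that are small enough
# to be LFS pointer candidates. Real code would also inspect the blob content.
def candidate_lfs_pointer_oids(repo_path, size_cutoff: 1024)
  rev_list    = ['git', '-C', repo_path, 'rev-list', '--objects', '--all']
  batch_check = ['git', '-C', repo_path, 'cat-file',
                 '--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)']

  oids = []
  Open3.pipeline_r(rev_list, batch_check) do |out, _wait_threads|
    out.each_line do |line|
      type, oid, size, _path = line.split(' ', 4)
      oids << oid if type == 'blob' && size.to_i < size_cutoff
    end
  end
  oids
end
```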
Thanks for noticing that @vsizov ; if all we need to do is to tweak the git rev-list arguments in GetAllLFSPointers then it makes more sense to augment that RPC than to install a new dependency (git lfs) on the Gitaly servers.
We may want to make revision a non-required parameter for GetAllLFSPointers
I think it would be better to use a more explicit mechanism. For example we could add a field all_refs bool to GetAllLFSPointersRequest.
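Purely as an illustration of that proposal from the caller's side, assuming the field were added (it does not exist at this point, and gitaly_repository_message is a placeholder):

```ruby
# Hypothetical: assumes the proposed `all_refs` field has been added to
# GetAllLFSPointersRequest in the Gitaly proto definitions.
request = Gitaly::GetAllLFSPointersRequest.new(
  repository: gitaly_repository_message, # placeholder for the Gitaly::Repository message
  all_refs: true                          # scan every ref instead of a single revision
)
```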
@jacobvosmaer-gitlab This one gitaly!1414 (comment 258929179) is slightly related. I think I found a flaw in the new Go implementation that is behind the feature flag. But I think I need to go with the classic implementation anyway and extend it first.
Update
This issue is full of surprises. Today I found that Gitaly's GetAllLFSPointers endpoint has a bug: it ignores the revision parameter. And this is exactly the behavior we need. All the code that uses this method already expects this behavior, so we can call it with a placeholder revision of HEAD. I created an issue to fix that: gitaly#2247 (closed). That means we can implement this feature pretty easily. I should be able to send it to review tomorrow.
When this task is run, it may reveal another problem with fork networks. We need to make this task opt-in rather than running automatically, and we need to make the user aware of the potential problem they may be exposing themselves to. Where should this warning be added?
Another option is to wait until we handle #20042 (closed) properly. I don't think we even need to ask for confirmation from the admin because making data inconsistent is something we should never do.
Same here, I don't think we should alter the LFS removal code based on an unfixed bug that is being worked on. #20042 (closed) is also scheduled for %12.7 so in an ideal world, this should just be released at the same time.
We should maybe link those two explicitly though via MR dependencies?