
LFS: The `batch request` for download should prefer to use direct object storage access

Problem

Because of how GitLab serves /lfs/objects/batch, every batch triggers a storm of LFS object fetches, which is especially problematic for repositories with many LFS objects. The storm consists of many very small requests, each of which only re-checks whether the user/token has permission to access the object and then redirects to the file.

In the most efficient setup, LFS objects are stored externally (on Object Storage, configured so that clients download them directly from the S3 bucket). In that setup these requests are redundant, since their only purpose is to redirect to the externally stored object with an HTTP 302 response and a Location: header.

Proposal

Make /lfs/objects/batch support direct download of objects:

  1. serve a pre-signed URL to the S3 bucket instead of a GitLab URL that serves the object
  2. limit the pre-signed URL to the same duration as the one returned when fetching via /gitlab-lfs/objects/sha256oid
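
A minimal sketch of what such a batch response could look like. Everything here is illustrative: `batch_download_response`, `PRESIGN`, the bucket URL and the 600-second lifetime are hypothetical names and values, and `existing_oids` stands in for the permission and existence checks that GitLab performs. The `expires_in` field comes from the Git LFS batch API; a real implementation would ask the Object Storage client for the pre-signed URL.

```ruby
require 'json'

# Hypothetical stand-in for the Object Storage client. It only imitates the
# shape of a time-limited URL; it does not produce a valid S3 signature.
PRESIGN = lambda do |oid, expires_in|
  "https://bucket.example.com/lfs/#{oid}?expires_in=#{expires_in}&signature=placeholder"
end

# Builds the body of a batch download response that points straight at
# Object Storage. `existing_oids` (assumption) lists the objects the caller
# is allowed to download and that exist.
def batch_download_response(objects, existing_oids:, expires_in:, presign: PRESIGN)
  entries = objects.map do |object|
    oid = object.fetch('oid')
    entry = { 'oid' => oid, 'size' => object.fetch('size') }

    if existing_oids.include?(oid)
      # No Authorization header: the URL itself carries the time-limited grant.
      entry.merge(
        'authenticated' => true,
        'actions' => {
          'download' => { 'href' => presign.call(oid, expires_in), 'expires_in' => expires_in }
        }
      )
    else
      entry.merge(
        'error' => {
          'code' => 404,
          'message' => "Object does not exist on the server or you don't have permissions to access it"
        }
      )
    end
  end

  { 'objects' => entries }
end

requested = [{ 'oid' => 'aaa', 'size' => 1000 }, { 'oid' => 'bbb', 'size' => 2000 }]
puts JSON.pretty_generate(
  batch_download_response(requested, existing_oids: ['aaa'], expires_in: 600)
)
```

With a response of this shape, git-lfs would fetch each object directly from the bucket and never issue the per-object request to GitLab.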

By default, git-lfs sends batch requests of 100 objects at a time. This seems to be a sane number of pre-signed URLs to serve per batch response.
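
To give a rough sense of the savings, the sketch below counts requests for a download of many objects. It assumes one batch call per 100 objects and, in the current flow, one GitLab request per object before the redirect to storage. `request_counts` and the 10,000-object figure are hypothetical and only serve as an illustration.

```ruby
BATCH_SIZE = 100 # git-lfs default batch size, as stated above

# Counts the requests hitting GitLab and Object Storage for one download.
def request_counts(object_count, batch_size: BATCH_SIZE)
  batches = (object_count + batch_size - 1) / batch_size # integer ceiling

  {
    batch_requests: batches,
    # Current flow: every object costs one extra GitLab request that only redirects.
    current_gitlab_requests: batches + object_count,
    # Proposed flow: GitLab only answers the batch requests.
    proposed_gitlab_requests: batches,
    # Object Storage serves each object once in both flows.
    storage_requests: object_count
  }
end

puts request_counts(10_000)
```

Under these assumptions, a repository with 10,000 LFS objects goes from 10,100 GitLab requests to 100.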

Security considerations

This should not pose any additional risk, since we already serve a time-limited pre-signed URL via /gitlab-lfs/objects/sha256oid. The only difference is that we would perform all of those checks up front, in the batch request, and cut the intermediate request out of the lifecycle.

This change would only affect download requests; uploads would continue to go through GitLab, which proxies them to Object Storage.

Details

When git-lfs fetches or uploads objects for a repository, it uses two operations:

  • A batch request (100 objects at a time) to /path/to/repo.git/info/lfs/objects/batch, listing all objects to download or upload
  • A request per object to /path/to/repo.git/gitlab-lfs/objects/sha256oid, the URL that GitLab returns for each object in the batch response

The /lfs/objects/batch

  1. git-lfs sends a request of the following form:
{
  'operation' => 'download',
  'objects' => [
    { 'oid' => '91eff75a492a3ed0dfcb544d7f31326bc4014c8551849c192fd1e48d4dd2c897',
      'size' => 1000 }
  ]
}
  2. GitLab returns a response of the following form:
{
  'objects' => [
    { 'oid' => '91eff75a492a3ed0dfcb544d7f31326bc4014c8551849c192fd1e48d4dd2c897',
      'size' => 1000,
      'actions' => {
        'download' => {
          'href' => "https://example.com/path/to/repo/gitlab-lfs/objects/91eff75a492a3ed0dfcb544d7f31326bc4014c8551849c192fd1e48d4dd2c897",
          'header' => {
            'Authorization' => authorization_header
          }.compact
        }
      },
      'authenticated' => true,
      # In the Git LFS batch API, 'error' takes the place of 'actions'
      # when the object cannot be served.
      'error' => {
        'code' => 404,
        'message' => "Object does not exist on the server or you don't have permissions to access it"
      }
    }
  ]
}

The /path/to/repo.git/gitlab-lfs/objects/sha256oid

  1. git-lfs issues GET /path/to/repo.git/gitlab-lfs/objects/sha256oid with the headers from the batch response.
  2. GitLab checks the user/token permissions and checks that the object physically exists on Object Storage (OS).
  3. Depending on where the LFS object is located (local file, or Object Storage with proxy_download: false), GitLab either returns the object or responds with a redirect (Location: header). When the object is proxied, Workhorse serves it.
module Repositories
  class LfsStorageController < Repositories::GitHttpClientController
    def download
      lfs_object = LfsObject.find_by_oid(oid)
      unless lfs_object && lfs_object.file.exists?
        render_lfs_not_found
        return
      end

      send_upload(lfs_object.file, send_params: { content_type: "application/octet-stream" })
    end
  end
end
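
The three steps above can be summarised as a small decision function. This is a simplified sketch, not GitLab's code: `StoredObject`, `download_outcome` and the returned symbols are hypothetical, and the real logic lives in the controller's `send_upload` path and in Workhorse.

```ruby
# Hypothetical value object describing where an LFS object is stored.
StoredObject = Struct.new(:oid, :location, :exists, keyword_init: true)

# Returns a symbol describing how a per-object download request is answered.
# `permitted` stands in for the user/token permission check (assumption).
def download_outcome(stored, permitted:, proxy_download:)
  return :access_denied unless permitted
  return :not_found unless stored && stored.exists

  case stored.location
  when :local
    :serve_object # GitLab returns the file itself
  when :object_storage
    # proxy_download: false means the client is redirected to the bucket.
    proxy_download ? :proxy_via_workhorse : :redirect_to_storage
  else
    :not_found
  end
end

remote = StoredObject.new(oid: 'aaa', location: :object_storage, exists: true)
puts download_outcome(remote, permitted: true, proxy_download: false)
```

The `:redirect_to_storage` branch is the one the proposal removes from the per-object path, by handing out the final URL in the batch response instead.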
Edited by Kamil Trzciński