LFS: The `batch request` for download should prefer to use direct object storage access
Problem
Because of how GitLab serves /lfs/objects/batch, a fetch triggers an LFS object fetch storm, which is especially problematic for repositories with many LFS objects. The client issues a large number of very small requests, each of which only re-checks whether the user/token has permission to access the object and then redirects to the file.
In the most efficient setup the LFS objects are stored externally (on Object Storage, configured for direct download from the S3 bucket). That makes these requests redundant, since their only purpose is to redirect to the externally stored object with HTTP 302 and a Location: header.
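To make the scale concrete, here is a rough back-of-the-envelope model in Ruby. The method name `lfs_request_counts` and the constant `LFS_BATCH_SIZE` are made up for illustration; the batch size of 100 is the git-lfs default mentioned in this issue, and the model assumes every object lives on Object Storage with direct download enabled.

```ruby
# git-lfs default batch size, as stated in this issue.
LFS_BATCH_SIZE = 100

# Counts the requests needed to fetch `object_count` LFS objects.
# direct_download: true models batch responses that already point at
# Object Storage, so the per-object GitLab request disappears.
def lfs_request_counts(object_count, direct_download:)
  # Ceiling division without floats.
  batch_requests = (object_count + LFS_BATCH_SIZE - 1) / LFS_BATCH_SIZE
  # Today each object costs one GitLab request (permission re-check + 302)
  # before the actual Object Storage request.
  per_object_gitlab_requests = direct_download ? 0 : object_count
  {
    gitlab: batch_requests + per_object_gitlab_requests,
    object_storage: object_count
  }
end

current  = lfs_request_counts(10_000, direct_download: false)
proposed = lfs_request_counts(10_000, direct_download: true)
puts "GitLab requests today: #{current[:gitlab]}, with direct download: #{proposed[:gitlab]}"
```

Under these assumptions the number of requests hitting GitLab drops from one per object plus one per batch to one per batch, while the Object Storage traffic stays the same.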
Proposal
Make /lfs/objects/batch support direct download of objects:
- serve a pre-signed URL to the S3 bucket instead of a GitLab URL that serves the object
- time-limit the pre-signed URL to the same duration as the one obtained via /gitlab-lfs/objects/sha256oid
By default, git-lfs uses batch requests of 100 objects at a time. This seems to be a sane number of pre-signed URLs to serve per batch response.
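A minimal sketch of how the download action in the batch response could be chosen, assuming a flag that says whether direct download is enabled. `LfsObjectStub`, `presigned_url_for`, `bucket.example.com`, the query parameter names and the `direct_download` flag are all hypothetical and not GitLab's actual code; `expires_in` is the optional per-action field defined by the Git LFS batch API.

```ruby
# Hypothetical stand-in for an LFS object record; not GitLab's real model.
LfsObjectStub = Struct.new(:oid, :size, :remote, keyword_init: true)

# Hypothetical signer. A real implementation would ask the Object Storage
# client for a pre-signed GET URL; this only builds a placeholder string.
def presigned_url_for(oid, expires_in:)
  "https://bucket.example.com/lfs/#{oid}?X-Expires=#{expires_in}&X-Signature=PLACEHOLDER"
end

# Builds the 'download' action for one object in the batch response.
def download_action(object, repo_url:, authorization_header:, direct_download:, expires_in:)
  if direct_download && object.remote
    # No Authorization header here: the signature in the URL is the credential.
    {
      'href' => presigned_url_for(object.oid, expires_in: expires_in),
      'expires_in' => expires_in
    }
  else
    # Current behaviour: point the client back at GitLab.
    {
      'href' => "#{repo_url}/gitlab-lfs/objects/#{object.oid}",
      'header' => { 'Authorization' => authorization_header }.compact
    }
  end
end

object = LfsObjectStub.new(oid: 'abc123', size: 1000, remote: true)
action = download_action(object,
                         repo_url: 'https://example.com/path/to/repo.git',
                         authorization_header: 'Basic placeholder',
                         direct_download: true,
                         expires_in: 600) # arbitrary value for the example
puts action['href']
```

Objects kept on local storage would still fall through to the existing GitLab URL, so the change stays opt-in per storage configuration.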
Security considerations
This should not pose any additional risk, since we already serve time-limited pre-signed URLs via /gitlab-lfs/objects/sha256oid.
The only difference is that we would perform all those checks up front, in the batch request, and cut the intermediate request out of the lifecycle.
This change would only affect download requests; upload requests would continue to go through GitLab, which proxies them to Object Storage.
Details
When git-lfs fetches or uploads objects for a repository, it uses two operations:
- It calls /path/to/repo.git/info/lfs/objects/batch, in batches of 100, to request all objects to download or upload.
- GitLab responds with a /path/to/repo.git/gitlab-lfs/objects/sha256oid URL for each object, which git-lfs then requests individually.
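The batching step can be sketched from the client's point of view as follows. This is an illustration only, not git-lfs code: `batch_payloads` is a hypothetical helper, and the generated OIDs are zero-padded placeholders rather than real SHA-256 digests.

```ruby
require 'json'

# Splits the wanted objects into batch request bodies of at most
# `batch_limit` objects each (100 is the git-lfs default per this issue).
def batch_payloads(objects, operation: 'download', batch_limit: 100)
  objects.each_slice(batch_limit).map do |slice|
    JSON.generate('operation' => operation, 'objects' => slice)
  end
end

# 250 placeholder objects; the OIDs are NOT real SHA-256 digests.
objects = Array.new(250) { |i| { 'oid' => format('%064x', i), 'size' => 1000 } }

payloads = batch_payloads(objects)
puts "#{payloads.size} batch requests for #{objects.size} objects"
```

Each payload would be sent to /path/to/repo.git/info/lfs/objects/batch; every object in the response then costs one further request to the returned URL.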
/lfs/objects/batch
- git-lfs sends a request of the following form:
  {
    'operation' => 'download',
    'objects' => [
      { 'oid' => '91eff75a492a3ed0dfcb544d7f31326bc4014c8551849c192fd1e48d4dd2c897',
        'size' => 1000 }
    ]
  }
- GitLab responds with:
  {
    'objects' => [
      {
        'oid' => '91eff75a492a3ed0dfcb544d7f31326bc4014c8551849c192fd1e48d4dd2c897',
        'size' => 1000,
        'actions' => {
          'download' => {
            'href' => "https://example.com/path/to/repo/gitlab-lfs/objects/91eff75a492a3ed0dfcb544d7f31326bc4014c8551849c192fd1e48d4dd2c897",
            'header' => {
              'Authorization' => authorization_header
            }.compact
          }
        },
        'authenticated' => true,
        # 'error' takes the place of 'actions' when the object cannot be served
        'error' => {
          'code' => 404,
          'message' => "Object does not exist on the server or you don't have permissions to access it"
        }
      }
    ]
  }
/path/to/repo.git/gitlab-lfs/objects/sha256oid
- git-lfs simply issues GET /path/to/repo.git/gitlab-lfs/objects/sha256oid with the headers passed in the batch response.
- GitLab checks the user/token permissions and checks that the object physically exists on Object Storage (OS).
- Depending on where the LFS object is located (local file, or Object Storage with proxy_download: false), GitLab either returns the object or responds with a Location: redirect. If the download is proxied, the object is served by Workhorse.
module Repositories
  class LfsStorageController < Repositories::GitHttpClientController
    def download
      lfs_object = LfsObject.find_by_oid(oid)

      # Respond with "not found" if the record or the stored file is missing.
      unless lfs_object && lfs_object.file.exists?
        render_lfs_not_found
        return
      end

      # Either serves the object or redirects to it, depending on where it is stored.
      send_upload(lfs_object.file, send_params: { content_type: "application/octet-stream" })
    end
  end
end
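The three outcomes described above (local file, proxied download, redirect) can be sketched as a standalone function. This is an assumption about the shape of the decision, not the body of `send_upload`; `StoredFile` and `download_strategy` are hypothetical names introduced for the example.

```ruby
# Hypothetical description of a stored file; not GitLab's uploader class.
StoredFile = Struct.new(:local, :proxy_download, :url, keyword_init: true)

# Returns [strategy, url] describing how the download request is answered.
def download_strategy(file)
  if file.local
    # Local file: the object is returned in the response itself.
    [:send_file, nil]
  elsif file.proxy_download
    # Object Storage with proxying: Workhorse fetches and streams the object.
    [:workhorse_proxy, file.url]
  else
    # Object Storage with proxy_download: false: respond with a redirect.
    [:redirect, file.url]
  end
end

remote = StoredFile.new(local: false, proxy_download: false,
                        url: 'https://bucket.example.com/lfs/abc123?signature=PLACEHOLDER')
strategy, location = download_strategy(remote)
puts "#{strategy} -> #{location}"
```

Only the `:redirect` branch is the one this proposal would short-circuit, by handing the same URL to the client in the batch response.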