Implement S3-compatible uploads for unknown Content-Length
New description proposal
Based on gitlab-org/gitlab-ee#4184 we know that only GCS (Google Cloud Storage) accepts chunked transfer encoding; every other S3-compatible Object Storage requires the Content-Length HTTP header.
In order to support S3-compatible Object Storage we need to split the incoming file into parts that will be uploaded as different objects; then we can leverage the MultipartUpload API calls.
Without this feature, only LFS can take advantage of workhorse direct uploads; artifacts and user uploads require this to be implemented in order to work on S3-compatible object storage.
This issue covers artifact uploads only, but the outcome will be useful when implementing gitlab-org/gitlab-ce#44663.
Note: only for the S3-compatible API, we still need to dump a copy of the file to the local disk in order to extract artifacts metadata; future development may improve this by implementing a multi-object HTTPReadSeeker in workhorse.
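For reference, here is a minimal sketch of the single-object case, assuming the object's size is already known (for example from a HEAD request); the type and field names are illustrative and not actual workhorse code. A multi-object version would additionally have to map each offset onto the right part.

```go
package httprs

import (
	"fmt"
	"io"
	"net/http"
)

// httpReadSeeker exposes a remote object as an io.ReadSeeker by
// translating every Read into an HTTP Range request.
type httpReadSeeker struct {
	url    string
	size   int64 // object size, known up front
	offset int64
}

func (r *httpReadSeeker) Read(p []byte) (int, error) {
	if r.offset >= r.size {
		return 0, io.EOF
	}
	end := r.offset + int64(len(p)) - 1
	if end >= r.size {
		end = r.size - 1
	}
	req, err := http.NewRequest("GET", r.url, nil)
	if err != nil {
		return 0, err
	}
	// Fetch only the bytes backing this Read call.
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", r.offset, end))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusPartialContent {
		return 0, fmt.Errorf("range request: %s", resp.Status)
	}
	n, err := io.ReadFull(resp.Body, p[:end-r.offset+1])
	r.offset += int64(n)
	if err == io.ErrUnexpectedEOF {
		err = nil // short read: report what we got
	}
	return n, err
}

func (r *httpReadSeeker) Seek(offset int64, whence int) (int64, error) {
	switch whence {
	case io.SeekStart:
		r.offset = offset
	case io.SeekCurrent:
		r.offset += offset
	case io.SeekEnd:
		r.offset = r.size + offset
	}
	if r.offset < 0 {
		return 0, fmt.Errorf("negative position")
	}
	return r.offset, nil
}
```

This is what the sequence diagram below means by "seek and read the zip table": only the end of the remote archive is fetched, not the whole file.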
What needs to be changed
- The `/authorize` reply must declare whether workhorse has to split the upload into multiple parts; we will use a `MultipartUpload` key to explicitly set how to split the upload. If the key is missing, workhorse will assume that no split is required.
- The `/authorize` call, when Object Storage is on AWS, must Initiate Multipart Upload and generate a bunch of pre-signed URLs for the needed operations (UploadPart, CompleteMultipartUpload, AbortMultipartUpload; see the data structure changes below).
- The `/finalize` call may include a list of parts (optionally with ETags) instead of just a single URL; when the list of parts is present, Rails will perform a `MultipartUpload` based on the provided chunks.
- When a `MultipartUpload` is expected, workhorse must upload chunks sequentially. The part size must be known before uploading, so workhorse will write one part on disk, seek to the beginning, and upload it; this file will be overwritten by each part. (Maximum disk usage: `MultipartUpload.PartSize` * concurrent uploads.) A sketch of this loop follows the list.
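A minimal sketch of that loop, assuming presigned part URLs and a presigned CompleteMultipartUpload URL as proposed below; `uploadParts` and the XML helper types are illustrative names, not the final workhorse implementation:

```go
package upload

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"net/http"
	"os"
)

// completeMultipartUpload mirrors the XML body S3 expects on the
// CompleteMultipartUpload call.
type completeMultipartUpload struct {
	XMLName xml.Name `xml:"CompleteMultipartUpload"`
	Parts   []completedPart
}

type completedPart struct {
	XMLName    xml.Name `xml:"Part"`
	PartNumber int
	ETag       string
}

// uploadParts buffers each part in one reusable temporary file so that
// its exact size is known before the PUT (S3 rejects parts without
// Content-Length), then completes the multipart upload with the
// collected ETags. Maximum disk usage is one part per concurrent upload.
func uploadParts(src io.Reader, partSize int64, partURLs []string, completeURL string) error {
	tmp, err := os.CreateTemp("", "part")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name())
	defer tmp.Close()

	var parts []completedPart
	for i, url := range partURLs {
		// Rewind and truncate: the file holds only the current part.
		if _, err := tmp.Seek(0, io.SeekStart); err != nil {
			return err
		}
		if err := tmp.Truncate(0); err != nil {
			return err
		}
		n, err := io.CopyN(tmp, src, partSize)
		if err != nil && err != io.EOF {
			return err
		}
		if n == 0 {
			break // source exhausted exactly on a part boundary
		}
		if _, err := tmp.Seek(0, io.SeekStart); err != nil {
			return err
		}
		// LimitReader keeps the HTTP client from closing tmp, so the
		// same file can be reused for the next part.
		req, err := http.NewRequest("PUT", url, io.LimitReader(tmp, n))
		if err != nil {
			return err
		}
		req.ContentLength = n // the exact size, known only now
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return err
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("uploading part %d: %s", i+1, resp.Status)
		}
		parts = append(parts, completedPart{PartNumber: i + 1, ETag: resp.Header.Get("ETag")})
	}

	// Ask S3 to glue the parts into a single object.
	body, err := xml.Marshal(completeMultipartUpload{Parts: parts})
	if err != nil {
		return err
	}
	req, err := http.NewRequest("POST", completeURL, bytes.NewReader(body))
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("completing upload: %s", resp.Status)
	}
	return nil
}
```

Note that on AWS every part except the last must be at least 5 MiB, which constrains the `PartSize` values Rails can pick.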
Example sequence diagram
The diagram is not completely accurate; it is only meant to explain the interactions and API calls.
```mermaid
sequenceDiagram
    participant r as gitlab-runner
    participant w as gitlab-workhorse
    participant u as gitlab-unicorn
    participant os as Object Storage

    r->>+w: upload_artifact
    w->>+u: authorize
    alt provider == 'AWS'
        u->>+os: InitiateMultipartUpload
        os-->>-u: uploadId
    end
    Note over u: generate a bunch of presigned URLs
    u-->>-w: authorisation
    alt authorisation.RemoteObject.MultipartUpload == nil
        Note over w,os: Only on GCS we can upload without Content-Length (we need to use chunked-encoding)
        w->>+os: PutObject
        os-->>-w: result
    else
        loop every authorisation.RemoteObject.MultipartUpload.PartSize MB of file
            w->>+os: UploadPart
            os-->>-w: eTag
        end
        w->>+os: CompleteMultipartUpload(ETags)
        os-->>-w: result
        opt something went wrong
            w->>+os: AbortMultipartUpload
            os-->>-w: aborted
        end
    end
    w->>+os: extract metadata using HTTPReadSeeker
    os-->>-w: seek and read the zip table
    w->>+u: finalize
    u->>+os: Copy object to the final location
    os-->>-u: done
    u-->>-r: operation result
    Note over w: Now we can remove the local file (if present)
    w->>+os: RemoveObject
    os-->>-w: removed
    deactivate w
```
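In GitLab, the presigned URLs above would be produced by Rails during `/authorize`; for illustration, the equivalent calls expressed with aws-sdk-go look roughly like this (bucket, key, and part count are made-up parameters, not the actual Rails implementation):

```go
package presign

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// presignMultipart initiates a multipart upload and presigns one URL
// per part plus the CompleteMultipartUpload and AbortMultipartUpload
// calls, all tied together by the same uploadId.
func presignMultipart(bucket, key string, parts int) (partURLs []string, completeURL, abortURL string, err error) {
	svc := s3.New(session.Must(session.NewSession()))

	mu, err := svc.CreateMultipartUpload(&s3.CreateMultipartUploadInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return nil, "", "", err
	}

	for n := 1; n <= parts; n++ {
		req, _ := svc.UploadPartRequest(&s3.UploadPartInput{
			Bucket:     aws.String(bucket),
			Key:        aws.String(key),
			UploadId:   mu.UploadId,
			PartNumber: aws.Int64(int64(n)),
		})
		u, err := req.Presign(1 * time.Hour)
		if err != nil {
			return nil, "", "", err
		}
		partURLs = append(partURLs, u)
	}

	creq, _ := svc.CompleteMultipartUploadRequest(&s3.CompleteMultipartUploadInput{
		Bucket:   aws.String(bucket),
		Key:      aws.String(key),
		UploadId: mu.UploadId,
	})
	if completeURL, err = creq.Presign(1 * time.Hour); err != nil {
		return nil, "", "", err
	}

	areq, _ := svc.AbortMultipartUploadRequest(&s3.AbortMultipartUploadInput{
		Bucket:   aws.String(bucket),
		Key:      aws.String(key),
		UploadId: mu.UploadId,
	})
	if abortURL, err = areq.Presign(1 * time.Hour); err != nil {
		return nil, "", "", err
	}
	return partURLs, completeURL, abortURL, nil
}
```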
Data Structure changes
`RemoteObject` (part of the `/authorize` answer) will introduce a `MultipartUpload` key of type `MultipartUploadParams`:
```go
type MultipartUploadParams struct {
	// PartSize is the exact size of each uploaded part. Only the last one can be smaller
	PartSize int64
	// PartsURL contains the presigned URLs for each part
	PartsURL []string
	// CompleteURL is a presigned URL for CompleteMultipartUpload
	CompleteURL string
	// AbortURL is a presigned URL for AbortMultipartUpload
	AbortURL string
}

type RemoteObject struct {
	// GetURL is an S3 GetObject URL
	GetURL string
	// DeleteURL is a presigned S3 RemoveObject URL
	DeleteURL string
	// StoreURL is the temporary presigned S3 PutObject URL to which the first found file will be uploaded
	StoreURL string
	// ID is a unique identifier of object storage upload
	ID string
	// Timeout is a number that represents timeout in seconds for sending data to StoreURL
	Timeout int
	// MultipartUpload contains presigned URLs for S3 MultipartUpload
	MultipartUpload *MultipartUploadParams
}
```
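For illustration, workhorse's saving path could then branch on the new key roughly like this; `putObject` and `uploadParts` are hypothetical helpers (the latter sketched earlier in this issue), not final code:

```go
// Hypothetical glue code: choose between a single PutObject and the
// multipart path based on the /authorize answer.
func storeFile(src io.Reader, remote RemoteObject) error {
	if remote.MultipartUpload == nil {
		// No split requested: single PutObject using chunked
		// transfer encoding (works on GCS only).
		return putObject(src, remote.StoreURL)
	}
	m := remote.MultipartUpload
	return uploadParts(src, m.PartSize, m.PartsURL, m.CompleteURL)
}
```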
Actions
Implementing this feature requires workhorse and gitlab-ce MRs:
- Check how much time it takes to glue the S3 parts into one object: will it fit into the Unicorn timeout?
- Workhorse MR: gitlab-workhorse!257 (merged)
- GitLab MR to generate presigned uploads: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/18855
The same POC in Go can be found at https://gitlab.com/nolith-tests/multiupload/tree/master.