Implement S3 compatible uploads for unknown content-length

New description proposal

Based on gitlab-org/gitlab-ee#4184 we know that only GCS (Google Cloud Storage) accepts chunked transfer encoding; every other S3-compatible Object Storage requires the Content-Length HTTP header.

In order to support S3-compatible Object Storage we need to split the incoming file into parts that will be uploaded as separate objects; we can then leverage the MultipartUpload API calls to combine them.

Without this feature, only LFS can take advantage of workhorse direct uploads; artifacts and user uploads need it in order to work with S3-compatible object storage.

This issue covers artifact uploads only, but the outcome will also be useful when implementing gitlab-org/gitlab-ce#44663

note: with the S3-compatible API we still need to dump a copy of the file to the local disk in order to extract the artifacts metadata; future development may avoid this by implementing a multi-object HTTPReadSeeker in workhorse.

What needs to be changed

  1. The /authorize reply must declare whether workhorse has to split the upload into multiple parts. We will use a MultipartUpload key to explicitly describe how to split the upload; if the key is missing, workhorse will assume that no split is required.
  2. The /authorize call, when Object Storage is on AWS, must call Initiate Multipart Upload and generate a bunch of pre-signed URLs for the subsequent operations (UploadPart, CompleteMultipartUpload, AbortMultipartUpload).
  3. The /finalize call may include a list of parts (optionally with ETags) instead of just a single URL; when the list of parts is present, Rails will perform a MultipartUpload based on the provided chunks.
  4. When a MultipartUpload is expected, workhorse must upload parts sequentially. The part size must be known before uploading, so workhorse will write one part to disk, seek back to the beginning, and upload it; this file is overwritten by each part (maximum disk usage: MultipartUpload.PartSize × concurrent uploads). See the sketch after this list.
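
A minimal sketch of point 4, assuming hypothetical names (uploadParts, partFile — none of this is existing workhorse code): each part is buffered to a scratch file so its exact size is known, then rewound and PUT to the matching pre-signed URL, collecting the returned ETags.

import (
	"fmt"
	"io"
	"io/ioutil"
	"net/http"
	"os"
)

// uploadParts buffers one part at a time on disk so Content-Length is known,
// rewinds the scratch file and PUTs it to the next pre-signed part URL.
// The same scratch file is reused, so disk usage stays at one PartSize.
func uploadParts(body io.Reader, partsURL []string, partSize int64) ([]string, error) {
	partFile, err := ioutil.TempFile("", "multipart")
	if err != nil {
		return nil, err
	}
	defer os.Remove(partFile.Name())
	defer partFile.Close()

	var etags []string
	for _, url := range partsURL {
		// Copy at most partSize bytes of the incoming body into the scratch file.
		n, err := io.Copy(partFile, io.LimitReader(body, partSize))
		if err != nil {
			return nil, err
		}
		if n == 0 {
			break // incoming body exhausted
		}
		if _, err := partFile.Seek(0, io.SeekStart); err != nil {
			return nil, err
		}

		req, err := http.NewRequest("PUT", url, partFile)
		if err != nil {
			return nil, err
		}
		req.ContentLength = n // S3 needs the exact size of every part
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return nil, err
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return nil, fmt.Errorf("UploadPart failed: %s", resp.Status)
		}
		etags = append(etags, resp.Header.Get("ETag"))

		// Rewind and empty the scratch file before writing the next part.
		if _, err := partFile.Seek(0, io.SeekStart); err != nil {
			return nil, err
		}
		if err := partFile.Truncate(0); err != nil {
			return nil, err
		}
	}
	return etags, nil
}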

Example sequence diagram

The diagram is not completely accurate; it is only meant to illustrate the interactions and API calls.

sequenceDiagram
    participant r as gitlab-runner
    participant w as gitlab-workhorse
    participant u as gitlab-unicorn
    participant os as Object Storage

    r->>+w: upload_artifact
    w->>+u: authorize
    alt provider == 'AWS'
      u->>+os: InitiateMultipartUpload
      os-->>-u: uploadId
    end
    Note over u: generate a bunch of presigned URLs
    u-->>-w: authorization

    alt authorization.RemoteObject.MultipartUpload == nil
      Note over w,os: Only on GCS can we upload without Content-Length (using chunked-encoding)
      w->>+os: PutObject
      os-->>-w: result
    else
      loop every authorization.RemoteObject.MultipartUpload.PartSize MB of file
        w->>+os: UploadPart
        os-->>-w: eTag
      end
      w->>+os: CompleteMultipartUpload(ETags)
      os-->>-w: result
      opt something went wrong
        w->>+os: AbortMultipartUpload
        os-->>-w: 
      end
    end
    w->>+os: extract metadata using HTTPReadSeeker
    os-->>-w: seek and read the zip table
    w->>+u: finalize
    u->>+os: Copy object to the final location
    os-->>-u: 
    u-->>-r: operation result

    Note over w: Now we can remove the local file (if present)
    w->>+os: RemoveObject
    os-->>-w: 
    deactivate w
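
For reference, the CompleteMultipartUpload(ETags) step above maps to a POST against the pre-signed CompleteURL whose XML body lists every part number with its ETag. The following sketch shows roughly what that request looks like; completeUpload and the XML type names are illustrative, not existing workhorse code.

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"net/http"
)

// completedPart and completeMultipartUpload model the XML payload S3 expects
// when finishing a multipart upload.
type completedPart struct {
	PartNumber int    `xml:"PartNumber"`
	ETag       string `xml:"ETag"`
}

type completeMultipartUpload struct {
	XMLName xml.Name        `xml:"CompleteMultipartUpload"`
	Parts   []completedPart `xml:"Part"`
}

// completeUpload POSTs the collected ETags to the pre-signed CompleteURL.
func completeUpload(completeURL string, etags []string) error {
	payload := completeMultipartUpload{}
	for i, etag := range etags {
		// S3 part numbers start at 1 and must be in ascending order.
		payload.Parts = append(payload.Parts, completedPart{PartNumber: i + 1, ETag: etag})
	}
	body, err := xml.Marshal(payload)
	if err != nil {
		return err
	}
	resp, err := http.Post(completeURL, "application/xml", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("CompleteMultipartUpload failed: %s", resp.Status)
	}
	return nil
}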


Data Structure changes

RemoteObject (part of /authorize answer)

The /authorize answer will introduce a MultipartUpload key of type MultipartUploadParams:

type MultipartUploadParams struct {
	// PartSize is the exact size of each uploaded part. Only the last one can be smaller
	PartSize int64
	// PartsURL contains the presigned URLs for each part
	PartsURL []string
	// CompleteURL is a presigned URL for CompleteMultipartUpload
	CompleteURL string
	// AbortURL is a presigned URL for AbortMultipartUpload
	AbortURL string
}

type RemoteObject struct {
	// GetURL is an S3 GetObject URL
	GetURL string
	// DeleteURL is a presigned S3 RemoveObject URL
	DeleteURL string
	// StoreURL is the temporary presigned S3 PutObject URL to which the first found file will be uploaded
	StoreURL string
	// ID is a unique identifier of object storage upload
	ID string
	// Timeout is a number that represents timeout in seconds for sending data to StoreURL
	Timeout int
	// MultipartUpload contains presigned URLs for S3 MultipartUpload
	MultipartUpload *MultipartUploadParams
}
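
As a minimal sketch of how workhorse could branch on the new field (mirroring the alt block in the diagram): a nil MultipartUpload means a single PutObject relying on chunked encoding (the GCS case); otherwise the parts are uploaded sequentially and the upload is completed or aborted through the pre-signed URLs. saveRemote, putObject and abortUpload are hypothetical names, and uploadParts/completeUpload refer to the sketches above; none of this is existing workhorse code.

import (
	"io"
	"net/http"
)

// saveRemote picks the upload strategy based on the /authorize answer.
func saveRemote(body io.Reader, remote RemoteObject) error {
	if remote.MultipartUpload == nil {
		// No split requested: plain PutObject to StoreURL (chunked-encoding, GCS).
		return putObject(body, remote.StoreURL)
	}

	etags, err := uploadParts(body, remote.MultipartUpload.PartsURL, remote.MultipartUpload.PartSize)
	if err != nil {
		abortUpload(remote.MultipartUpload.AbortURL) // best-effort cleanup
		return err
	}
	return completeUpload(remote.MultipartUpload.CompleteURL, etags)
}

// abortUpload triggers AbortMultipartUpload through its pre-signed URL
// (a DELETE request); errors are ignored since this is best-effort cleanup.
func abortUpload(abortURL string) {
	req, err := http.NewRequest("DELETE", abortURL, nil)
	if err != nil {
		return
	}
	if resp, err := http.DefaultClient.Do(req); err == nil {
		resp.Body.Close()
	}
}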

Actions

Implementing this feature requires both a workhorse and a gitlab-ce MR.

A PoC in Go can be found at https://gitlab.com/nolith-tests/multiupload/tree/master
