Avoid copying objects from one bucket to another for CI artifacts
As discussed in #216442 (comment 446132728), Workhorse uploads a file to a temporary location in a bucket, and then Rails copies and deletes this file into its final location:
However, for large files, this copy step can take minutes for large files if the underlying Fog driver isn't optimized for parallel, multipart uploads, as all of them are (fog-aws will get this feature via https://github.com/fog/fog-aws/pull/579.). As a result, the Rails process step will fail, and the upload will fail.
In many cases, this store-and-copy may be unnecessary in the object storage case because the transfer has already been authorized, and partially uploaded files shouldn't get stored in object storage--especially if we are using the multipart upload that has a SHA256 checksum. For example, with CI artifacts, the files are stored by project and and job ID: https://gitlab.com/gitlab-org/gitlab/blob/614dbf8dc967fd736f9028360a88b765f6aaf9c9/app/uploaders/job_artifact_uploader.rb#L40.
@nolith points out that concurrent uploads may be a concern: if we generate the same filename for a different upload, we could run into trouble. But for CI artifacts at least, if there are two runners uploading the same artifact, then the last object would be retained even in the previous scenario.
For file attachments, we generate a random SHA, so it does seem unlikely this would happen there.
CarrierWave does this store-and-copy for the local filesystem, which does have an issue with partial uploads.
LFS may be an exception here because we may want to quarantine the object since we may want to verify its SHA256 and ensure that it matches the file. LFS files are also de-duplicated, so we shouldn't need to copy a file if it already exists.
POC
Proposal:- Create new artifact path when authorizing job artifact creation
- Set new path in workhorse options
- Modify workhorse structure to introduce new param:
RemoteFinalStorePath
- Do not delete the file if the path should be final
- Tells
CarrierWave
to not copy the file and by pointing to the uploaded file by workhorse. Overridecache!
method in https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/uploaders/job_artifact_uploader.rb
- Rails currently uses a temporary path (https://gitlab.com/gitlab-org/gitlab/blob/2e60e6ad870edd8170bd51e4f47909fc662d0890/app/uploaders/object_storage.rb#L207-208). We allow certain uploads to flag whether this is a final path.
- Workhorse will not delete this file if it that flag is set.
- Verify CarrierWave's store method works if the source and target bucket and object name are the same. We may have to do this ourselves and skip the COPY step.
This will solve #216442 (closed) without much effort.
@nolith, @nick.thomas What do you think?