Import from Amazon S3

Problem

When importing from remote object storage, GitLab validates the file's content-length and content-type before starting the import, so users learn sooner whether the import can succeed. The validation sends an HTTP HEAD request to the given URL and checks the content-length and content-type headers of the response.
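For reference, a minimal sketch of this kind of HEAD-based check in plain Ruby. The size limit, the list of allowed content types, and the method names are all hypothetical and only illustrate the idea; they are not GitLab's actual values or code.

```ruby
require 'net/http'
require 'uri'

# Hypothetical limits, for illustration only.
MAX_IMPORT_SIZE = 50 * 1024 * 1024 # 50 MiB
ALLOWED_CONTENT_TYPES = %w[application/gzip application/x-gzip application/x-tar].freeze

# Returns a list of error messages for the given header values (empty when valid).
def validate_import_headers(content_length:, content_type:)
  errors = []

  # Accept only a plain non-negative decimal number as Content-Length.
  length = content_length.to_s.match?(/\A\d+\z/) ? content_length.to_i : nil
  if length.nil?
    errors << 'missing or invalid Content-Length'
  elsif length > MAX_IMPORT_SIZE
    errors << "file is too large (#{length} bytes)"
  end

  # Drop parameters such as "; charset=binary" before comparing.
  media_type = content_type.to_s.split(';').first.to_s.strip.downcase
  unless ALLOWED_CONTENT_TYPES.include?(media_type)
    errors << "unsupported Content-Type: #{media_type.inspect}"
  end

  errors
end

# Sends a HEAD request, so only the headers are transferred, never the file body.
def head_validation_errors(url)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri)
  end
  return ["HEAD request failed with status #{response.code}"] unless response.is_a?(Net::HTTPSuccess)

  validate_import_headers(
    content_length: response['content-length'],
    content_type: response['content-type']
  )
end
```

Keeping the header checks in their own method means they can be exercised without any network access, for example `validate_import_headers(content_length: '1024', content_type: 'application/gzip')`.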

However, an Amazon S3 presigned URL only responds to a single HTTP verb, GET by default, and a GET request returns the full file in the response. To avoid downloading large files just to validate them, we currently skip this early validation when the import comes from an Amazon S3 presigned URL; see !75170 (comment 748059103).

Proposed solution

Create a new endpoint, such as POST /projects/aws-s3-import, where the user can pass the S3-specific information required to retrieve the file:

access_key_id
secret_access_key
bucket_name
file_key

Then, using the AWS S3 gem from the AWS SDK for Ruby (https://github.com/aws/aws-sdk-ruby), which is already bundled with GitLab, we could validate the file and create the URL to be saved in the database. Something like:

# A region is also needed unless one is already configured (for example via the AWS_REGION environment variable).
s3_client = Aws::S3::Client.new(
  access_key_id: params[:access_key_id],
  secret_access_key: params[:secret_access_key]
)
file = Aws::S3::Object.new(params[:bucket_name], params[:file_key], client: s3_client)
file.content_length # => retrieves the file size in bytes
file.content_type   # => retrieves the file type
file.presigned_url(:get, expires_in: 2.days.to_i) # => creates the presigned URL to be saved and used to do the import
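The "validate the file" step could look roughly like the sketch below. It only assumes an object that responds to content_length and content_type, as the Aws::S3::Object above does. The limit, the allowed types, the method name, and the FakeS3Object stand-in are hypothetical and exist only so the check can be tried without AWS credentials.

```ruby
# Hypothetical limits, for illustration only.
MAX_S3_IMPORT_SIZE = 50 * 1024 * 1024 # 50 MiB
ALLOWED_S3_CONTENT_TYPES = %w[application/gzip application/x-gzip application/x-tar].freeze

# `file` is anything that responds to #content_length and #content_type.
# Returns a list of error messages (empty when the file looks importable).
def s3_import_errors(file)
  size = file.content_length.to_i
  errors = []
  errors << 'file is empty' if size.zero?
  errors << 'file is too large' if size > MAX_S3_IMPORT_SIZE
  errors << 'unsupported content type' unless ALLOWED_S3_CONTENT_TYPES.include?(file.content_type)
  errors
end

# Stand-in for Aws::S3::Object, exposing only the two attributes the check reads.
FakeS3Object = Struct.new(:content_length, :content_type)
```

For example, `s3_import_errors(FakeS3Object.new(0, 'application/gzip'))` reports the empty file before any import job is scheduled, which also covers the "no content" case raised in the discussion below.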

Original discussion

The following discussion from !75170 (merged) should be addressed:

@reprazent started a discussion: (+11 comments)

Wouldn't changing this result in the body of the response containing the entire archive? I don't think that's something we'd want to do from the web request that creates the project, right?

Would it perhaps be better to check one of the other x-amz headers' presence and skip the content-length check in those cases? https://docs.aws.amazon.com/AmazonS3/latest/API/API_HeadObject.html#API_HeadObject_ResponseSyntax

We'd have to handle the error when we do download the file in Sidekiq and there turns out to be no content.
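The header-based alternative suggested in this discussion could be sketched as follows. It assumes the response headers are available as a hash of name/value pairs; the method name is hypothetical. S3 responses normally carry headers such as x-amz-request-id and x-amz-id-2, so any header with the x-amz- prefix is treated as a sign that the URL points at S3.

```ruby
# Returns true when any header name starts with "x-amz-".
# The comparison is case-insensitive because HTTP header names are case-insensitive.
def s3_response?(headers)
  headers.keys.any? { |name| name.to_s.downcase.start_with?('x-amz-') }
end

# Example: decide whether the early content-length check should be skipped.
headers = { 'x-amz-request-id' => 'EXAMPLE', 'Content-Type' => 'application/xml' }
skip_content_length_check = s3_response?(headers)
```

With this approach the size would remain unknown until the download runs in Sidekiq, so the empty-file error mentioned above would still need handling there.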