Draft: ActiveStorage experiment
🌚 Context
In #348959 (closed), we opened the discussion about removing the dependency to CarrierWave for file uploads.
The idea revolves around two tables:
-
blobs: this table holds the information on the file upload and more importantly, the key under which is the file is stored (in object storage or the file system) -
attachments: this is simply the link between a single blob and whatever business model makes sense (polymorphic assocation)
This approach is really similar to what ActiveStorage uses.
This led us to ask ourselves:
- Can we use ActiveStorage for GitLab uploads?
- If yes, what's the price?
- If no, what are the blocking points?
🔭 Scope
The scope of this experiment is highly tight due to the amount of time I have at my disposal for this.
We're going to try adding ActiveStorage support for the generic packages registry. That registry works solely with body uploads. We're going to focus on two user actions:
- Uploading a generic package to GitLab.
- Download the generic package from GitLab.
To simplify things, we're going to use these settings:
- GitLab connected to an S3 object storage (minio)
- Non consolidated object storage configuration.
GDK config:
object_store:
enabled: true
consolidated_form: false
Additional restrictions:
-
blobsandattachmentstables should be prefixable. In our case, we should save rows inpackages_blobsandpackages_attachments.
Additional remarks:
- The code changes are not meant to be production ready.
- The code changes will use several shortcuts.
🗒 Summary of changes
🗄 Active Storage integration
- Enable the
active_storageframework - Configure it so that (initializer)
- ActiveStorage routes are not added to the application router
- We don't have any
previewersandanalyzers - Generate the service configurations from the GitLab config
- "Disable" variants
- Adds a new macro:
has_one_file_attachedthat describes one file attachment on whatever name - Overrides methods in
ActiveStorage::Attached::Changes::CreateOneto not directly referenceActiveStorage::AttachmentorActiveStorage::Blob - Adds a migrations for tables
packages_blobsandpackages_attachments - Adds a custom
ActiveStorage::Servicefor S3- Overrides the pre signed upload url generation to remove
content_typeandcontent_sizesigning.
- Overrides the pre signed upload url generation to remove
🐎 Workhorse changes
- Update
workhose_authorizeso that the blob class can be passed. - Adds a
ActiveStorage::DirectUploadclass to handle blobs creation. - Handles the
BlobSignedIdparameter during theauthorizeandfinalizecalls.
📦 Packages Registry changes
- On the
authorizehandler, pass the blob class. - On the
finalizehandler, read the blob signed id. - In the package file creation service, accept a blob signed id and call
attachwith it. - In the endpoint that returns the package, simply redirect to the signed url of object storage
🍿 Demo
🔼 Upload
$ less dummy.txt
bananas!
$ curl --header "PRIVATE-TOKEN: <PAT>" --upload-file ./dummy.txt \
"http://gdk.test:8000/api/v4/projects/511/packages/generic/as_experiment/1.3.7/file.txt"
{"message":"201 Created"}
Rails logs (truncated):
Started PUT "/api/v4/projects/511/packages/generic/as_experiment/1.3.7/file.txt/authorize" for 127.0.0.1 at 2022-01-24 14:43:13 +0100
[snip snip]
TRANSACTION (0.3ms) BEGIN
↳ app/models/packages/blob.rb:13:in `create_before_direct_upload!'
Packages::Blob Create (1.3ms) INSERT INTO "packages_blobs" ("key", "filename", "content_type", "service_name", "created_at") VALUES ('b2/c7/b2c7c3913bc37ed0f643f40a3a154ff41f62d9df42bb7b81ff8a155cf66851f3/905nj0gf7begaa81ryrfoj9wsage', '', '', 'packages', '2022-01-24 13:43:13.762835') RETURNING "id"
↳ app/models/packages/blob.rb:13:in `create_before_direct_upload!'
TRANSACTION (0.5ms) COMMIT
↳ lib/gitlab/database.rb:298:in `commit'
GitlabS3 Storage (1.6ms) Generated URL for file at key: b2/c7/b2c7c3913bc37ed0f643f40a3a154ff41f62d9df42bb7b81ff8a155cf66851f3/905nj0gf7begaa81ryrfoj9wsage (http://127.0.0.1:9000/packages/b2/c7/b2c7c3913bc37ed0f643f40a3a154ff41f62d9df42bb7b81ff8a155cf66851f3/905nj0gf7begaa81ryrfoj9wsage?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=minio%2F20220124%2Fgdk%2Fs3%2Faws4_request&X-Amz-Date=20220124T134313Z&X-Amz-Expires=300&X-Amz-SignedHeaders=host&X-Amz-Signature=2df2af71843dd71ac59c41d01fa375b080b3c6e2c8cd4767acc76150ba4ff43d)
Started PUT "/api/v4/projects/511/packages/generic/as_experiment/1.3.7/file.txt" for 172.16.123.1 at 2022-01-24 14:43:13 +0100
[snip snip]
TRANSACTION (0.3ms) BEGIN
↳ app/services/packages/create_package_service.rb:14:in `find_or_create_package!'
Packages::Package Load (0.7ms) SELECT "packages_packages".* FROM "packages_packages" WHERE "packages_packages"."project_id" = 511 AND "packages_packages"."package_type" = 7 AND "packages_packages"."name" = 'as_experiment' AND "packages_packages"."version" = '1.3.7' LIMIT 1
↳ app/services/packages/create_package_service.rb:14:in `find_or_create_package!'
Namespace::PackageSetting Load (0.3ms) SELECT "namespace_package_settings".* FROM "namespace_package_settings" WHERE "namespace_package_settings"."namespace_id" = 1 LIMIT 1
↳ app/models/namespace.rb:245:in `package_settings'
Packages::Blob Load (0.7ms) SELECT "packages_blobs".* FROM "packages_blobs" WHERE "packages_blobs"."id" = 33 LIMIT 1
↳ config/initializers/active_storage.rb:111:in `find_or_build_blob'
Packages::PackageFile Create (0.5ms) INSERT INTO "packages_package_files" ("package_id", "created_at", "updated_at", "size", "file_name", "file") VALUES (813, '2022-01-24 13:43:14.451751', '2022-01-24 13:43:14.451751', 0, 'file.txt', 'file.txt') RETURNING "id"
↳ app/services/packages/create_package_file_service.rb:26:in `execute'
Packages::PackageFile Update (0.4ms) UPDATE "packages_package_files" SET "file_store" = 1 WHERE "packages_package_files"."id" = 1529
↳ app/models/concerns/file_store_mounter.rb:17:in `update_file_store'
Packages::Blob Load (0.5ms) SELECT "packages_blobs".* FROM "packages_blobs" INNER JOIN "packages_attachments" ON "packages_blobs"."id" = "packages_attachments"."blob_id" WHERE "packages_attachments"."record_id" = 1529 AND "packages_attachments"."record_type" = 'Packages::PackageFile' AND "packages_attachments"."name" = 'active_storage_file' LIMIT 1
↳ lib/gem_extensions/active_record/associations/has_one_through_association.rb:12:in `find_target'
Packages::Attachment Load (0.4ms) SELECT "packages_attachments".* FROM "packages_attachments" WHERE "packages_attachments"."record_id" = 1529 AND "packages_attachments"."record_type" = 'Packages::PackageFile' AND "packages_attachments"."name" = 'active_storage_file' LIMIT 1
↳ config/initializers/active_storage.rb:64:in `block in has_one_file_attached'
Packages::Blob Update (0.4ms) UPDATE "packages_blobs" SET "content_type" = 'application/octet-stream', "metadata" = '{"identified":true}' WHERE "packages_blobs"."id" = 33
↳ lib/gitlab/database.rb:265:in `block in transaction'
Packages::Attachment Create (0.4ms) INSERT INTO "packages_attachments" ("name", "record_type", "record_id", "blob_id", "created_at") VALUES ('active_storage_file', 'Packages::PackageFile', 1529, 33, '2022-01-24 13:43:14.487185') RETURNING "id"
↳ lib/gitlab/database.rb:265:in `block in transaction'
Packages::PackageFile Update (0.6ms) UPDATE "packages_package_files" SET "updated_at" = '2022-01-24 13:43:14.492606' WHERE "packages_package_files"."id" = 1529
↳ lib/gitlab/database.rb:265:in `block in transaction'
TRANSACTION (0.7ms) COMMIT
↳ lib/gitlab/database.rb:298:in `commit'
TRANSACTION (0.3ms) BEGIN
↳ config/initializers/forbid_sidekiq_in_transactions.rb:58:in `block in committed!'
Packages::Blob Update (0.4ms) UPDATE "packages_blobs" SET "metadata" = '{"identified":true,"analyzed":true}' WHERE "packages_blobs"."id" = 33
↳ config/initializers/forbid_sidekiq_in_transactions.rb:58:in `block in committed!'
TRANSACTION (0.4ms) COMMIT
↳ lib/gitlab/database.rb:298:in `commit'
🔽 Download
$ curl -L --header "PRIVATE-TOKEN: <PAT>" "http://gdk.test:8000/api/v4/projects/511/packages/generic/as_experiment/1.3.7/file.txt"
bananas!
Rails logs (truncated):
Started GET "/api/v4/projects/511/packages/generic/as_experiment/1.3.7/file.txt" for 172.16.123.1 at 2022-01-24 14:49:43 +0100
[snip snip]
Packages::Package Load (0.6ms) SELECT "packages_packages".* FROM "packages_packages" WHERE "packages_packages"."project_id" = 511 AND "packages_packages"."package_type" = 7 AND "packages_packages"."status" = 0 AND "packages_packages"."name" = 'as_experiment' AND "packages_packages"."version" = '1.3.7' LIMIT 1
↳ app/models/packages/package.rb:192:in `by_name_and_version!'
Packages::PackageFile Load (0.7ms) SELECT "packages_package_files".* FROM "packages_package_files" WHERE "packages_package_files"."package_id" = 813 AND "packages_package_files"."file_name" = 'file.txt' ORDER BY "packages_package_files"."id" DESC LIMIT 1
↳ app/finders/packages/package_file_finder.rb:16:in `execute!'
[snip snip]
Packages::Blob Load (0.6ms) SELECT "packages_blobs".* FROM "packages_blobs" INNER JOIN "packages_attachments" ON "packages_blobs"."id" = "packages_attachments"."blob_id" WHERE "packages_attachments"."record_id" = 1529 AND "packages_attachments"."record_type" = 'Packages::PackageFile' AND "packages_attachments"."name" = 'active_storage_file' LIMIT 1
↳ lib/gem_extensions/active_record/associations/has_one_through_association.rb:12:in `find_target'
GitlabS3 Storage (1.8ms) Generated URL for file at key: b2/c7/b2c7c3913bc37ed0f643f40a3a154ff41f62d9df42bb7b81ff8a155cf66851f3/905nj0gf7begaa81ryrfoj9wsage (http://127.0.0.1:9000/packages/b2/c7/b2c7c3913bc37ed0f643f40a3a154ff41f62d9df42bb7b81ff8a155cf66851f3/905nj0gf7begaa81ryrfoj9wsage?response-content-disposition=attachment%3B%20filename%3D%22%22%3B%20filename%2A%3DUTF-8%27%27&response-content-type=application%2Foctet-stream&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=minio%2F20220124%2Fgdk%2Fs3%2Faws4_request&X-Amz-Date=20220124T134943Z&X-Amz-Expires=300&X-Amz-SignedHeaders=host&X-Amz-Signature=9967549d01d4cecc7544787785b64a5d883586383c47639965c05f4afa6d1b34)
💬 Discussion
💾 Object Storage interactions
Current situation:
- Uploading
- Workhorse: 2 (1 PUT, 1 DELETE)
- Rails: 4 (1 HEAD, 1 PUT (copy), 1 HEAD, 1 DELETE)
- Download (returning a redirect)
- Workhorse: 0
- Rails: 0
With ActiveStorage:
-
Uploading
- Workhorse: 1 PUT
- Rails: 0
-
Download (returning a redirect)
- Workhorse: 0
- Rails: 0
-
No changes for the download operation
-
6->1requests for the upload operation.- In particular, there is no PUT (copy) request anymore!
🎉
- In particular, there is no PUT (copy) request anymore!
-
Reducing the amount of interactions with object storage during the upload grants:
- Faster handling.
- Reduced overall cost as those requests are not free.
👍 Upsides
- The
finalizecall doesn't need to be handled by a middleware. It is now simplified to parameters handling: reading theblob_signed_idparameter and act upon it. - The
#attachapi is really developer friendly. Example: pass a blob signed id to it and it will create the attachment for you if the object is persisted, otherwise it will record an attachment creation for when the business model is saved. - Easily compatible with our current Workhorse <-> Rails logic flows (
authorizeandfinalizerequests) - the
_blobstable becomes the single source of truth for uploaded files:- if a file on object storage is not in this table, it shouldn't exist in object storage
- the responsibility of cleaning up dangling blobs (blobs that have no attachments) falls in rails: Workhorse can avoid sending DELETE requests (=simplification).
- The
authorizecall works with a reference (the blob signed id) = thefinalizecall doesn't need to pass all the information around the upload, passing the reference is enough = we can certainly not use a middleware anymore because passing the blob signed id is enough for rails to locate the upload during thefinalizecall.
- The previous point has an impact on specs: we don't need to test against all the possible forms of uploaded files. Testing against: 1. a valid blob signed id and 2. an invalid or absent blob signed id is enough to cover all the situations.
- Basically, developers can ignore how the blob was created and uploaded and fully focused on testing what happens when there is a valid and invalid blob signed id.
- The concept of a blob signed id is really valuable. Rails can protect blob id. This way, if Workhorse or
👿 users mess up with theblob_signed_id, thefinalizecall will get rejected. - The approach of
blobsandattachmentstable is compatible with a background migration that will move the existing uploads to those tables. - ActiveStorage is being actively worked on as part of the rails framework.
- We can easily implement a "gitlab" version of the adapters (s3 service) to add additional options.
- Moving an uploaded file or removing it becomes trivial: simply update or remove the row from the
attachmentstable.- This creates dangling
blobsbut we can easily imagine having a cleanup worker that will take care of those.
- This creates dangling
👎 Downsides
- We have a lot of overrides within ActiveStorage because of:
- variants that we will never really use.
- several references to
ActiveStorage::AttachmentandActiveStorage::Blob.
- ActiveStorage seems to be geared towards "media" files such as images and videos whereas the bulk of uploads in GitLab is simply archives or binaries objects (zip or tar.gz files)
- There are several features are these "media" files such as analyzers and previewers that we don't really need for now. Disabling them for good means more overrides.
- Those overrides feel fragile or breakable easily.
- ActiveStorage macros
has_one_attachedandhas_many_attachedare still accessible. - The
attachmentstable is using a polymorphic association which is not recommend by the database guidelines.
🎩 Conclusions
- Can we use ActiveStorage for GitLab uploads? That's a yes.
- If yes, what's the price? Several overrides. Some of them are quite deep within ActiveStorage.
I think that (2.) is a price to high to pay for the scale of GitLab. I don't think we can afford running ActiveStorage with several bandages around it to "bend" it GitLab needs. The result will not:
- look professional
- be reliable.
- Eg. bumping ActiveStorage or Rails could break the overrides.
Having said that, the blobs + attachments approach is a really valuable and simple one. We should aim for that same simplicity.
I would thus suggest following #348959 (closed) with a custom implementation but take heavy inspirations from ActiveStorage:
- blob signed ids
- service/adapter approach to handle all the interactions with the different object storage backends (including FileStore).
- re-use the
attachAPI as it is a really powerful one.
If one day, ActiveStorage provides optional variants features and different blobs/attachments tables, we can revisit if that jump is valuable.
🔮 Other considerations
While working on this, I noticed other improvements we could implement:
- Make the
authorizeresponse compatible with returning multiple blob signed ids. This way, workhorse could ask for two files at once (example: when handling CI artifacts uploads) - the object storage configuration is passed back to Workhorse in the
authorizeresponse. I've been wondering if we could pass the credentials too (in an encrypted manner).🤔 - This way, workhorse can use an official client to upload the file (just as it does with the consolidated configuration)