Draft: ActiveStorage experiment
🌚 Context
In #348959, we opened the discussion about removing the dependency to CarrierWave for file uploads.
The idea revolves around two tables:
-
blobs
: this table holds the information on the file upload and more importantly, the key under which is the file is stored (in object storage or the file system) -
attachments
: this is simply the link between a single blob and whatever business model makes sense (polymorphic assocation)
This approach is really similar to what ActiveStorage uses.
This led us to ask ourselves:
- Can we use ActiveStorage for GitLab uploads?
- If yes, what's the price?
- If no, what are the blocking points?
🔭 Scope
The scope of this experiment is highly tight due to the amount of time I have at my disposal for this.
We're going to try adding ActiveStorage support for the generic packages registry. That registry works solely with body uploads. We're going to focus on two user actions:
- Uploading a generic package to GitLab.
- Download the generic package from GitLab.
To simplify things, we're going to use these settings:
- GitLab connected to an S3 object storage (minio)
- Non consolidated object storage configuration.
GDK config:
object_store:
enabled: true
consolidated_form: false
Additional restrictions:
-
blobs
andattachments
tables should be prefixable. In our case, we should save rows inpackages_blobs
andpackages_attachments
.
Additional remarks:
- The code changes are not meant to be production ready.
- The code changes will use several shortcuts.
🗒 Summary of changes
🗄 Active Storage integration
- Enable the
active_storage
framework - Configure it so that (initializer)
- ActiveStorage routes are not added to the application router
- We don't have any
previewers
andanalyzers
- Generate the service configurations from the GitLab config
- "Disable" variants
- Adds a new macro:
has_one_file_attached
that describes one file attachment on whatever name - Overrides methods in
ActiveStorage::Attached::Changes::CreateOne
to not directly referenceActiveStorage::Attachment
orActiveStorage::Blob
- Adds a migrations for tables
packages_blobs
andpackages_attachments
- Adds a custom
ActiveStorage::Service
for S3- Overrides the pre signed upload url generation to remove
content_type
andcontent_size
signing.
- Overrides the pre signed upload url generation to remove
🐎 Workhorse changes
- Update
workhose_authorize
so that the blob class can be passed. - Adds a
ActiveStorage::DirectUpload
class to handle blobs creation. - Handles the
BlobSignedId
parameter during theauthorize
andfinalize
calls.
📦 Packages Registry changes
- On the
authorize
handler, pass the blob class. - On the
finalize
handler, read the blob signed id. - In the package file creation service, accept a blob signed id and call
attach
with it. - In the endpoint that returns the package, simply redirect to the signed url of object storage
🍿 Demo
🔼 Upload
$ less dummy.txt
bananas!
$ curl --header "PRIVATE-TOKEN: <PAT>" --upload-file ./dummy.txt \
"http://gdk.test:8000/api/v4/projects/511/packages/generic/as_experiment/1.3.7/file.txt"
{"message":"201 Created"}
Rails logs (truncated):
Started PUT "/api/v4/projects/511/packages/generic/as_experiment/1.3.7/file.txt/authorize" for 127.0.0.1 at 2022-01-24 14:43:13 +0100
[snip snip]
TRANSACTION (0.3ms) BEGIN
↳ app/models/packages/blob.rb:13:in `create_before_direct_upload!'
Packages::Blob Create (1.3ms) INSERT INTO "packages_blobs" ("key", "filename", "content_type", "service_name", "created_at") VALUES ('b2/c7/b2c7c3913bc37ed0f643f40a3a154ff41f62d9df42bb7b81ff8a155cf66851f3/905nj0gf7begaa81ryrfoj9wsage', '', '', 'packages', '2022-01-24 13:43:13.762835') RETURNING "id"
↳ app/models/packages/blob.rb:13:in `create_before_direct_upload!'
TRANSACTION (0.5ms) COMMIT
↳ lib/gitlab/database.rb:298:in `commit'
GitlabS3 Storage (1.6ms) Generated URL for file at key: b2/c7/b2c7c3913bc37ed0f643f40a3a154ff41f62d9df42bb7b81ff8a155cf66851f3/905nj0gf7begaa81ryrfoj9wsage (http://127.0.0.1:9000/packages/b2/c7/b2c7c3913bc37ed0f643f40a3a154ff41f62d9df42bb7b81ff8a155cf66851f3/905nj0gf7begaa81ryrfoj9wsage?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=minio%2F20220124%2Fgdk%2Fs3%2Faws4_request&X-Amz-Date=20220124T134313Z&X-Amz-Expires=300&X-Amz-SignedHeaders=host&X-Amz-Signature=2df2af71843dd71ac59c41d01fa375b080b3c6e2c8cd4767acc76150ba4ff43d)
Started PUT "/api/v4/projects/511/packages/generic/as_experiment/1.3.7/file.txt" for 172.16.123.1 at 2022-01-24 14:43:13 +0100
[snip snip]
TRANSACTION (0.3ms) BEGIN
↳ app/services/packages/create_package_service.rb:14:in `find_or_create_package!'
Packages::Package Load (0.7ms) SELECT "packages_packages".* FROM "packages_packages" WHERE "packages_packages"."project_id" = 511 AND "packages_packages"."package_type" = 7 AND "packages_packages"."name" = 'as_experiment' AND "packages_packages"."version" = '1.3.7' LIMIT 1
↳ app/services/packages/create_package_service.rb:14:in `find_or_create_package!'
Namespace::PackageSetting Load (0.3ms) SELECT "namespace_package_settings".* FROM "namespace_package_settings" WHERE "namespace_package_settings"."namespace_id" = 1 LIMIT 1
↳ app/models/namespace.rb:245:in `package_settings'
Packages::Blob Load (0.7ms) SELECT "packages_blobs".* FROM "packages_blobs" WHERE "packages_blobs"."id" = 33 LIMIT 1
↳ config/initializers/active_storage.rb:111:in `find_or_build_blob'
Packages::PackageFile Create (0.5ms) INSERT INTO "packages_package_files" ("package_id", "created_at", "updated_at", "size", "file_name", "file") VALUES (813, '2022-01-24 13:43:14.451751', '2022-01-24 13:43:14.451751', 0, 'file.txt', 'file.txt') RETURNING "id"
↳ app/services/packages/create_package_file_service.rb:26:in `execute'
Packages::PackageFile Update (0.4ms) UPDATE "packages_package_files" SET "file_store" = 1 WHERE "packages_package_files"."id" = 1529
↳ app/models/concerns/file_store_mounter.rb:17:in `update_file_store'
Packages::Blob Load (0.5ms) SELECT "packages_blobs".* FROM "packages_blobs" INNER JOIN "packages_attachments" ON "packages_blobs"."id" = "packages_attachments"."blob_id" WHERE "packages_attachments"."record_id" = 1529 AND "packages_attachments"."record_type" = 'Packages::PackageFile' AND "packages_attachments"."name" = 'active_storage_file' LIMIT 1
↳ lib/gem_extensions/active_record/associations/has_one_through_association.rb:12:in `find_target'
Packages::Attachment Load (0.4ms) SELECT "packages_attachments".* FROM "packages_attachments" WHERE "packages_attachments"."record_id" = 1529 AND "packages_attachments"."record_type" = 'Packages::PackageFile' AND "packages_attachments"."name" = 'active_storage_file' LIMIT 1
↳ config/initializers/active_storage.rb:64:in `block in has_one_file_attached'
Packages::Blob Update (0.4ms) UPDATE "packages_blobs" SET "content_type" = 'application/octet-stream', "metadata" = '{"identified":true}' WHERE "packages_blobs"."id" = 33
↳ lib/gitlab/database.rb:265:in `block in transaction'
Packages::Attachment Create (0.4ms) INSERT INTO "packages_attachments" ("name", "record_type", "record_id", "blob_id", "created_at") VALUES ('active_storage_file', 'Packages::PackageFile', 1529, 33, '2022-01-24 13:43:14.487185') RETURNING "id"
↳ lib/gitlab/database.rb:265:in `block in transaction'
Packages::PackageFile Update (0.6ms) UPDATE "packages_package_files" SET "updated_at" = '2022-01-24 13:43:14.492606' WHERE "packages_package_files"."id" = 1529
↳ lib/gitlab/database.rb:265:in `block in transaction'
TRANSACTION (0.7ms) COMMIT
↳ lib/gitlab/database.rb:298:in `commit'
TRANSACTION (0.3ms) BEGIN
↳ config/initializers/forbid_sidekiq_in_transactions.rb:58:in `block in committed!'
Packages::Blob Update (0.4ms) UPDATE "packages_blobs" SET "metadata" = '{"identified":true,"analyzed":true}' WHERE "packages_blobs"."id" = 33
↳ config/initializers/forbid_sidekiq_in_transactions.rb:58:in `block in committed!'
TRANSACTION (0.4ms) COMMIT
↳ lib/gitlab/database.rb:298:in `commit'
🔽 Download
$ curl -L --header "PRIVATE-TOKEN: <PAT>" "http://gdk.test:8000/api/v4/projects/511/packages/generic/as_experiment/1.3.7/file.txt"
bananas!
Rails logs (truncated):
Started GET "/api/v4/projects/511/packages/generic/as_experiment/1.3.7/file.txt" for 172.16.123.1 at 2022-01-24 14:49:43 +0100
[snip snip]
Packages::Package Load (0.6ms) SELECT "packages_packages".* FROM "packages_packages" WHERE "packages_packages"."project_id" = 511 AND "packages_packages"."package_type" = 7 AND "packages_packages"."status" = 0 AND "packages_packages"."name" = 'as_experiment' AND "packages_packages"."version" = '1.3.7' LIMIT 1
↳ app/models/packages/package.rb:192:in `by_name_and_version!'
Packages::PackageFile Load (0.7ms) SELECT "packages_package_files".* FROM "packages_package_files" WHERE "packages_package_files"."package_id" = 813 AND "packages_package_files"."file_name" = 'file.txt' ORDER BY "packages_package_files"."id" DESC LIMIT 1
↳ app/finders/packages/package_file_finder.rb:16:in `execute!'
[snip snip]
Packages::Blob Load (0.6ms) SELECT "packages_blobs".* FROM "packages_blobs" INNER JOIN "packages_attachments" ON "packages_blobs"."id" = "packages_attachments"."blob_id" WHERE "packages_attachments"."record_id" = 1529 AND "packages_attachments"."record_type" = 'Packages::PackageFile' AND "packages_attachments"."name" = 'active_storage_file' LIMIT 1
↳ lib/gem_extensions/active_record/associations/has_one_through_association.rb:12:in `find_target'
GitlabS3 Storage (1.8ms) Generated URL for file at key: b2/c7/b2c7c3913bc37ed0f643f40a3a154ff41f62d9df42bb7b81ff8a155cf66851f3/905nj0gf7begaa81ryrfoj9wsage (http://127.0.0.1:9000/packages/b2/c7/b2c7c3913bc37ed0f643f40a3a154ff41f62d9df42bb7b81ff8a155cf66851f3/905nj0gf7begaa81ryrfoj9wsage?response-content-disposition=attachment%3B%20filename%3D%22%22%3B%20filename%2A%3DUTF-8%27%27&response-content-type=application%2Foctet-stream&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=minio%2F20220124%2Fgdk%2Fs3%2Faws4_request&X-Amz-Date=20220124T134943Z&X-Amz-Expires=300&X-Amz-SignedHeaders=host&X-Amz-Signature=9967549d01d4cecc7544787785b64a5d883586383c47639965c05f4afa6d1b34)
💬 Discussion
💾 Object Storage interactions
Current situation:
- Uploading
- Workhorse: 2 (1 PUT, 1 DELETE)
- Rails: 4 (1 HEAD, 1 PUT (copy), 1 HEAD, 1 DELETE)
- Download (returning a redirect)
- Workhorse: 0
- Rails: 0
With ActiveStorage:
-
Uploading
- Workhorse: 1 PUT
- Rails: 0
-
Download (returning a redirect)
- Workhorse: 0
- Rails: 0
-
No changes for the download operation
-
6
->1
requests for the upload operation.- In particular, there is no PUT (copy) request anymore!
🎉
- In particular, there is no PUT (copy) request anymore!
-
Reducing the amount of interactions with object storage during the upload grants:
- Faster handling.
- Reduced overall cost as those requests are not free.
👍 Upsides
- The
finalize
call doesn't need to be handled by a middleware. It is now simplified to parameters handling: reading theblob_signed_id
parameter and act upon it. - The
#attach
api is really developer friendly. Example: pass a blob signed id to it and it will create the attachment for you if the object is persisted, otherwise it will record an attachment creation for when the business model is saved. - Easily compatible with our current Workhorse <-> Rails logic flows (
authorize
andfinalize
requests) - the
_blobs
table becomes the single source of truth for uploaded files:- if a file on object storage is not in this table, it shouldn't exist in object storage
- the responsibility of cleaning up dangling blobs (blobs that have no attachments) falls in rails: Workhorse can avoid sending DELETE requests (=simplification).
- The
authorize
call works with a reference (the blob signed id) = thefinalize
call doesn't need to pass all the information around the upload, passing the reference is enough = we can certainly not use a middleware anymore because passing the blob signed id is enough for rails to locate the upload during thefinalize
call.
- The previous point has an impact on specs: we don't need to test against all the possible forms of uploaded files. Testing against: 1. a valid blob signed id and 2. an invalid or absent blob signed id is enough to cover all the situations.
- Basically, developers can ignore how the blob was created and uploaded and fully focused on testing what happens when there is a valid and invalid blob signed id.
- The concept of a blob signed id is really valuable. Rails can protect blob id. This way, if Workhorse or
👿 users mess up with theblob_signed_id
, thefinalize
call will get rejected. - The approach of
blobs
andattachments
table is compatible with a background migration that will move the existing uploads to those tables. - ActiveStorage is being actively worked on as part of the rails framework.
- We can easily implement a "gitlab" version of the adapters (s3 service) to add additional options.
- Moving an uploaded file or removing it becomes trivial: simply update or remove the row from the
attachments
table.- This creates dangling
blobs
but we can easily imagine having a cleanup worker that will take care of those.
- This creates dangling
👎 Downsides
- We have a lot of overrides within ActiveStorage because of:
- variants that we will never really use.
- several references to
ActiveStorage::Attachment
andActiveStorage::Blob
.
- ActiveStorage seems to be geared towards "media" files such as images and videos whereas the bulk of uploads in GitLab is simply archives or binaries objects (zip or tar.gz files)
- There are several features are these "media" files such as analyzers and previewers that we don't really need for now. Disabling them for good means more overrides.
- Those overrides feel fragile or breakable easily.
- ActiveStorage macros
has_one_attached
andhas_many_attached
are still accessible. - The
attachments
table is using a polymorphic association which is not recommend by the database guidelines.
🎩 Conclusions
- Can we use ActiveStorage for GitLab uploads? That's a yes.
- If yes, what's the price? Several overrides. Some of them are quite deep within ActiveStorage.
I think that (2.) is a price to high to pay for the scale of GitLab. I don't think we can afford running ActiveStorage with several bandages around it to "bend" it GitLab needs. The result will not:
- look professional
- be reliable.
- Eg. bumping ActiveStorage or Rails could break the overrides.
Having said that, the blobs
+ attachments
approach is a really valuable and simple one. We should aim for that same simplicity.
I would thus suggest following #348959 with a custom implementation but take heavy inspirations from ActiveStorage:
- blob signed ids
- service/adapter approach to handle all the interactions with the different object storage backends (including FileStore).
- re-use the
attach
API as it is a really powerful one.
If one day, ActiveStorage provides optional variants features and different blobs/attachments tables, we can revisit if that jump is valuable.
🔮 Other considerations
While working on this, I noticed other improvements we could implement:
- Make the
authorize
response compatible with returning multiple blob signed ids. This way, workhorse could ask for two files at once (example: when handling CI artifacts uploads) - the object storage configuration is passed back to Workhorse in the
authorize
response. I've been wondering if we could pass the credentials too (in an encrypted manner).🤔 - This way, workhorse can use an official client to upload the file (just as it does with the consolidated configuration)