Skip to content

Draft: ActiveStorage experiment

David Fernandez requested to merge 10io-active-storage-investigation into master

Don't ever merge this MR. This is not a production ready change

Code quality is low on purpose, see the Scope section

🌚 Context

In #348959, we opened the discussion about removing the dependency to CarrierWave for file uploads.

The idea revolves around two tables:

  • blobs : this table holds the information on the file upload and more importantly, the key under which is the file is stored (in object storage or the file system)
  • attachments: this is simply the link between a single blob and whatever business model makes sense (polymorphic assocation)

This approach is really similar to what ActiveStorage uses.

This led us to ask ourselves:

  • Can we use ActiveStorage for GitLab uploads?
    • If yes, what's the price?
    • If no, what are the blocking points?

🔭 Scope

The scope of this experiment is highly tight due to the amount of time I have at my disposal for this.

We're going to try adding ActiveStorage support for the generic packages registry. That registry works solely with body uploads. We're going to focus on two user actions:

  1. Uploading a generic package to GitLab.
  2. Download the generic package from GitLab.

To simplify things, we're going to use these settings:

  • GitLab connected to an S3 object storage (minio)
  • Non consolidated object storage configuration.

GDK config:

object_store:
  enabled: true
  consolidated_form: false

Additional restrictions:

  • blobs and attachments tables should be prefixable. In our case, we should save rows in packages_blobs and packages_attachments.

Additional remarks:

  • The code changes are not meant to be production ready.
  • The code changes will use several shortcuts.

🗒 Summary of changes

🗄 Active Storage integration

  • Enable the active_storage framework
  • Configure it so that (initializer)
    • ActiveStorage routes are not added to the application router
    • We don't have any previewers and analyzers
    • Generate the service configurations from the GitLab config
  • "Disable" variants
  • Adds a new macro: has_one_file_attached that describes one file attachment on whatever name
  • Overrides methods in ActiveStorage::Attached::Changes::CreateOne to not directly reference ActiveStorage::Attachment or ActiveStorage::Blob
  • Adds a migrations for tables packages_blobs and packages_attachments
  • Adds a custom ActiveStorage::Service for S3
    • Overrides the pre signed upload url generation to remove content_type and content_size signing.

🐎 Workhorse changes

  • Update workhose_authorize so that the blob class can be passed.
  • Adds a ActiveStorage::DirectUpload class to handle blobs creation.
  • Handles the BlobSignedId parameter during the authorize and finalize calls.

📦 Packages Registry changes

  • On the authorize handler, pass the blob class.
  • On the finalize handler, read the blob signed id.
  • In the package file creation service, accept a blob signed id and call attach with it.
  • In the endpoint that returns the package, simply redirect to the signed url of object storage

🍿 Demo

🔼 Upload

$ less dummy.txt
bananas!

$ curl --header "PRIVATE-TOKEN: <PAT>" --upload-file ./dummy.txt \ 
     "http://gdk.test:8000/api/v4/projects/511/packages/generic/as_experiment/1.3.7/file.txt"
{"message":"201 Created"}

Rails logs (truncated):

Started PUT "/api/v4/projects/511/packages/generic/as_experiment/1.3.7/file.txt/authorize" for 127.0.0.1 at 2022-01-24 14:43:13 +0100

[snip snip]

  TRANSACTION (0.3ms)  BEGIN
  ↳ app/models/packages/blob.rb:13:in `create_before_direct_upload!'
  Packages::Blob Create (1.3ms)  INSERT INTO "packages_blobs" ("key", "filename", "content_type", "service_name", "created_at") VALUES ('b2/c7/b2c7c3913bc37ed0f643f40a3a154ff41f62d9df42bb7b81ff8a155cf66851f3/905nj0gf7begaa81ryrfoj9wsage', '', '', 'packages', '2022-01-24 13:43:13.762835') RETURNING "id" 
  ↳ app/models/packages/blob.rb:13:in `create_before_direct_upload!'
  TRANSACTION (0.5ms)  COMMIT 
  ↳ lib/gitlab/database.rb:298:in `commit'
  GitlabS3 Storage (1.6ms) Generated URL for file at key: b2/c7/b2c7c3913bc37ed0f643f40a3a154ff41f62d9df42bb7b81ff8a155cf66851f3/905nj0gf7begaa81ryrfoj9wsage (http://127.0.0.1:9000/packages/b2/c7/b2c7c3913bc37ed0f643f40a3a154ff41f62d9df42bb7b81ff8a155cf66851f3/905nj0gf7begaa81ryrfoj9wsage?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=minio%2F20220124%2Fgdk%2Fs3%2Faws4_request&X-Amz-Date=20220124T134313Z&X-Amz-Expires=300&X-Amz-SignedHeaders=host&X-Amz-Signature=2df2af71843dd71ac59c41d01fa375b080b3c6e2c8cd4767acc76150ba4ff43d)

Started PUT "/api/v4/projects/511/packages/generic/as_experiment/1.3.7/file.txt" for 172.16.123.1 at 2022-01-24 14:43:13 +0100

[snip snip]

  TRANSACTION (0.3ms)  BEGIN 
  ↳ app/services/packages/create_package_service.rb:14:in `find_or_create_package!'
  Packages::Package Load (0.7ms)  SELECT "packages_packages".* FROM "packages_packages" WHERE "packages_packages"."project_id" = 511 AND "packages_packages"."package_type" = 7 AND "packages_packages"."name" = 'as_experiment' AND "packages_packages"."version" = '1.3.7' LIMIT 1 
  ↳ app/services/packages/create_package_service.rb:14:in `find_or_create_package!'
  Namespace::PackageSetting Load (0.3ms)  SELECT "namespace_package_settings".* FROM "namespace_package_settings" WHERE "namespace_package_settings"."namespace_id" = 1 LIMIT 1 
  ↳ app/models/namespace.rb:245:in `package_settings'
  Packages::Blob Load (0.7ms)  SELECT "packages_blobs".* FROM "packages_blobs" WHERE "packages_blobs"."id" = 33 LIMIT 1 
  ↳ config/initializers/active_storage.rb:111:in `find_or_build_blob'
  Packages::PackageFile Create (0.5ms)  INSERT INTO "packages_package_files" ("package_id", "created_at", "updated_at", "size", "file_name", "file") VALUES (813, '2022-01-24 13:43:14.451751', '2022-01-24 13:43:14.451751', 0, 'file.txt', 'file.txt') RETURNING "id" 
  ↳ app/services/packages/create_package_file_service.rb:26:in `execute'
  Packages::PackageFile Update (0.4ms)  UPDATE "packages_package_files" SET "file_store" = 1 WHERE "packages_package_files"."id" = 1529 
  ↳ app/models/concerns/file_store_mounter.rb:17:in `update_file_store'
  Packages::Blob Load (0.5ms)  SELECT "packages_blobs".* FROM "packages_blobs" INNER JOIN "packages_attachments" ON "packages_blobs"."id" = "packages_attachments"."blob_id" WHERE "packages_attachments"."record_id" = 1529 AND "packages_attachments"."record_type" = 'Packages::PackageFile' AND "packages_attachments"."name" = 'active_storage_file' LIMIT 1 
  ↳ lib/gem_extensions/active_record/associations/has_one_through_association.rb:12:in `find_target'
  Packages::Attachment Load (0.4ms)  SELECT "packages_attachments".* FROM "packages_attachments" WHERE "packages_attachments"."record_id" = 1529 AND "packages_attachments"."record_type" = 'Packages::PackageFile' AND "packages_attachments"."name" = 'active_storage_file' LIMIT 1
  ↳ config/initializers/active_storage.rb:64:in `block in has_one_file_attached'
  Packages::Blob Update (0.4ms)  UPDATE "packages_blobs" SET "content_type" = 'application/octet-stream', "metadata" = '{"identified":true}' WHERE "packages_blobs"."id" = 33
  ↳ lib/gitlab/database.rb:265:in `block in transaction'
  Packages::Attachment Create (0.4ms)  INSERT INTO "packages_attachments" ("name", "record_type", "record_id", "blob_id", "created_at") VALUES ('active_storage_file', 'Packages::PackageFile', 1529, 33, '2022-01-24 13:43:14.487185') RETURNING "id"
  ↳ lib/gitlab/database.rb:265:in `block in transaction'
  Packages::PackageFile Update (0.6ms)  UPDATE "packages_package_files" SET "updated_at" = '2022-01-24 13:43:14.492606' WHERE "packages_package_files"."id" = 1529 
  ↳ lib/gitlab/database.rb:265:in `block in transaction'
  TRANSACTION (0.7ms)  COMMIT 
  ↳ lib/gitlab/database.rb:298:in `commit'
  TRANSACTION (0.3ms)  BEGIN 
  ↳ config/initializers/forbid_sidekiq_in_transactions.rb:58:in `block in committed!'
  Packages::Blob Update (0.4ms)  UPDATE "packages_blobs" SET "metadata" = '{"identified":true,"analyzed":true}' WHERE "packages_blobs"."id" = 33 
  ↳ config/initializers/forbid_sidekiq_in_transactions.rb:58:in `block in committed!'
  TRANSACTION (0.4ms)  COMMIT 
  ↳ lib/gitlab/database.rb:298:in `commit'

🔽 Download

$ curl -L --header "PRIVATE-TOKEN: <PAT>" "http://gdk.test:8000/api/v4/projects/511/packages/generic/as_experiment/1.3.7/file.txt" 
bananas!

Rails logs (truncated):

Started GET "/api/v4/projects/511/packages/generic/as_experiment/1.3.7/file.txt" for 172.16.123.1 at 2022-01-24 14:49:43 +0100

[snip snip]

  Packages::Package Load (0.6ms)  SELECT "packages_packages".* FROM "packages_packages" WHERE "packages_packages"."project_id" = 511 AND "packages_packages"."package_type" = 7 AND "packages_packages"."status" = 0 AND "packages_packages"."name" = 'as_experiment' AND "packages_packages"."version" = '1.3.7' LIMIT 1 
  ↳ app/models/packages/package.rb:192:in `by_name_and_version!'
  Packages::PackageFile Load (0.7ms)  SELECT "packages_package_files".* FROM "packages_package_files" WHERE "packages_package_files"."package_id" = 813 AND "packages_package_files"."file_name" = 'file.txt' ORDER BY "packages_package_files"."id" DESC LIMIT 1 
  ↳ app/finders/packages/package_file_finder.rb:16:in `execute!'
  
[snip snip]

  Packages::Blob Load (0.6ms)  SELECT "packages_blobs".* FROM "packages_blobs" INNER JOIN "packages_attachments" ON "packages_blobs"."id" = "packages_attachments"."blob_id" WHERE "packages_attachments"."record_id" = 1529 AND "packages_attachments"."record_type" = 'Packages::PackageFile' AND "packages_attachments"."name" = 'active_storage_file' LIMIT 1
  ↳ lib/gem_extensions/active_record/associations/has_one_through_association.rb:12:in `find_target'
  GitlabS3 Storage (1.8ms) Generated URL for file at key: b2/c7/b2c7c3913bc37ed0f643f40a3a154ff41f62d9df42bb7b81ff8a155cf66851f3/905nj0gf7begaa81ryrfoj9wsage (http://127.0.0.1:9000/packages/b2/c7/b2c7c3913bc37ed0f643f40a3a154ff41f62d9df42bb7b81ff8a155cf66851f3/905nj0gf7begaa81ryrfoj9wsage?response-content-disposition=attachment%3B%20filename%3D%22%22%3B%20filename%2A%3DUTF-8%27%27&response-content-type=application%2Foctet-stream&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=minio%2F20220124%2Fgdk%2Fs3%2Faws4_request&X-Amz-Date=20220124T134943Z&X-Amz-Expires=300&X-Amz-SignedHeaders=host&X-Amz-Signature=9967549d01d4cecc7544787785b64a5d883586383c47639965c05f4afa6d1b34)

💬 Discussion

💾 Object Storage interactions

Current situation:

  • Uploading
    • Workhorse: 2 (1 PUT, 1 DELETE)
    • Rails: 4 (1 HEAD, 1 PUT (copy), 1 HEAD, 1 DELETE)
  • Download (returning a redirect)
    • Workhorse: 0
    • Rails: 0

With ActiveStorage:

  • Uploading

    • Workhorse: 1 PUT
    • Rails: 0
  • Download (returning a redirect)

    • Workhorse: 0
    • Rails: 0
  • No changes for the download operation

  • 6 -> 1 requests for the upload operation.

    • In particular, there is no PUT (copy) request anymore! 🎉
  • Reducing the amount of interactions with object storage during the upload grants:

    • Faster handling.
    • Reduced overall cost as those requests are not free.

👍 Upsides

  • The finalize call doesn't need to be handled by a middleware. It is now simplified to parameters handling: reading the blob_signed_id parameter and act upon it.
  • The #attach api is really developer friendly. Example: pass a blob signed id to it and it will create the attachment for you if the object is persisted, otherwise it will record an attachment creation for when the business model is saved.
  • Easily compatible with our current Workhorse <-> Rails logic flows (authorize and finalize requests)
  • the _blobs table becomes the single source of truth for uploaded files:
    • if a file on object storage is not in this table, it shouldn't exist in object storage
    • the responsibility of cleaning up dangling blobs (blobs that have no attachments) falls in rails: Workhorse can avoid sending DELETE requests (=simplification).
    • The authorize call works with a reference (the blob signed id) = the finalize call doesn't need to pass all the information around the upload, passing the reference is enough = we can certainly not use a middleware anymore because passing the blob signed id is enough for rails to locate the upload during the finalize call.
  • The previous point has an impact on specs: we don't need to test against all the possible forms of uploaded files. Testing against: 1. a valid blob signed id and 2. an invalid or absent blob signed id is enough to cover all the situations.
    • Basically, developers can ignore how the blob was created and uploaded and fully focused on testing what happens when there is a valid and invalid blob signed id.
  • The concept of a blob signed id is really valuable. Rails can protect blob id. This way, if Workhorse or 👿 users mess up with the blob_signed_id, the finalize call will get rejected.
  • The approach of blobs and attachments table is compatible with a background migration that will move the existing uploads to those tables.
  • ActiveStorage is being actively worked on as part of the rails framework.
  • We can easily implement a "gitlab" version of the adapters (s3 service) to add additional options.
  • Moving an uploaded file or removing it becomes trivial: simply update or remove the row from the attachments table.
    • This creates dangling blobs but we can easily imagine having a cleanup worker that will take care of those.

👎 Downsides

  • We have a lot of overrides within ActiveStorage because of:
    • variants that we will never really use.
    • several references to ActiveStorage::Attachment and ActiveStorage::Blob.
  • ActiveStorage seems to be geared towards "media" files such as images and videos whereas the bulk of uploads in GitLab is simply archives or binaries objects (zip or tar.gz files)
    • There are several features are these "media" files such as analyzers and previewers that we don't really need for now. Disabling them for good means more overrides.
  • Those overrides feel fragile or breakable easily.
  • ActiveStorage macros has_one_attached and has_many_attached are still accessible.
  • The attachments table is using a polymorphic association which is not recommend by the database guidelines.

🎩 Conclusions

  1. Can we use ActiveStorage for GitLab uploads? That's a yes.
  2. If yes, what's the price? Several overrides. Some of them are quite deep within ActiveStorage.

I think that (2.) is a price to high to pay for the scale of GitLab. I don't think we can afford running ActiveStorage with several bandages around it to "bend" it GitLab needs. The result will not:

  • look professional
  • be reliable.
    • Eg. bumping ActiveStorage or Rails could break the overrides.

Having said that, the blobs + attachments approach is a really valuable and simple one. We should aim for that same simplicity.

I would thus suggest following #348959 with a custom implementation but take heavy inspirations from ActiveStorage:

  • blob signed ids
  • service/adapter approach to handle all the interactions with the different object storage backends (including FileStore).
  • re-use the attach API as it is a really powerful one.

If one day, ActiveStorage provides optional variants features and different blobs/attachments tables, we can revisit if that jump is valuable.

🔮 Other considerations

While working on this, I noticed other improvements we could implement:

  • Make the authorize response compatible with returning multiple blob signed ids. This way, workhorse could ask for two files at once (example: when handling CI artifacts uploads)
  • the object storage configuration is passed back to Workhorse in the authorize response. I've been wondering if we could pass the credentials too (in an encrypted manner). 🤔
    • This way, workhorse can use an official client to upload the file (just as it does with the consolidated configuration)
Edited by David Fernandez

Merge request reports