# Object Storage: storing attachments without carrierwave
This is a proposal for the incremental removal of carrierwave from the product.
This is part of &7122 (closed)
## Abstract
This proposal focuses on fixing some of the most painful aspects of the current implementation with an iterative approach. It will improve the product while keeping as much as possible as it is today.
- It will introduce an internal API call for authorizing uploads, using the same API format as the one we use with `direct_upload`.
- Instead of wrapping each known endpoint with a custom authorization API, Workhorse will call a single internal endpoint for every uploaded file.
- Workhorse will no longer take care of removing files from object storage; that will be handled by a Sidekiq cron job.
This solution will:
- Decouple ActiveModel and object storage API (no callbacks) to avoid incidents like gitlab-com/gl-infra/production#6042 (closed)
- Offload data handling via Workhorse by default
- Decouple object path from logical file path (no more temporary location + copy)
- Maintain backward compatibility with the current solution (requires opt-in for each model)
- Support multiple uploads in a single request
## Technical description
The solution is broken down into three sections:
- Incoming uploads and Workhorse
- Async cleanup
- Database models
### Incoming uploads and Workhorse
The upload flow is designed to minimize changes to Workhorse, a section of the codebase where we have fewer experts, and to focus the work on the Rails side, where carrierwave is in use.

The following sequence diagram shows a generic incoming request that contains one or more files to upload.
```mermaid
sequenceDiagram
    participant c as Client
    participant w as Workhorse
    participant r as Rails
    participant os as Object Storage
    participant pg as Postgres

    note right of c: For every incoming request, regardless of its route
    activate c
    c->>+w: POST /some/url/upload
    loop for every attachment
        w->>+r: POST /internal/upload/authorize
        Note right of r: the same endpoint for every request
        Note over w,r: This API call sends the original request path<br> and the current file name (if available)
        r->>r: Validate auth
        note right of r: knowing the original request path allows for using feature buckets,<br> everything else can be uploaded to the uploads bucket
        r->>pg: persist an object blob (path and other metadata)
        r-->>-w: upload parameters (i.e. a presigned OS URL)
        w->>+os: PUT file
        Note over w,os: file is stored in its final location,<br> according to the upload parameters provided by Rails
        os-->>-w: request result
    end
    w->>+r: POST /some/url/upload
    Note over w,r: every file is replaced with its location<br>and other metadata
    r->>pg: link object blobs and the expected model
    r-->>-c: request result
    deactivate c
```
As the diagram shows, there is no longer a need for a custom authorize API per endpoint; a single internal one can serve every request. To do so, we pass the original request URL and the current file name as parameters, so that we can switch buckets on the Rails side.

Rails will record the requested upload in the DB by saving a blob entry. This will serve as our database reference to the object.

Workhorse will no longer delete uploaded files at the end of the request, as the upload destination is no longer temporary.
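To make the flow concrete, here is a minimal sketch of what the internal authorize endpoint could look like on the Rails side. Everything here is an assumption for illustration: the controller, `Blob`, `bucket_for`, and `generate_object_path` are hypothetical names, and the response shape only loosely mirrors the existing `direct_upload` authorization format.

```ruby
# Hypothetical sketch: none of these classes or helpers exist yet.
class Internal::UploadAuthorizeController < ApplicationController
  def create
    # Workhorse forwards the original request path and, when known,
    # the client-supplied file name.
    original_path = params.require(:original_request_path)
    filename      = params[:filename] # unused here; could drive Content-Disposition

    # Feature-specific routes can map to dedicated buckets; everything
    # else goes to the general uploads bucket.
    store = bucket_for(original_path)

    # Persist the blob first: even if the upload later fails, the row
    # exists and the async cleanup job can prune the orphan.
    blob = Blob.create!(store: store, path: generate_object_path)

    # Reply with the upload parameters Workhorse needs for the PUT.
    render json: {
      RemoteObject: {
        ID: blob.id.to_s,
        StoreURL: presigned_put_url(bucket_name(store), blob.path),
        Timeout: 1.hour.to_i
      }
    }
  end

  private

  # Illustrative routing; real rules would live in configuration.
  def bucket_for(path)
    path.start_with?('/api/v4/projects/') ? :artifacts : :uploads
  end

  def bucket_name(store)
    { uploads: 'gitlab-uploads', artifacts: 'gitlab-artifacts' }.fetch(store)
  end

  # Random key: the object is never moved, so its storage path must not
  # depend on the user-facing file name.
  def generate_object_path
    SecureRandom.hex(32)
  end

  # Assumes an S3-compatible backend via aws-sdk-s3; any client with
  # presigned PUT support would do.
  def presigned_put_url(bucket, key)
    Aws::S3::Presigner.new.presigned_url(
      :put_object, bucket: bucket, key: key, expires_in: 3600
    )
  end
end
```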
### Async cleanup
Not all uploads are successful. Because of the temporary nature of the `direct_upload` path, Workhorse used to delete uploads at the end of each request; it was Rails, during the finalization call, that copied each incoming file to its final destination.

A new Sidekiq cron job will take care of pruning object storage. Each failed upload will leave a blob row that is not linked to any attachment; we can periodically search for those blob entries and delete them from object storage (and from the DB).
```mermaid
sequenceDiagram
    participant r as Sidekiq
    participant pg as Postgres
    participant os as Object Storage

    note right of r: Sidekiq cron job
    activate r
    r->>+pg: select object blobs not linked to a model,<br> older than 1 day
    pg-->>-r: objects
    loop for every object blob
        r->>+os: DELETE object
        os-->>-r: done
    end
    r->>pg: delete object blobs
    deactivate r
```
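A sketch of that job, assuming Sidekiq with sidekiq-cron for scheduling and the `Blob`/`Attachment` models sketched in the next section; all names are illustrative.

```ruby
# Hypothetical pruning job, scheduled (e.g. hourly) via sidekiq-cron.
class OrphanedBlobCleanupWorker
  include Sidekiq::Worker

  BUCKETS = { 'uploads' => 'gitlab-uploads', 'artifacts' => 'gitlab-artifacts' }.freeze

  def perform
    s3 = Aws::S3::Client.new # assumes an S3-compatible backend

    # A blob older than one day with no attachment is either a failed
    # upload or the leftover of a deleted attachment.
    Blob.where('created_at < ?', 1.day.ago)
        .where.missing(:attachments)
        .find_each do |blob|
      # Delete from object storage first: if this raises, the blob row
      # survives and the next run retries the deletion.
      s3.delete_object(bucket: BUCKETS.fetch(blob.store), key: blob.path)
      blob.destroy!
    end
  end
end
```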
Bonus point: when an object is no longer needed, we can simply delete its attachments from the database; the object will be garbage collected on the next run of this cron job. This effectively decouples database transactions from object storage API calls.
### Database models
The database part is designed to detach the user-facing path of the file from its location on object storage.
- a `blob` represents an object; it's never moved.
- an `attachment` is the logical representation of the file; it links a `blob` with any other model (polymorphic association). The user-facing path is stored here.
- each model can contain 0 or more `attachments`
- an `attachment` is composed of exactly one `blob`
- a `blob` can be linked to 0 or more `attachments`
- a `blob` is created in the internal authorize API and can only be deleted by the async cleanup cron job
- a `blob` will be updated during the finalization API call, where Workhorse provides metadata like the size, sha512, etc.
- an `attachment` is generated by the finalization API call; it can be deleted like any other model
```mermaid
erDiagram
    A_MODEL ||--o{ ATTACHMENT : contains
    ATTACHMENT {
        string model_type
        int model_id
        string filename
    }
    ATTACHMENT }o--|| BLOB : is
    BLOB {
        int store
        string path
        datetime created_at
        int size
        string sha512
        string sha256
        string md5
    }
```
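A minimal ActiveRecord sketch of these rules, matching the ER diagram; the class names and enum values are assumptions.

```ruby
# Model sketch matching the ER diagram above; enum values are illustrative.
class Blob < ApplicationRecord
  # `store` is the integer column from the diagram, mapped to a bucket.
  enum store: { uploads: 1, artifacts: 2 }

  has_many :attachments

  # `store` and `path` are set by the authorize call; size, sha512,
  # sha256 and md5 are filled in at finalization and never change after.
end

class Attachment < ApplicationRecord
  # Polymorphic link (model_type/model_id) to whatever owns the file;
  # the user-facing `filename` lives here, decoupled from `blob.path`.
  belongs_to :model, polymorphic: true
  belongs_to :blob
end
```

Deleting a file then becomes a plain database write (`attachment.destroy!`): no object storage API call happens inside the transaction, and the orphaned blob is reclaimed by the cleanup job above.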
Note: it may be worth exploring the use of Active Storage for the database layer. If it matches our needs, it could reduce the amount of code we have to maintain.
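For comparison, Active Storage already ships the same blob/attachment split out of the box: an `active_storage_blobs` table for the objects and a polymorphic `active_storage_attachments` join table. The `avatar` attachment below is an arbitrary example.

```ruby
# Standard Active Storage usage; `avatar` is an arbitrary example.
class User < ApplicationRecord
  has_one_attached :avatar
end

user = User.create!
user.avatar.attach(io: File.open('avatar.png'), filename: 'avatar.png')
user.avatar.blob.key      # storage key, decoupled from the filename
user.avatar.blob.checksum # metadata recorded on the blob row
```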