Object Storage: storing attachments without carrierwave

This is a proposal for the incremental removal of carrierwave from the product.

This is part of &7122 (closed)

Abstract

This proposal focuses on fixing some of the most painful aspects of the current implementation with an iterative approach. It will improve the product while keeping as much as possible of the current behavior as it is today.

  1. It will introduce an internal API call for authorizing uploads, using the same API format as the one we use with direct_upload.
  2. Instead of wrapping known endpoints with a custom authorization API, each request will call the internal endpoint for every uploaded file.
  3. Workhorse will no longer take care of removing files from object storage; cleanup will now be handled by a Sidekiq cron job.

This solution will solve the following problems:

  • Decouple ActiveModel and object storage API (no callbacks) to avoid incidents like gitlab-com/gl-infra/production#6042 (closed)
  • Offload data handling via Workhorse by default
  • Decouple object path from logical file path (no more temporary location + copy)
  • Backward compatibility with the current solution (requires opt-in for each model)
  • Support multiple uploads in a single request

Technical description

The solution is broken down into three sections:

  1. Incoming uploads and workhorse
  2. Async cleanup
  3. Database models

Incoming uploads and workhorse

The upload flow is designed to minimize changes to Workhorse, a part of the codebase where we have fewer experts, and to focus the work on the Rails side, where carrierwave is in use.

The following sequence diagram shows a generic incoming request that contains one or more files to upload.

sequenceDiagram
    participant c as Client
    participant w as Workhorse
    participant r as Rails
    participant os as Object Storage
    participant pg as Postgres

    note right of c: For every incoming request, regardless of its route

    activate c
    c ->>+w: POST /some/url/upload

    loop for every attachment 

    w ->>+r: POST /internal/upload/authorize
    Note right of r: the same endpoint for every request
    Note over w,r: This API call sends the original request path<br> and the current file name (if available)
    r->>r: Validate auth
    note right of r: knowing the original request path allows for using feature buckets,<br> everything else can be uploaded to the uploads bucket
    r->>pg: persist an object blob (path and other metadata)
    
    
    r-->>-w: upload parameters (i.e. a presigned OS URL)

    w->>+os: PUT file
    Note over w,os: file is stored in its final location,<br> according to the upload parameters provided by Rails
    os-->>-w: request result

    end

    w->>+r:  POST /some/url/upload
    Note over w,r: every file is replaced with its location<br>and other metadata

    r->>pg: link object blobs and the expected model

    r-->>-c: request result
    deactivate c

As we can see, there is no longer a need for a custom authorize API per endpoint: a single internal one can serve all requests. To do so, we pass the original request URL and the current file name as parameters, so that we can switch buckets on the Rails side.
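
A hedged sketch of what that single endpoint could look like follows. The controller, route, and helpers (`bucket_for`, `presigned_put_url`) are hypothetical names for illustration, and the response shape is only assumed to mirror the direct_upload authorize format mentioned above:

```ruby
# Hypothetical internal authorize endpoint, served for every upload,
# regardless of the original route.
class Internal::UploadsController < ActionController::Base
  def authorize
    # Workhorse forwards the original request path and, when known,
    # the current file name.
    original_path = params.require(:original_request_path)
    filename      = params[:filename].to_s

    # Knowing the original path lets us route known features to their
    # own bucket; everything else lands in the generic uploads bucket.
    bucket = bucket_for(original_path)

    # Persist the blob row immediately: this is the database reference
    # to the object, created before the upload even completes.
    blob = Blob.create!(store: bucket, path: File.join(SecureRandom.uuid, filename))

    # Assumed to follow the same format as the direct_upload authorize
    # response; field names here are placeholders.
    render json: { blob_id: blob.id, store_url: presigned_put_url(bucket, blob.path) }
  end
end
```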

Rails will record the requested upload in the DB by saving a blob entry. This will serve as our database reference to the object.

Workhorse will no longer delete uploaded files at the end of the request, as the upload destination is no longer temporary.
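
During the finalization call, Rails only has to link the already-uploaded blobs to the model. A minimal sketch, using a hypothetical `Note` model and a hypothetical `uploaded_files` parameter name:

```ruby
# Hypothetical finalization endpoint. By the time this runs, Workhorse
# has replaced each file in the request with the stored object's
# location and metadata.
def create
  note = Note.create!(params.require(:note).permit(:body))

  # One entry per uploaded file, e.g.
  # { blob_id: 42, filename: "screenshot.png" }
  Array(params[:uploaded_files]).each do |file|
    # Link the pre-existing blob to the new model via an attachment.
    note.attachments.create!(blob_id: file[:blob_id], filename: file[:filename])
  end

  render json: note
end
```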

Async cleanup

Not all uploads are successful. Because of the temporary nature of the direct_upload path, Workhorse used to delete uploads at the end of each request; it was Rails, during the finalization call, that copied each incoming file to its final destination.

A new Sidekiq cron job will take care of pruning object storage. Each failed upload will leave behind a blob row that is not linked to any attachment; we can periodically search for those blob entries and delete them from object storage (and from the DB). A sketch of such a job follows the diagram below.

sequenceDiagram
    participant r as Sidekiq
    participant pg as Postgres
    participant os as Object Storage

    note right of r: Sidekiq cron job
    activate r
    r->>+pg: select object blobs not linked to a model,<br> older than 1 day
    pg-->>-r: objects

    loop for every object blob 
        r->>+os: DELETE object
        os-->>-r: done    
    end

    r->>pg: delete object blobs
    deactivate r
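
A hedged sketch of the pruning job. The worker name and scheduling are hypothetical (e.g. wired up through sidekiq-cron), and `object_storage_client` stands in for whatever client we end up using:

```ruby
# Hypothetical cron worker that garbage collects orphaned blobs.
class OrphanBlobCleanupWorker
  include Sidekiq::Worker

  def perform
    # Blobs older than one day with no attachment are failed uploads.
    Blob.where.missing(:attachments)
        .where(created_at: ...1.day.ago)
        .find_each do |blob|
      # Delete the object first, then the database row: a crash between
      # the two steps leaves only a blob row the next run can retry.
      object_storage_client.delete_object(bucket: blob.store, key: blob.path)
      blob.destroy!
    end
  end
end
```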

Bonus point: we can now delete attachments from the database when an object is no longer needed; the blob will be correctly garbage collected on the next run of this cron job. This effectively decouples database transactions from object storage API calls.
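
For example (with the hypothetical models from the next section), deleting a file becomes a plain database operation:

```ruby
# No object storage call happens inside the transaction; the orphaned
# blob is swept up by the next run of the cron job above.
Attachment.transaction do
  attachment.destroy!
end
```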

Database models

The database part is designed to detach the user-facing path of the file from its location on object storage.

  • a blob represents an object in object storage; it is never moved.
  • an attachment is the logical representation of the file; it links a blob with any other model (polymorphic association). The user-facing path is stored here.
  • each model can contain 0 or more attachments.
  • an attachment is composed of exactly one blob.
  • a blob can be linked to 0 or more attachments.
  • a blob is created by the internal authorize API and can only be deleted by the async cleanup cron job.
  • a blob is updated during the finalization API call, where Workhorse provides metadata like the size, sha512, etc.
  • an attachment is created by the finalization API call; it can be deleted like any other model.

erDiagram
    A_MODEL ||--o{ ATTACHMENT : contains
    ATTACHMENT {
        string model_type
        int model_id
        string filename
    }

    ATTACHMENT }o--|| BLOB: is
    BLOB {
        int store
        string path
        datetime created_at
        int size
        string sha512
        string sha256
        string md5
    }
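
As ActiveRecord classes, the two models could look like the following sketch; table and association names are assumptions based on the diagram above:

```ruby
class Attachment < ApplicationRecord
  # Polymorphic link to any model (model_type / model_id columns);
  # the user-facing filename lives here.
  belongs_to :model, polymorphic: true
  # Exactly one blob per attachment.
  belongs_to :blob
end

class Blob < ApplicationRecord
  # A blob backs zero or more attachments; blobs with none are
  # garbage collected by the async cleanup job.
  has_many :attachments
end
```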

Note: it may be worth exploring the use of Active Storage for the database part. If it matches our needs, it may reduce the amount of code we have to maintain.