
Draft: Proof of concept: RemoteBlob, variant 2

Jacob Vosmaer requested to merge jv-remote-blobs-2 into master

This is a proof of concept for how we may improve object storage in GitLab.

The new storage engine, called RemoteBlob, is designed so that object storage blobs have a stable path (key) throughout their lifecycle. Renames and deletions are implemented via a SQL table that translates the paths used by the application into the stable paths in object storage.
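To make the translation-table idea concrete, here is a minimal sketch in Python with SQLite. It is not the Rails/PostgreSQL code in this MR. The table name `remote_blobs` and the notions of `path`, `stable_path`, and a stored size come from this description; the exact schema, the helper function names, the random key format, and modelling deletion as setting `path` to NULL are all assumptions made for illustration.

```python
import sqlite3
import uuid


def connect():
    """Create an in-memory database with a hypothetical remote_blobs schema."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """
        CREATE TABLE remote_blobs (
            id          INTEGER PRIMARY KEY,
            path        TEXT UNIQUE,           -- path seen by the application; NULL once deleted (assumption)
            stable_path TEXT NOT NULL UNIQUE,  -- object storage key; never changes
            size        INTEGER NOT NULL
        )
        """
    )
    return db


def create_blob(db, path, size):
    """Register a new blob under a freshly generated stable key (key format is an assumption)."""
    stable_path = f"blobs/{uuid.uuid4().hex}"
    db.execute(
        "INSERT INTO remote_blobs (path, stable_path, size) VALUES (?, ?, ?)",
        (path, stable_path, size),
    )
    return stable_path


def resolve(db, path):
    """Translate an application path to its object storage key, or None if unknown."""
    row = db.execute(
        "SELECT stable_path FROM remote_blobs WHERE path = ?", (path,)
    ).fetchone()
    return row[0] if row else None


def rename(db, old_path, new_path):
    """Rename is a SQL update only; the object in storage is not copied or moved."""
    db.execute("UPDATE remote_blobs SET path = ? WHERE path = ?", (new_path, old_path))


def delete(db, path):
    """Delete is a SQL update only; the row stays so the object can be cleaned up later (assumption)."""
    db.execute("UPDATE remote_blobs SET path = NULL WHERE path = ?", (path,))
```

In this sketch, finalizing a direct upload would amount to calling `rename` from a temporary path to the final path: the stable key returned by `create_blob` stays the same, so no copy in object storage is needed.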

Advantages

  • No more copy operations when finalizing a direct upload
  • Deleting an object only requires a SQL update, not an HTTP call to object storage
  • Existing blobs uploaded using CarrierWave::Storage::Fog can be back-filled into the new storage engine without having to copy or transfer blob objects
  • Because GitLab already supports varying the storage engine per upload, we can choose whether and when to "migrate" existing data to the new storage engine. Even if we do not migrate old data, the application will continue to handle mixed old and new data correctly

Disadvantages

  • We introduce a big SQL table remote_blobs that will contain one row per stored blob
  • Adding a new storage engine does not reduce the existing complexity of our monkey-patched version of CarrierWave

Open questions

  • How do we correctly integrate this with Geo in the case where Geo sites have their own local object storage buckets?
  • Do we want to store attributes other than the path in remote_blobs? For convenience, this MR stores blob sizes in SQL but we could omit this at the cost of extra HEAD requests to the storage backend. Conversely, we could cache additional object attributes in SQL.

Comparison to MR 91707

This is a design iteration on !91707 (closed). What changed in this iteration is that the object storage paths ("stable paths") are now plain strings. This makes it possible to back-fill existing blobs by using their paths as both the stable_path and path.
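The back-fill described above can be sketched as follows, again in Python with SQLite rather than the actual GitLab code. The schema, the function names, and the shape of the `existing_objects` input (pairs of object storage key and size) are assumptions for illustration; the only part taken from this description is that an existing blob's current path is used as both `stable_path` and `path`.

```python
import sqlite3


def make_table(db):
    """Create a hypothetical remote_blobs table (schema is an assumption)."""
    db.execute(
        """
        CREATE TABLE IF NOT EXISTS remote_blobs (
            id          INTEGER PRIMARY KEY,
            path        TEXT UNIQUE,
            stable_path TEXT NOT NULL UNIQUE,
            size        INTEGER NOT NULL
        )
        """
    )


def backfill(db, existing_objects):
    """Register existing blobs without touching object storage.

    existing_objects is an iterable of (object_storage_key, size) pairs.
    Each key is written as both path and stable_path, so no object is
    copied or transferred. INSERT OR IGNORE makes a re-run harmless.
    """
    db.executemany(
        "INSERT OR IGNORE INTO remote_blobs (path, stable_path, size) VALUES (?, ?, ?)",
        [(key, key, size) for key, size in existing_objects],
    )
```

Because plain-string stable paths accept any existing key as-is, the back-fill is a pure SQL operation in this sketch.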

Edited by Jacob Vosmaer
