Investigation: Consider Using Object Lifecycle Management Instead of Native Upload Purging for GitLab.com

Context

Upload Purging

Problem

Upload purging as-is consumes a lot of memory and scales poorly. Fixing both issues would require significant development work, but the longer we delay, the more outstanding uploads accrue on GitLab.com. This increases storage costs and leaves any eventual application-based solution with a larger backlog to process.

Solution

Object Lifecycle Management rules could be used to target and delete the objects that upload purging would otherwise remove, for GitLab.com only. This approach should require much less work than altering application code. It should deliver most, if not all, of the benefits of upload purging (reduced costs) and prevent a large backlog of old objects from building up before an in-app solution can be developed.

Upload Purging Structure

All upload objects are located under an _uploads prefix within repository-scoped paths.

Structure
<root>/v2
  -> repositories/
    -> <name>/
      -> _uploads/<id>
        data
        startedat
        hashstates/<algorithm>/<offset>

Example
└── repositories
    ├── repo-path
    │   └── _uploads
    │       ├── 60d6fc08-f969-4cd4-b9ee-2d7c749c9b67
    │       │   ├── data
    │       │   ├── hashstates
    │       │   │   └── sha256
    │       │   │       └── 0
    │       │   └── startedat
    │       └── e77ebe08-ac42-4197-ab57-41b8d0920509
    │           ├── data
    │           ├── hashstates
    │           │   └── sha256
    │           │       └── 0
    │           └── startedat

Upload purging reads the startedat file to determine the date of the upload. If the upload is older than the configured age parameter, the ID directory and its contents are deleted.
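The age check described above can be sketched as follows. This is a minimal illustration, not the registry's actual implementation: the function name uploads_to_purge, the 7-day age, and the timestamps are hypothetical, and reading and parsing the startedat files is left out.

```python
from datetime import datetime, timedelta, timezone


def uploads_to_purge(started_at, max_age, now):
    """Return the upload IDs whose recorded start time is older than max_age.

    started_at maps an upload ID to the timestamp taken from its
    startedat file (parsing the file is outside this sketch).
    """
    cutoff = now - max_age
    return sorted(uid for uid, ts in started_at.items() if ts < cutoff)


# Hypothetical timestamps for the two upload IDs shown in the example tree.
now = datetime(2024, 1, 15, tzinfo=timezone.utc)
started_at = {
    "60d6fc08-f969-4cd4-b9ee-2d7c749c9b67": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "e77ebe08-ac42-4197-ab57-41b8d0920509": datetime(2024, 1, 14, tzinfo=timezone.utc),
}

# Only the first upload is older than the assumed 7-day age parameter.
stale = uploads_to_purge(started_at, timedelta(days=7), now)
print(stale)
```

Each ID returned here corresponds to an _uploads/<id> directory that would be deleted together with its data, startedat, and hashstates contents.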

Given this, we should be able to replicate this behavior with a lifecycle rule that applies the delete action to objects matching an appropriate age and prefix (<root>/v2/repositories/). Because the prefix condition matches the entire prefix and does not support regular expressions, this rule will technically apply to all metadata objects. We do not use any other metadata on .com, but this will prevent us from being able to use that metadata in the future for existing or new purposes.
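To make the prefix limitation concrete, the sketch below builds a delete rule and checks which object names it would match. It assumes a Google Cloud Storage style rule with age and matchesPrefix conditions. The root value, the 30-day age, the helper names, and the sample object names are all hypothetical.

```python
import json

ROOT = "registry-root"  # hypothetical stand-in for <root>
PREFIX = f"{ROOT}/v2/repositories/"


def build_delete_rule(prefix, age_days):
    """Build one lifecycle rule, assuming a GCS-style rule schema."""
    return {
        "action": {"type": "Delete"},
        "condition": {"age": age_days, "matchesPrefix": [prefix]},
    }


def prefix_matches(rule, object_name):
    """Plain string-prefix match; no regular expressions are involved."""
    prefixes = rule["condition"]["matchesPrefix"]
    return any(object_name.startswith(p) for p in prefixes)


rule = build_delete_rule(PREFIX, age_days=30)  # 30 days is an assumed value

# Illustrative object names: an upload object, another repository
# metadata object, and an object outside the repositories prefix.
objects = [
    f"{ROOT}/v2/repositories/repo-path/_uploads/60d6fc08-f969-4cd4-b9ee-2d7c749c9b67/data",
    f"{ROOT}/v2/repositories/repo-path/_manifests/tags/latest/current/link",
    f"{ROOT}/v2/blobs/sha256/ab/abcd/data",
]
matched = [prefix_matches(rule, name) for name in objects]

print(json.dumps({"lifecycle": {"rule": [rule]}}, indent=2))
print(matched)
```

The second object is not an upload, yet it matches the rule, which is why the rule effectively claims the whole repositories prefix. Under GCS semantics the age condition is also evaluated per object against its creation time rather than against the contents of startedat, so the rule approximates the purge logic rather than reproducing it exactly.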

Since uploads are stored in repository-scoped directories, we can also roll this out incrementally, to a limited extent, by starting with a few test repositories. This doesn't scale well, and we can't do anything as sophisticated as a percentage-based rollout, but it's more control than we normally have with the registry's object storage.
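A limited rollout could be expressed as one prefix per test repository, roughly as sketched below. The root value and repository names are hypothetical, and the helper upload_prefixes is only illustrative.

```python
ROOT = "registry-root"  # hypothetical stand-in for <root>


def upload_prefixes(root, repositories):
    """Return one _uploads prefix per repository selected for the rollout."""
    return [f"{root}/v2/repositories/{name}/_uploads/" for name in repositories]


# Hypothetical test repositories chosen for the initial rollout.
test_repos = ["group/project-a", "group/project-b"]
prefixes = upload_prefixes(ROOT, test_repos)

for prefix in prefixes:
    print(prefix)
```

Because each prefix names a single repository, it can extend down to _uploads/ and match only upload objects for that repository. The cost is one prefix per repository, which is why this approach only suits a small test set.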

Advantages

  • Enables cleanup of outstanding upload data on GitLab.com
  • Low effort

Drawbacks

  • Less visibility than an equivalent application solution would have.
  • Another process would alter registry data without directly communicating with the registry. This is mitigated somewhat by the fact that these upload files are intended to be temporary.
  • If application logic around uploads changes, we'll have to remember that the object lifecycle rules are in effect.
  • Only solves the issue for GitLab.com
  • We need to match the entire metadata prefix, which means we cannot turn metadata mirroring back on. We have no plans to do this, but it's worth mentioning.
Edited by Hayley Swimelar