Content-Addressable Storage with Deduplication (#21037) · Epics · GitLab.org

Content-Addressable Storage with Deduplication

**Problem:** Without content-addressable storage, identical artifacts are stored multiple times, inflating costs. The current container registry uses instance-wide deduplication, which creates expensive cross-partition GC operations. The current package registry has no deduplication at all.

**Scope:** Implement the blob storage layer with organization-scoped content addressing and deduplication, following the hashed storage layout defined in the blueprint.

**Key deliverables:**

- `blob_storage_blobs` and `blob_storage_attachments` tables with organization-scoped unique constraints on SHA256/SHA1
- Hashed storage path layout: `artifacts/{org_hash_shard}/{org_hash}/objects/{obj_hash_shard}/{obj_hash}`
- Two-phase upload flow (temporary upload → content-addressed final location)
- Deduplication check on publish: verify digest exists within org before creating new blob
- Reference counting via `blob_storage_attachments` for safe garbage collection
- Dedicated `artifact_registry` object storage bucket configuration
- Garbage collection worker: org-scoped, processes orphaned blobs after all attachments removed
- Backend support for GCS, S3, and Azure Blob Storage

**Acceptance criteria:**

- Uploading identical content within an org results in a single physical blob with multiple attachments
- Uploading identical content across orgs results in separate physical blobs (org isolation)
- GC worker correctly identifies and removes orphaned blobs without impacting referenced content
- Storage path structure matches the blueprint's hashed layout
- Move/rename operations validated against all supported backends

**Dependencies:** Object storage configuration, backend-specific adapter evaluation
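The deliverables and acceptance criteria above can be illustrated with a small in-memory sketch. It is not the planned implementation: `BlobStore`, `object_path`, `publish`, `detach`, and `collect_garbage` are hypothetical names, plain dictionaries stand in for the `blob_storage_blobs` and `blob_storage_attachments` tables and for the object storage bucket, and the temporary upload path is made up. Deriving `org_hash` as the SHA-256 of the organization ID and using the first two hex characters as the shard are assumptions; the blueprint defines the actual layout.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


def object_path(org_id: int, digest: str) -> str:
    """Build artifacts/{org_hash_shard}/{org_hash}/objects/{obj_hash_shard}/{obj_hash}."""
    # Assumption: org_hash = SHA-256 of the org ID; shard = first two hex chars.
    org_hash = hashlib.sha256(str(org_id).encode()).hexdigest()
    return f"artifacts/{org_hash[:2]}/{org_hash}/objects/{digest[:2]}/{digest}"


@dataclass
class BlobStore:
    # path -> content; stands in for the object storage bucket
    objects: dict[str, bytes] = field(default_factory=dict)
    # (org_id, sha256) -> final path; stands in for blob_storage_blobs
    blobs: dict[tuple[int, str], str] = field(default_factory=dict)
    # (org_id, sha256) -> owners; stands in for blob_storage_attachments
    attachments: dict[tuple[int, str], set[str]] = field(default_factory=dict)

    def publish(self, org_id: int, owner: str, content: bytes) -> str:
        # Phase 1: write to a temporary location (hypothetical path).
        tmp_path = f"tmp/uploads/{org_id}/{owner}"
        self.objects[tmp_path] = content
        digest = hashlib.sha256(content).hexdigest()
        key = (org_id, digest)

        # Deduplication check, scoped to the organization.
        if key in self.blobs:
            del self.objects[tmp_path]  # duplicate: discard the temporary upload
        else:
            # Phase 2: move to the content-addressed final location.
            final_path = object_path(org_id, digest)
            self.objects[final_path] = self.objects.pop(tmp_path)
            self.blobs[key] = final_path

        # Each publisher gets its own attachment (reference) to the blob.
        self.attachments.setdefault(key, set()).add(owner)
        return self.blobs[key]

    def detach(self, org_id: int, owner: str, digest: str) -> None:
        self.attachments.get((org_id, digest), set()).discard(owner)

    def collect_garbage(self, org_id: int) -> list[str]:
        """Remove blobs in one org that no longer have any attachment."""
        removed = []
        for key in [k for k in self.blobs if k[0] == org_id]:
            if not self.attachments.get(key):
                path = self.blobs.pop(key)
                self.objects.pop(path, None)
                self.attachments.pop(key, None)
                removed.append(path)
        return removed
```

In this sketch, publishing the same bytes twice within one organization returns the same path and stores one object, while publishing them in a second organization produces a separate object under a different `org_hash` prefix. A blob is only removed by `collect_garbage` once every attachment to it has been detached.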
