Commit 69b283ef authored by João Pereira, committed by Steve Xuereb

Architecture: Artifact Registry

parent 1c6dc017

+111 −0
---
title: "Artifact Registry ADR 001: Organizations as Anchor Point"
owning-stage: "~devops::package"
description: "Decision to anchor the Artifact Registry to Organizations"
toc_hide: true
---

<!-- Design Documents often contain forward-looking statements -->
<!-- vale gitlab.FutureTense = NO -->

## Context

The Artifact Registry requires a GitLab entity to serve as its anchor point - the primary boundary for registry instances, storage, cost attribution, and access control. This foundational decision affects every other architectural choice in the system.

Two candidates exist:

1. **[Organizations](https://docs.gitlab.com/user/organization/)**: A new GitLab entity designed as the top-level container for groups and projects, intended to replace top-level groups as the primary organizational boundary
2. **Top-level groups**: The highest-level group entity currently available in GitLab, serving as the root of the group hierarchy

Organizations align closely with the enterprise use case: a single entity representing an entire company or business unit, containing all its groups, projects, and now artifacts.

### Organizations Availability

Organizations are technically available but not yet GA. After consultation with the [Organizations team](/handbook/engineering/infrastructure-platforms/tenant-scale/organizations/), we have updated timeline information:

1. **Availability**: Organizations are technically available across GitLab installation types
2. **GA timeline**: Organizations are expected to launch within a year with a customer opt-in approach
3. **Strategic direction**: Organizations represent GitLab's long-term direction for enterprise organizational boundaries

Building for Organizations from the start provides significant advantages:

1. **No future migration burden**: Customers won't need to restructure from top-level groups to Organizations later
2. **Alignment with GitLab direction**: The Artifact Registry will be ready as Organizations adoption grows
3. **Conceptual correctness**: Organizations represent the natural boundary for enterprise artifact management

## Decision

**The Artifact Registry will anchor exclusively to Organizations.**

The feature will only be available at the Organization level - not at both Organization and top-level group levels. This avoids the complexity of maintaining dual implementations and ensures a single, clear architectural path.

This means:

- All repositories, artifacts, lifecycle policies, and access controls belong to an Organization
- Storage and cost attribution are calculated at the Organization boundary
- Deduplication is scoped to the Organization (see [ADR-002](002_storage_deduplication_scope.md))
- Organizations serve as the isolation and sharding boundary

## Consequences

### Positive

1. **Aligned with GitLab's target organizational model**: Building for Organizations from day one means no architectural pivots later
2. **No future migration burden**: Customers adopt the target architecture immediately, avoiding disruptive migrations
3. **Conceptually correct anchor point**: Organizations represent entire companies or business units - the natural boundary for enterprise artifact management
4. **Clean enterprise mapping**: One Organization equals one artifact registry instance, simplifying customer mental models
5. **Forward-compatible with Cells architecture**: Organizations are the intended sharding boundary for Cells, ensuring alignment
6. **Improved storage deduplication**: Organizations sit above top-level groups in the hierarchy. Identical blobs across multiple top-level groups within the same Organization are deduplicated, providing greater storage efficiency compared to top-level-group-scoped deduplication

### Negative

1. **Customer opt-in required**: During the Organizations rollout period, customers must explicitly enable Organizations
2. **Dependency on Organizations timeline**: Registry availability depends on Organizations readiness across installation types

## Implications by Installation Type

### GitLab.com (SaaS)

Customers create or use an Organization for their artifact management. This aligns with GitLab.com's direction toward Organizations as the primary organizational boundary.

### Self-Managed and Dedicated

Organizations will be available across all installation types. Customers enable Organizations to use the Artifact Registry. This provides a consistent experience across all GitLab deployment models.

## Blob Storage Deduplication

Anchoring to Organizations has direct implications for storage deduplication (see [ADR-002](002_storage_deduplication_scope.md) for detailed deduplication design).

Organizations sit above top-level groups in the GitLab hierarchy. By scoping deduplication to Organizations rather than top-level groups:

- **Broader deduplication scope**: Identical blobs across multiple top-level groups within the same Organization are stored only once
- **Greater storage efficiency**: Common artifacts (base images, shared libraries, public packages) used across different top-level groups benefit from deduplication
- **Improved cost attribution**: Storage costs are calculated at the Organization level, providing clearer billing boundaries
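
To make the difference between the two scopes concrete, here is a minimal sketch in Python. The `Upload` record, the `stored_bytes` helper, and the sample names and sizes are hypothetical and exist only for illustration; they are not part of the Artifact Registry design.

```python
from typing import NamedTuple


class Upload(NamedTuple):
    organization: str
    top_level_group: str
    digest: str  # content hash of the blob
    size: int    # blob size in bytes


def stored_bytes(uploads: list[Upload], scope: str) -> int:
    """Total bytes stored when identical digests are kept once per scope."""
    unique: dict[tuple[str, str], int] = {}
    for upload in uploads:
        boundary = getattr(upload, scope)
        # The first upload of a digest inside a boundary stores the blob;
        # later uploads inside the same boundary reuse it.
        unique.setdefault((boundary, upload.digest), upload.size)
    return sum(unique.values())


uploads = [
    Upload("acme", "acme-platform", "sha256:base-image", 100),
    Upload("acme", "acme-apps", "sha256:base-image", 100),
    Upload("acme", "acme-apps", "sha256:app-layer", 40),
]

by_group = stored_bytes(uploads, "top_level_group")
by_organization = stored_bytes(uploads, "organization")
```

In this toy data set the base image is used by two top-level groups of the same Organization, so it is counted twice under top-level group scope and once under Organization scope.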

## Alternatives Considered

### Alternative: Top-Level Groups

#### Approach

Anchor the registry to top-level groups - the highest-level group entity currently available - with a plan to migrate to Organizations once they become stable and widely adopted.

**Note**: Top-level groups were considered during early planning, when the Organizations timeline was less certain. The updated Organizations timeline and customer opt-in approach make building for the target architecture viable.

#### Why Not Chosen

1. **Future migration burden**: Customers would need to migrate from top-level groups to Organizations later, requiring careful planning and execution
2. **Conceptual mismatch**: Top-level groups are hierarchical containers, while the artifact registry conceptually serves an entire organization (which may span multiple top-level groups)
3. **Organizations timeline alignment**: Organizations are expected to launch within the Artifact Registry project timeline, making an interim approach unnecessary
4. **Avoiding precedent pitfalls**: Other features that chose top-level groups (Security Dashboard, Compliance Center, Value Stream Analytics) now face migration considerations; building for Organizations from the start avoids this pattern
5. **Reduced deduplication efficiency**: Top-level group scoped deduplication misses opportunities to deduplicate identical blobs across top-level groups within the same Organization
6. **Complex blob migration**: Migrating from top-level groups to Organizations would require consolidating blobs across top-level groups, adding significant migration complexity
7. **Consolidation not feasible**: Not all customers with multiple top-level groups (for example, GitLab itself uses `gitlab-org`, `gitlab-com`, etc.) can consolidate into a single group. Creating a separate top-level group exclusively for artifacts was considered but rejected due to the complexity of maintaining synchronized permissions across groups.

## References

- [Organizations Development Documentation](https://docs.gitlab.com/ee/development/organization/)
- [Organizations Team Handbook](/handbook/engineering/infrastructure-platforms/tenant-scale/organizations/)
- [Cells Design Document](../../cells/)
- [ADR-002: Storage Deduplication Scope](002_storage_deduplication_scope.md) - Deduplication boundary
<!-- - [ADR-007: Database Schema](007_database_schema.md) -->
+174 −0
---
title: "Artifact Registry ADR 002: Storage Deduplication Scope"
owning-stage: "~devops::package"
description: "Decision to scope storage deduplication to individual organizations rather than instance-wide"
toc_hide: true
---

<!-- Design Documents often contain forward-looking statements -->
<!-- vale gitlab.FutureTense = NO -->

## Context

Content-addressable storage <!-- (see [ADR-008](008_content_addressable_storage.md)) --> deduplicates identical content. We must choose the deduplication level: instance-wide, organization, top-level group, or repository.

This choice affects:

- **Cost attribution**: How easily we calculate and bill storage at the selected level
- **Performance**: Garbage collection and query complexity
- **Security**: Whether shared blobs leak information across boundaries
- **Operations**: Disaster recovery, backups, and cross-boundary coordination

### Current State

The container registry uses instance-wide deduplication and suffers [operational problems](https://gitlab.com/gitlab-org/container-registry/-/issues/1242):

- Cross-partition queries for cost calculations
- Garbage collection blocked by cross-namespace dependencies
- Storage costs difficult to attribute to specific organizations

The Package registry has no deduplication.

### Analysis

We evaluated three boundaries: instance-wide, top-level group, and repository.

Note: this analysis was carried out when the top-level group was the candidate level for object storage deduplication. Since then, [organizations were selected as the anchor level](001_organizations_as_anchor_point.md). Therefore, deduplication will be scoped to organizations.

**Container Registry** ([analysis](https://gitlab.com/gitlab-com/content-sites/handbook/-/merge_requests/17524#note_3023542021)):

| Deduplication scope | Total storage | Overhead vs instance-wide |
| ------------------- | ------------- | ------------------------- |
| Instance-wide | 13 PB | baseline |
| Top-level group | 13.5 PB | +4% (+530 TB) |
| Repository | ~17 PB | +36% (~4 PB) |

Roughly 95% or more of blobs appear in exactly one top-level group, and at most ~5% span multiple groups. Top-level group deduplication adds only about 4% storage overhead compared to instance-wide, while repository-level deduplication loses significant benefit (+36%).

**Maven Package Registry** ([analysis](https://gitlab.com/gitlab-com/content-sites/handbook/-/merge_requests/17524#note_3023014429)):

| Deduplication scope | Storage savings (deduplication benefit) |
| ------------------- | --------------------------------------- |
| Instance-wide | 3.68% |
| Top-level group | 3.62% |
| Repository | 2.11% |

Top-level group deduplication captures nearly the same benefit as instance-wide (3.62% vs 3.68%), a difference of only 0.06 percentage points. Repository-level deduplication loses over 40% of the benefit (2.11% vs 3.68%).

**Maven Virtual Registry** ([analysis](https://gitlab.com/gitlab-com/content-sites/handbook/-/merge_requests/17524#note_3027109326)):

| Metric | Value |
| ------ | ----- |
| Cache entries analyzed | 161,832 |
| Cache storage | ~44 GB |
| Duplicated with Package Registry (same group) | ~8-9% |

~92% of Virtual Registry cache entries are external dependencies not found in the same group's Package Registry. This confirms Virtual Registries primarily cache upstream content rather than content already stored locally. The Maven Virtual Registry is in beta with limited adoption, so these figures have lower confidence.

**Conclusion**: The top-level group scope captures nearly all of the deduplication benefit (96%+ for containers, 98%+ for Maven) while avoiding cross-group complexity. The small additional storage overhead is an acceptable tradeoff for simpler operations, clearer cost attribution, and security isolation. Because an organization contains one or more top-level groups, organization-level scope deduplicates at least as much as top-level group scope and sits closer to instance-wide scope, the most storage-efficient option.
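
The 96%+ and 98%+ figures in the conclusion can be reproduced from the tables above. The snippet below is one plausible reading of how they were derived, not a calculation taken from the linked analyses: it assumes the container figure is the ratio of instance-wide storage to top-level group storage, and the Maven figure is the ratio of the two savings percentages.

```python
# Figures copied from the tables above.
container_instance_wide_pb = 13.0
container_top_level_group_pb = 13.5
maven_instance_wide_savings = 3.68    # percent
maven_top_level_group_savings = 3.62  # percent

# Containers: instance-wide storage as a share of top-level group storage.
container_retained = container_instance_wide_pb / container_top_level_group_pb

# Maven: share of the instance-wide deduplication savings that is retained.
maven_retained = maven_top_level_group_savings / maven_instance_wide_savings

print(f"containers: {container_retained:.1%}")
print(f"maven: {maven_retained:.1%}")
```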

## Decision

**Scope deduplication to individual organizations.**

Identical content (same SHA256 hash) within an organization is stored once. Identical content across different organizations is stored separately in each.

This applies to all artifact types (Docker images, Maven packages, npm modules) and all content (container layers, package files, binary blobs).
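
As an illustration of the decision, the sketch below keys blobs by the pair (organization, SHA256 digest). The `OrganizationScopedBlobStore` class, its method names, and the in-memory dictionary standing in for object storage are assumptions made for this example only; the actual storage layout is not defined in this ADR.

```python
import hashlib


class OrganizationScopedBlobStore:
    """In-memory stand-in for object storage, keyed by (organization, digest)."""

    def __init__(self) -> None:
        self._blobs: dict[tuple[int, str], bytes] = {}

    def put(self, organization_id: int, content: bytes) -> tuple[str, bool]:
        """Store content and return (digest, newly_stored)."""
        digest = hashlib.sha256(content).hexdigest()
        key = (organization_id, digest)
        if key in self._blobs:
            # Same content already exists in this organization: reuse it.
            return digest, False
        self._blobs[key] = content
        return digest, True

    def blob_count(self) -> int:
        return len(self._blobs)


store = OrganizationScopedBlobStore()
layer = b"shared base layer"
digest_a, first_in_org_1 = store.put(1, layer)
_, repeat_in_org_1 = store.put(1, layer)
digest_b, first_in_org_2 = store.put(2, layer)
```

The second upload in organization 1 reuses the existing blob, while organization 2 gets its own copy even though the digest is identical.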

## Consequences

### Positive

1. **Clear cost attribution**: Sum blob sizes per organization. No cross-organization calculations.
2. **Fair customer billing**: Customers pay once per unique blob.
3. **Predictable performance**: Operations (queries, GC, backups) stay within one organization. No cross-organization dependencies.
4. **Security isolation**: Organizations cannot reference other organizations' blobs. No information leakage.
5. **Simpler garbage collection**: GC checks references within one organization only. No cross-organization coordination.
6. **Self-contained disaster recovery**: Restore one organization without touching others.
7. **Independent scaling**: Each organization scales storage independently.
8. **Best deduplication level across platforms**: GitLab.com will use organizations. Dedicated and self-managed, since they will operate under a single organization, will effectively have instance-level deduplication.
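
The cost attribution point can be sketched as a single grouped sum. The row layout `(organization_id, digest, size_in_bytes)` and the `storage_by_organization` helper are hypothetical; the sketch only assumes that each unique blob is recorded once per organization.

```python
from collections import defaultdict

# One row per unique blob within an organization (hypothetical layout).
blobs = [
    (1, "sha256:aaa", 500),
    (1, "sha256:bbb", 200),
    (2, "sha256:aaa", 500),
]


def storage_by_organization(rows):
    """Sum blob sizes per organization without looking at other organizations."""
    totals = defaultdict(int)
    for organization_id, _digest, size in rows:
        totals[organization_id] += size
    return dict(totals)


usage = storage_by_organization(blobs)
```

Each organization's total depends only on its own rows, which is what keeps the calculation free of cross-organization lookups.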

### Negative

1. **Cross-organization duplication**: Popular content (base images, public packages) is stored separately in each organization.
2. **Higher total instance storage**: Organization scope adds overhead vs instance-wide. This does *not* translate to an extra net expense as storage is billed to customers.
3. **Broader scope than repository-level**: Storage usage must track deduplicated blobs across repositories. GC must operate across all repositories, handling concurrent operations on shared blobs.

## Alternatives

### Alternative 1: Instance-Wide Deduplication

Store identical content once per instance, regardless of which organization uploads it.

**Pros:**

- Minimum overall storage cost
- Fewer physical objects in object storage

**Cons:**

- **Cost attribution requires cross-organization algorithms**: Reference counting across all organizations to calculate each organization's usage
- **GC requires cross-organization coordination**: Deleting content forces checking all organizations for remaining references
- **Unbounded query scope**: Operations may cascade across partitions unpredictably
- **Information leakage risk**: Shared references reveal what content other organizations use
- **Disaster recovery couples organizations**: Restoring one organization may require content from others
- **First uploader subsidizes others**: Organizations uploading popular content first pay for everyone
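
To show why cost attribution becomes a cross-organization problem under instance-wide deduplication, the sketch below splits each blob's size evenly across the organizations that reference it. The even split is a hypothetical policy chosen for illustration, and `attribute_shared_cost` is not an existing function; the point is only that one organization's usage changes when another organization adds a reference.

```python
from collections import defaultdict
from fractions import Fraction


def attribute_shared_cost(blob_sizes, references):
    """Split each blob's size evenly across the organizations referencing it."""
    referrers = defaultdict(set)
    for organization_id, digest in references:
        referrers[digest].add(organization_id)

    usage = defaultdict(Fraction)
    for digest, organizations in referrers.items():
        # Requires knowing every organization that references the blob.
        share = Fraction(blob_sizes[digest], len(organizations))
        for organization_id in organizations:
            usage[organization_id] += share
    return dict(usage)


blob_sizes = {"sha256:aaa": 600}
before = attribute_shared_cost(blob_sizes, [(1, "sha256:aaa")])
after = attribute_shared_cost(
    blob_sizes, [(1, "sha256:aaa"), (2, "sha256:aaa")]
)
```

Organization 1 is attributed the full blob until organization 2 references it, at which point both totals have to be recomputed.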

**Why rejected:**

- Analysis shows only low single-digit savings over top-level group scope (container ~4%; Maven is much smaller)
- Container registry's instance-wide approach proved [operationally expensive](https://gitlab.com/gitlab-org/container-registry/-/issues/1242)
- Security risk from cross-organization blob sharing

### Alternative 2: Top-Level Group Deduplication

**Pros:**

- Reasonable overhead vs instance-wide scope
- Already a well-established entity in GitLab

**Cons:**

- **Storage increase** over organization scope
- **Roll-up usage metrics** required for the organization level
- **Doesn't map naturally** to the [anchor level](001_organizations_as_anchor_point.md) of the feature, which is organization

**Why rejected:**

- Potential savings loss compared to organization-level scope
- Cognitive complexity from mixing deduplication scope with other features that operate at the organization level
- Roll-up metrics pose a challenge at scale, as shown by [past experience with namespace statistics](https://gitlab.com/groups/gitlab-org/-/work_items/8627)

### Alternative 3: Repository-Scoped Deduplication

Deduplicate only within a single repository.

**Pros:**

- Strongest isolation (no cross-repository references)
- Simplest GC and cost attribution

**Cons:**

- **~36% storage increase** over instance-wide scope for the container registry (see the analysis above)
- **Unfair customer billing**: Customers charged multiple times for identical content across repositories

**Why rejected:**

- Analysis shows substantial savings loss at repository scope
- Billing customers repeatedly for the same blob is likely unacceptable
- Organization scope provides better cost-benefit balance

## Implementation Notes

1. Content-addressable storage with SHA256 hashing <!-- (see [ADR-008](008_content_addressable_storage.md)) -->
2. Reference tracking scoped to individual organizations
3. Cost calculation: sum blob sizes per organization
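
A minimal sketch of organization-scoped reference tracking follows. The `OrganizationReferences` class and its methods are hypothetical names used for illustration, and the in-memory structure stands in for whatever database schema is eventually chosen. It shows that deciding whether a blob can be garbage collected only needs the references held inside one organization.

```python
from collections import defaultdict


class OrganizationReferences:
    """Tracks which repositories reference each blob, per organization."""

    def __init__(self):
        # organization_id -> digest -> set of referencing repositories
        self._refs = defaultdict(lambda: defaultdict(set))

    def add(self, organization_id, repository, digest):
        self._refs[organization_id][digest].add(repository)

    def remove(self, organization_id, repository, digest):
        self._refs[organization_id][digest].discard(repository)

    def collectable(self, organization_id):
        """Digests with no remaining references in this organization."""
        digests = self._refs[organization_id]
        return sorted(d for d, repos in digests.items() if not repos)


refs = OrganizationReferences()
refs.add(1, "app", "sha256:aaa")
refs.add(1, "lib", "sha256:aaa")
refs.add(2, "svc", "sha256:aaa")

refs.remove(1, "app", "sha256:aaa")
still_referenced = refs.collectable(1)

refs.remove(1, "lib", "sha256:aaa")
ready_for_gc = refs.collectable(1)
other_org = refs.collectable(2)
```

The blob becomes collectable in organization 1 only after its last repository reference is removed, and organization 2's copy is unaffected.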

## References

- [Container Registry Deduplication Analysis](https://gitlab.com/gitlab-com/content-sites/handbook/-/merge_requests/17524#note_3023542021)
- [Maven Package Registry Deduplication Analysis](https://gitlab.com/gitlab-com/content-sites/handbook/-/merge_requests/17524#note_3023014429)
- [Maven Virtual Registry Cache Analysis](https://gitlab.com/gitlab-com/content-sites/handbook/-/merge_requests/17524#note_3027109326)
<!-- - [ADR-008: Content-Addressable Storage](008_content_addressable_storage.md) -->
- [Container Registry Deduplication Complexity](https://gitlab.com/gitlab-org/container-registry/-/issues/1242)