Commit 48f43f46 authored by Rémy Coutable

Add Organization lifecycle blueprint and ADR 009

parent c4417b09
+1 −0
@@ -91,6 +91,7 @@ Organization affects other parts of the system.
- [Billing](billing.md)
- [Cells](cells.md)
- [Settings](settings.md)
- [Lifecycle](lifecycle.md)
- [Users](users.md)
- [Login](login.md)
- [OAuth - GitLab as SP](oauth_client_auth.md)
+56 −0
---
owning-stage: "~devops::tenant scale"
title: 'Organizations ADR 009: State machine for organization lifecycle'
description: Why we use the state_machines gem backed by organizations.state and organization_details.state_metadata for the Organization lifecycle.
toc_hide: true
---

## Context

Organizations sit at the top of the resource hierarchy and own groups, projects, users, and settings. Their lifecycle needs explicit, machine-enforced control:

- An Organization must not be usable before confirmation.
- Deletion is two-tiered: reversible soft-delete for owners, irreversible hard-delete for admins.
- Every transition must be auditable (who, when, why — including the error on failures).
- Failed transitions must leave the row in a consistent, recoverable state.

The deletion workflow is tracked in [Add ability to delete an Organization](https://gitlab.com/groups/gitlab-org/-/work_items/21433).

## Decision

We manage the Organization lifecycle with the [`state_machines` gem](https://github.com/state-machines/state_machines), backed by:

- `organizations.state` (SMALLINT) — the authoritative state value.
- `organization_details.state_metadata` (JSONB) — the audit trail, validated against a strict JSON Schema on every save.

Low-level infrastructure (metadata writes, logging, transition-user validation) is shared with `Namespaces::Stateful` through four `Gitlab::TenantContainerLifecycle::Stateful` modules.

This ADR records the *mechanism* only. The state catalog, transitions, and conventions for adding new states live in the [Organization Lifecycle](../lifecycle.md) blueprint, which is the single source of truth.

## Consequences

- All state changes go through the state machine — direct assignment to `organizations.state` is invalid.
- `state_metadata` uses `additionalProperties: false`: any MR adding a metadata field must update `organization_detail_state_metadata.json` in the same MR, or saves will fail validation.
- Transition services must pass `transition_user:`; the machine enforces this through `ensure_transition_user`.
- The shared `TenantContainerLifecycle::Stateful` modules must stay backward-compatible with both `Organizations::Stateful` and `Namespaces::Stateful`.
- New states and transitions do not require new ADRs — they ship in the blueprint, the schema, and the state machine. A new ADR is only needed when the mechanism itself changes.

## Alternatives

### Single boolean flag (`active` / `deleted`)

Rejected: a boolean cannot represent intermediate states (confirmation, in-flight hard deletion). No audit trail, no guards.

### Separate columns per concern (`is_confirmed`, `confirmed_at`, `soft_deleted_at`, …)

Rejected: nothing enforces mutual exclusivity, so an Organization could appear simultaneously `confirmed` and mid-hard-deletion. Guards and audit become ad-hoc per-feature code. This is the approach the legacy namespace deletion used (`group_deletion_schedules`, `marked_for_deletion_at`) and that we are moving away from — see the [Group and Project Operations blueprint](../../group_and_project_operations_and_state_management/_index.md).

### Renamed intermediate state (`confirmation_in_progress` / `activation_in_progress`)

Discussed in the [intermediate-state naming thread](https://gitlab.com/gitlab-com/content-sites/handbook/-/merge_requests/19655/diffs#note_3313088904).

Rejected: the `_in_progress` convention in the namespace lifecycle names the background process performing the operation (user says "delete" → `deletion_in_progress`). Here the user is confirming the Organization's structure, not kicking off a "confirmation" process; `confirmation_in_progress` would imply the user is mid-action. `confirmed` + `active` keep the user's completed action and the system's completed activation as two distinct, durable states.

### Reuse `Namespaces::Stateful` directly

Rejected: Organizations are not namespaces — no parent, no inheritance, no archival, no transfer. Sharing the full namespace machine would mean conditional branching for org-specific behavior throughout. The current design shares only the low-level infrastructure modules.
+2 −0
@@ -13,6 +13,8 @@ toc_hide: true
This blueprint details requirements for Organizations to be isolated.
Read more about what an Organization is in [Organization](_index.md).

Isolation flags are orthogonal to the Organization lifecycle (`unconfirmed`, `confirmed`, `active`, etc.) described in [Organization Lifecycle](lifecycle.md), with one dependency: the first isolation step (`isolation_desired`) requires the organization to be `active`.

## What?

All Organization data and functionality in GitLab will be isolated.
+201 −0
---
title: Organization Lifecycle
description: How Organizations move from creation to soft- and hard-deletion, and how every transition is audited.
status: ongoing
creation-date: "2026-05-05"
authors: [ "@rymai" ]
dris: [ "@rymai" ]
owning-stage: "~devops::tenant scale"
participating-stages: []
toc_hide: true
---

<!-- Design Documents often contain forward-looking statements -->
<!-- vale gitlab.FutureTense = NO -->

## Summary

An Organization moves through five states: `unconfirmed` → `confirmed` → `active` → `soft_deleted` → `deletion_in_progress`. Owners can soft-delete an `active` Organization (which hides it from the UI and public API) and restore it. Only instance admins can escalate a `soft_deleted` Organization to hard deletion, which is irreversible. Every transition is audited in a JSONB column on `organization_details`.

We use the [`state_machines` gem](https://github.com/state-machines/state_machines) and share low-level infrastructure with `Namespaces::Stateful` through `Gitlab::TenantContainerLifecycle::Stateful` modules. See [ADR 009](decisions/009_state_machine.md) for the rationale.

## Goals and non-goals

Goals:

- A machine-enforced lifecycle with explicit allowed transitions.
- An immutable audit trail for every transition, stored alongside the Organization.
- Reversible soft-deletion for owners; admin-gated hard-deletion for legal/GDPR follow-through.
- Shared infrastructure with the namespace state machine to avoid duplication.

Non-goals:

- Archival (a namespace concept).
- Cross-cell transfer.
- State inheritance — Organizations are roots.

## State diagram

```mermaid
stateDiagram-v2
    direction LR
    unc: unconfirmed
    con: confirmed
    act: active
    sd:  soft_deleted
    dip: deletion_in_progress

    [*]  --> unc : (organization created)
    unc  --> con : confirm
    con  --> act : activate
    act  --> sd  : soft_delete
    sd   --> act : restore
    sd   --> dip : hard_delete
    dip  --> [*]
```

There is no `deleted` state: a successful hard deletion destroys the row. `unconfirmed` and `confirmed` have no direct transition to `soft_deleted`, so an Organization that has not yet completed activation cannot be deleted.

### States

| State | Integer | Meaning |
|-------|:-:|---------|
| `unconfirmed` | 0 | Newly created; not yet usable. |
| `soft_deleted` | 1 | Hidden from UI and public API; owners can restore, admins can hard-delete. |
| `deletion_in_progress` | 2 | Hard-deletion worker is running; the row is destroyed on success. |
| `confirmed` | 3 | Owner has confirmed; background provisioning is running. |
| `active` | 4 | Provisioning complete; fully operational. |

Integer values are append-only and reflect introduction order, not lifecycle order.

### Transitions

| Event | Source → Target | Required arguments |
|-------|-----------------|--------------------|
| `confirm` | unconfirmed → confirmed | `transition_user`, `confirmed_by_user` |
| `activate` | confirmed → active | — |
| `soft_delete` | active → soft_deleted | `transition_user` |
| `restore` | soft_deleted → active | `transition_user` |
| `hard_delete` | soft_deleted → deletion_in_progress | `transition_user` |

Every transition records who triggered it through `update_state_metadata`. Failures call `update_state_metadata_on_failure`, which writes `last_error` and emits a structured log without changing state.

Authorization for `soft_delete`, `restore`, and `hard_delete` is enforced at the [service layer](#service-entry-points). The state machine only checks that `transition_user` is supplied.
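The two tables above can be read as a lookup structure. The following is a minimal, framework-free sketch of that reading, not the real `Organizations::Stateful` module: the class name `OrganizationLifecycleSketch` and the `fire` method are hypothetical, and metadata writes and logging are left out. It only shows that an event is accepted when the current state matches the source and every required argument is supplied.

```ruby
# Hypothetical sketch of the lifecycle rules; not the real implementation.
class OrganizationLifecycleSketch
  # Integer values follow the states table (append-only, introduction order).
  STATES = {
    unconfirmed: 0,
    soft_deleted: 1,
    deletion_in_progress: 2,
    confirmed: 3,
    active: 4
  }.freeze

  # event => [source, target, required keyword arguments]
  TRANSITIONS = {
    confirm:     [:unconfirmed,  :confirmed,            %i[transition_user confirmed_by_user]],
    activate:    [:confirmed,    :active,               []],
    soft_delete: [:active,       :soft_deleted,         %i[transition_user]],
    restore:     [:soft_deleted, :active,               %i[transition_user]],
    hard_delete: [:soft_deleted, :deletion_in_progress, %i[transition_user]]
  }.freeze

  attr_reader :state

  def initialize
    @state = :unconfirmed
  end

  # Returns true and changes state when the event is allowed; otherwise
  # returns false and leaves the state untouched.
  def fire(event, **args)
    source, target, required = TRANSITIONS.fetch(event)
    return false unless state == source
    return false unless required.all? { |key| args[key] }

    @state = target
    true
  end
end
```

In this sketch, calling `fire(:soft_delete, transition_user: user)` on a freshly created instance returns `false`, because the only event whose source is `unconfirmed` is `confirm`.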

## Data model

```sql
organizations
  state  SMALLINT  NOT NULL  DEFAULT 0

organization_details
  soft_deleted_at  TIMESTAMP WITH TIME ZONE
  state_metadata   JSONB  NOT NULL  DEFAULT '{}'
```

`state_metadata` is validated against a strict JSON Schema (`organization_detail_state_metadata.json`, `additionalProperties: false`):

```json
{
  "last_updated_at":         "<datetime>",
  "last_changed_by_user_id": <integer | null>,
  "last_error":              "<string | null>",
  "correlation_id":          "<string | null>",
  "soft_deleted_by_user_id": <integer | null>,
  "restored_at":             "<datetime | null>",
  "restored_by_user_id":     <integer | null>,
  "confirmed_at":            "<datetime | null>",
  "confirmed_by_user_id":    <integer>
}
```

Fields are exposed as typed accessors on `OrganizationDetail` through `jsonb_accessor`.
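To illustrate what `additionalProperties: false` means in practice, here is a small plain-Ruby sketch that flags keys the schema does not list. It is an assumption-laden stand-in: the helper name `unknown_metadata_keys` is hypothetical, the real validation is done by the JSON Schema file, and this sketch checks key names only, not value types.

```ruby
require "set"

# Keys taken from the schema shape shown above.
ALLOWED_METADATA_KEYS = Set.new(%w[
  last_updated_at last_changed_by_user_id last_error correlation_id
  soft_deleted_by_user_id restored_at restored_by_user_id
  confirmed_at confirmed_by_user_id
]).freeze

# Hypothetical helper: returns the keys that a schema with
# `additionalProperties: false` would reject. Value types are not checked.
def unknown_metadata_keys(metadata)
  metadata.keys.map(&:to_s).reject { |key| ALLOWED_METADATA_KEYS.include?(key) }
end
```

A non-empty result corresponds to a save that would fail validation until the schema file is updated to include the new key.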

## Adding a new state or transition

A state-machine change spans two repositories:

1. In `gitlab-org/gitlab`, in a single MR: `Organizations::Stateful` (state enum, `state_machine` block, guards, callbacks) **and** `organization_detail_state_metadata.json` if the new state adds metadata fields. The schema and the code must land together — `additionalProperties: false` will fail saves in production otherwise.
2. In `gitlab-com/content-sites/handbook` (this repository): this blueprint — states table, transitions table, future-work table.

Cross-link the two MRs and merge them together.

Integer values are append-only — assign the next free integer, regardless of lifecycle position.
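The append-only rule can be sketched as follows. The helper `append_state` and the state name `:example_state` are hypothetical and exist only for illustration; the real enum lives in `Organizations::Stateful`.

```ruby
# Current mapping, copied from the states table.
STATES = {
  unconfirmed: 0,
  soft_deleted: 1,
  deletion_in_progress: 2,
  confirmed: 3,
  active: 4
}.freeze

# Hypothetical helper: returns a new mapping with `name` appended at the
# next free integer. Existing values are never renumbered.
def append_state(states, name)
  raise ArgumentError, "#{name} already exists" if states.key?(name)

  states.merge(name => states.values.max + 1).freeze
end
```

With the mapping above, a newly appended state receives the next integer after `active`, wherever it sits in the lifecycle.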

## Service entry points

Every user-driven transition has a dedicated service that wraps the state-machine event with authorization, idempotency, and audit logging. Each one follows the same shape:

1. Check authorization through `OrganizationPolicy`.
2. Verify the current state is a valid source for the event.
3. Invoke the event with `transition_user: current_user`.
4. Surface state-machine errors as the service response if the transition did not happen.
5. Emit an audit-log event and return a successful `ServiceResponse`.

| Service | Event | Ability |
|---------|-------|---------|
| `Organizations::SoftDeleteService` | `soft_delete` | `:soft_delete_organization` |
| `Organizations::RestoreService` | `restore` | `:restore_organization` |
| `Organizations::HardDeleteService` | `hard_delete` | `:hard_delete_organization` (admin-only) |

Notes:

- `SoftDeleteService` requires the Organization to be empty (no groups or projects). Soft deletion only hides the Organization and is reversible.
- `HardDeleteService` enqueues the background hard-deletion worker on success; the worker performs the row destruction. Hard deletion is for legal/GDPR follow-through and is not exposed in the standard UI.
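The five-step shape described in this section can be sketched in plain Ruby. Everything below is hypothetical: `Response` stands in for GitLab's `ServiceResponse`, the `policy` object stands in for `OrganizationPolicy`, and `audit_log` is any object that responds to `<<`. The emptiness check for soft deletion is omitted to keep the sketch short.

```ruby
# Hypothetical stand-in for ServiceResponse.
Response = Struct.new(:status, :message, keyword_init: true) do
  def success?
    status == :success
  end
end

# Hypothetical sketch of the service shape; not Organizations::SoftDeleteService.
class SoftDeleteServiceSketch
  def initialize(organization, current_user:, policy:, audit_log:)
    @organization = organization
    @current_user = current_user
    @policy = policy        # responds to allowed?(user, ability, organization)
    @audit_log = audit_log  # responds to <<
  end

  def execute
    # 1. Check authorization.
    unless @policy.allowed?(@current_user, :soft_delete_organization, @organization)
      return error("not authorized")
    end

    # 2. Verify the current state is a valid source for the event.
    return error("invalid state: #{@organization.state}") unless @organization.state == :active

    # 3. Invoke the event with the acting user.
    fired = @organization.soft_delete(transition_user: @current_user)

    # 4. Surface a failed transition as the service response.
    return error("transition failed") unless fired

    # 5. Emit an audit entry and return success.
    @audit_log << { event: :soft_delete, user: @current_user }
    Response.new(status: :success, message: nil)
  end

  private

  def error(message)
    Response.new(status: :error, message: message)
  end
end
```

Because step 2 rejects any state other than `active`, calling the sketch twice on the same Organization returns an error the second time instead of firing the event again.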

## Error handling

When a transition fails (a guard returns `false`):

- `update_state_metadata_on_failure` writes the error to `state_metadata['last_error']` and saves the detail record.
- `log_transition_failure` emits a structured error log.
- `organizations.state` is **never** modified on failure.

If a hard-deletion worker fails partway, the Organization stays in `deletion_in_progress` with `last_error` populated. Recovery is by re-running an idempotent worker, not a state-machine backward transition. A dedicated recovery transition can be added later if we need it.
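The failure contract above (error recorded, state untouched) can be illustrated with a small sketch. `OrgRecord` and `attempt_transition` are hypothetical names: `state` stands in for `organizations.state` and `state_metadata` for `organization_details.state_metadata`. Structured logging and the other metadata fields are left out.

```ruby
# Hypothetical record combining the two columns discussed above.
OrgRecord = Struct.new(:state, :state_metadata)

# Attempts a transition guarded by the given block. When the guard returns
# false, only `last_error` is written; the state is left as it was.
def attempt_transition(org, target:, error_message:)
  if yield(org)
    org.state = target
    true
  else
    org.state_metadata = org.state_metadata.merge("last_error" => error_message)
    false
  end
end
```

After a failed attempt, the record still reports its previous state, and `state_metadata["last_error"]` holds the message, so a retry starts from a consistent row.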

## Future work

The state machine is in place; the service and API surface still need work:

| Transition | Service | GraphQL mutation | REST endpoint |
|-----------|---------|-----------------|--------------|
| `confirm` | [#598074](https://gitlab.com/gitlab-org/gitlab/-/work_items/598074) | [#596669](https://gitlab.com/gitlab-org/gitlab/-/work_items/596669) | [#596669](https://gitlab.com/gitlab-org/gitlab/-/work_items/596669) |
| `activate` | [#597856](https://gitlab.com/gitlab-org/gitlab/-/work_items/597856) | N/A (background) | N/A (background) |
| `soft_delete` | [#594308](https://gitlab.com/gitlab-org/gitlab/-/work_items/594308) — rename pending | [#594313](https://gitlab.com/gitlab-org/gitlab/-/work_items/594313) — rename pending | [#599345](https://gitlab.com/gitlab-org/gitlab/-/work_items/599345) — rename pending |
| `restore` | [#599343](https://gitlab.com/gitlab-org/gitlab/-/work_items/599343) | [#599344](https://gitlab.com/gitlab-org/gitlab/-/work_items/599344) | [#599346](https://gitlab.com/gitlab-org/gitlab/-/work_items/599346) |
| `hard_delete` | TBD — admin-only | TBD — admin-only | TBD — admin-only |

"Rename pending" rows are issues originally framed around `schedule_deletion` / `cancel_deletion` / `start_deletion` that need re-scoping to the soft-delete / restore / hard-delete naming. Finder changes to hide `soft_deleted` Organizations from non-owners are tracked in [#594312](https://gitlab.com/gitlab-org/gitlab/-/work_items/594312).

## Relationship with Organization Isolation

Lifecycle and [Isolation](isolation.md) are orthogonal. Lifecycle answers *"Is this Organization operational?"*; isolation answers *"How strictly are its data boundaries enforced?"*. They do not share a state machine, and isolation flags can be set independently of soft-deletion.

One dependency: the first isolation step (`isolation_desired`) requires the Organization to be `active`. Triggering isolation in `unconfirmed` or `confirmed` would be premature.

## Open Questions

### Concurrency and locking

Two actors could try to transition the same Organization at once — for example, an owner restores while an admin hard-deletes. Current lean: optimistic locking on `lock_version` is enough. All transitions are human-driven, so contention should be rare. If real-world conflict rates are higher than expected, we can either add a custom pessimistic-lock helper or migrate to [AASM](https://github.com/aasm/aasm#pessimistic-locking), which supports pessimistic locking natively. Decide before the first user-facing surface ships.
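To make the optimistic-locking option concrete, here is an in-memory sketch of the idea behind a `lock_version` column: a write succeeds only if the version read earlier is still current. In Rails, Active Record provides this behavior for models with a `lock_version` column and raises `ActiveRecord::StaleObjectError` on conflict. The `VersionedStore` and `StaleRecord` names below are hypothetical stand-ins so the example runs without a database.

```ruby
# Hypothetical error, playing the role of a stale-object error.
class StaleRecord < StandardError; end

# Hypothetical in-memory stand-in for a row with a lock_version column.
class VersionedStore
  def initialize(state)
    @row = { state: state, lock_version: 0 }
  end

  # Returns a snapshot, as an actor would load the row before acting.
  def read
    @row.dup
  end

  # Writes only if the snapshot's version still matches the stored row,
  # then bumps the version so older snapshots become stale.
  def update(snapshot, new_state)
    unless snapshot[:lock_version] == @row[:lock_version]
      raise StaleRecord, "row changed since it was loaded"
    end

    @row = { state: new_state, lock_version: @row[:lock_version] + 1 }
  end
end
```

In the owner-restores-while-admin-hard-deletes example, both actors load the same snapshot; whichever writes second gets `StaleRecord` and must reload, at which point the state is no longer a valid source for its event.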

### Recovery from `confirmed`-state failures

If background provisioning fails after `confirm`, the Organization stays in `confirmed` indefinitely — there is no path back to `unconfirmed` or forward to a `failed` state. Are we relying on idempotent retries, or do we need a recovery transition? To be decided.

### Initial state for user-created Organizations

`unconfirmed` fits the case where GitLab provisions an Organization for a customer. Once end users create Organizations themselves (post-GA), there is no provisioning step to confirm. Two options:

- Run `confirm` + `activate` synchronously inside the creation service, so `ConfirmationService` side effects still execute.
- Allow `unconfirmed → active` directly (or default user-created rows to `active`) when no side effects are needed.

The choice depends on what side effects, if any, are bound to confirmation by the time self-service ships. See [MR thread](https://gitlab.com/gitlab-com/content-sites/handbook/-/merge_requests/19693#note_3328588386).

### Retention window for `soft_deleted`

Should `restore` be available indefinitely, or expire after a retention window (after which only `hard_delete` is legal)? Indefinite is simplest; a fixed window (for example, 30 days) would match the prior delayed-deletion behavior and GDPR expectations. Decide before `restore` ships behind a UI.

## Alternative Solutions

See [ADR 009](decisions/009_state_machine.md) for the rationale for using a state machine over simpler data models.