Ensuring Safe Execution of Post-Deployment Migrations in Self-Managed GitLab Instances

Context

The Container Registry currently supports schema migrations categorized as:

  • Pre-deployment migrations – Short-running migrations executed during registry startup via the CLI. Ideally, these run automatically through a helper (e.g., a Kubernetes job that invokes the CLI before starting a new registry).
  • Post-deployment migrations – Longer-running migrations that can be executed alongside Pre-deployment migrations or deferred until after the registry is operational to prevent startup delays.

Current Post-deployment Migration Process

| Environment | Post-deployment Execution |
| --- | --- |
| GitLab.com | Applied manually by the registry DB admin after Pre-deployment runs automatically |
| Omnibus (self-managed) | Applied together with Pre-deployment manually |
| Charts (self-managed) | Applied together with Pre-deployment automatically |

Goal

We want self-managed deployments to be able to apply Post-deployment migrations manually and separately from Pre-deployment migrations. This would let users avoid delaying registry startup during upgrades, similar to GitLab.com’s approach. However, this requires a safe way to run Post-deployment migrations independently, without risking migration corruption or breaking our zero-downtime guarantees.

Problem

While letting users defer Post-deployment migrations is desirable, the current CLI implementation (when paired with the standard Post-deployment migration pattern) has issues that can lead to migration corruption or to downtime that falls outside GitLab's zero-downtime guarantees. Specifically:

Issue: The --skip-post-deployment Flag Can Cause a Broken Migration State

Consider the migration sequence:

  1. 1_first_migration
  2. 2_post_second_migration (post-deployment)
  3. 3_third_migration

If a user skips Post-deployment migrations and runs:

sudo gitlab-ctl registry-database migrate up --skip-post-deployment

The system applies:

1_first_migration
3_third_migration
OK: applied 2 migrations in Xs

However, when later attempting to run:

sudo gitlab-ctl registry-database migrate up

or (when/if available):

sudo gitlab-ctl registry-database migrate up --post-deployment

They encounter:

failed to run database migrations: applying migration 2_post_second_migration: Unable to create migration plan because of: unknown migration with version id 2 in database

This occurs because migration 2 was skipped while 3 was applied, causing an irreversible version gap. The migration system tracks applied versions sequentially, so once version 3 is recorded, the skipped version 2 can no longer be applied in order.
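
The failure can be reproduced with a small model of a sequential planner. This is a simplified sketch, not the registry's actual migration code: the `Migration` class, the `migrate_up` function, the `PlanError` exception, and its message are all hypothetical, and the only rule modeled is the assumption that nothing can be applied below the highest version already recorded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Migration:
    version: int
    name: str
    post_deployment: bool = False


class PlanError(Exception):
    """Raised when an unapplied migration sits below an already applied version."""


def migrate_up(source, applied, skip_post_deployment=False):
    """Apply pending migrations in version order, recording versions in `applied`."""
    ran = []
    for m in sorted(source, key=lambda m: m.version):
        if m.version in applied:
            continue
        if skip_post_deployment and m.post_deployment:
            continue
        # Assumed sequential rule: nothing may be applied below the highest
        # version already recorded in the database.
        if applied and m.version < max(applied):
            raise PlanError(f"cannot apply {m.name}: version {max(applied)} is already applied")
        applied.add(m.version)
        ran.append(m.name)
    return ran


source = [
    Migration(1, "1_first_migration"),
    Migration(2, "2_post_second_migration", post_deployment=True),
    Migration(3, "3_third_migration"),
]

applied = set()
# First run skips the Post-deployment migration, so versions 1 and 3 are recorded.
first_run = migrate_up(source, applied, skip_post_deployment=True)

# Second run tries to apply version 2 below the recorded version 3 and fails.
try:
    migrate_up(source, applied)
    second_run_failed = False
except PlanError:
    second_run_failed = True
```

In this model the second call always fails, whether it runs all pending migrations or only the Post-deployment ones, because the gap at version 2 is already recorded below version 3.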

Why This Hasn’t Been an Issue on GitLab.com

GitLab.com has likely not encountered this issue because we have consistently placed post-deployment migrations (when available) as the final set of migrations in each GitLab.com registry version deployment. When releasing new registry versions, we ensure that Post-deployment migrations are executed manually—immediately after any new Pre-deployment migrations and always before introducing a newer registry version with newer schema migrations. This approach has prevented versioning conflicts by ensuring that there are no unapplied Post-deployment migrations to interfere with subsequent migrations.

Risk in Self-Managed Environments

Unlike GitLab.com, which undergoes multiple registry version deployments within a milestone, self-managed instances transition only once per milestone—receiving the cumulative changes from all registry versions in that milestone. Because of this difference in release patterns and the fact that self-managed environments handle migrations independently, users may upgrade to a GitLab version where the registry includes a mix of Pre-deployment and Post-deployment migrations. If Post-deployment migrations are skipped, this can lead to:

  • Previously skipped Post-deployment migrations becoming permanently impossible to apply.
  • A cycle where Post-deployment and Pre-deployment migrations must be applied in sequence repeatedly (see example scenario here).

Required Guarantees

To prevent migration corruption and maintain GitLab's zero-downtime guarantees, the migration system must enforce three rules:

  • Prevent skipping dependent post-deployment migrations: Skipping a Post-deployment migration must fail if a Pre-deployment migration depends on it in any way. If a Post-deployment migration is required (whether functionally or because of schema migration versioning) to apply subsequent Pre-deployment migrations, the system must enforce its execution to maintain the correct migration order.
  • Enforce a --post-deployment flag: The --post-deployment flag should apply only pending Post-deployment migrations that do not explicitly depend on an unapplied Pre-deployment migration.
  • Maintain seamless version upgrades: Consecutive GitLab versions (e.g., 17.1 to 17.2) must not introduce a sequence of Post-deployment and Pre-deployment migrations that cause downtime by repeatedly interrupting the migration process.

Solutions

Option A: Retain Existing Migration Approach with Adjustments

To maintain GitLab’s zero-downtime guarantee:

  1. Modify --skip-post-deployment behaviour so that Post-deployment migrations can only be skipped if they are the final migrations in an unapplied migration sequence. In any other position, skipping would result in a broken sequence, so the command must fail.
  2. Introduce stricter processes/pipelines to ensure Post-deployment migrations are always the last migrations in a GitLab release cycle.
  3. Support a --post-deployment flag that runs only Post-deployment migrations (which, given 1. and 2., will always be the last in the migration set) #1516 (comment 2364077865)
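
The check behind steps 1 and 2 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the CLI's implementation: the function names are hypothetical, and the pending migrations are assumed to be available as `(name, is_post_deployment)` pairs in version order.

```python
def trailing_post_deployment_only(pending):
    """Return True if every Post-deployment migration comes after all Pre-deployment ones.

    `pending` is a list of (name, is_post_deployment) pairs in version order.
    """
    seen_post = False
    for _name, is_post in pending:
        if is_post:
            seen_post = True
        elif seen_post:
            # A Pre-deployment migration follows a Post-deployment one.
            return False
    return True


def plan_skip_post_deployment(pending):
    """Return the Pre-deployment migrations to run, or fail if skipping would leave a gap."""
    if not trailing_post_deployment_only(pending):
        raise ValueError(
            "refusing to skip: a Post-deployment migration precedes a pending Pre-deployment migration"
        )
    return [name for name, is_post in pending if not is_post]


# Post-deployment migration is last, so it can be deferred safely.
safe = plan_skip_post_deployment(
    [("1_first_migration", False), ("2_post_second_migration", True)]
)

# Post-deployment migration sits between two Pre-deployment ones, so skipping fails.
try:
    plan_skip_post_deployment(
        [
            ("1_first_migration", False),
            ("2_post_second_migration", True),
            ("3_third_migration", False),
        ]
    )
    unsafe_rejected = False
except ValueError:
    unsafe_rejected = True
```

The same `trailing_post_deployment_only` predicate could also back the release pipeline check in step 2, by running it against the full set of migrations introduced in a release.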

Pros:

  • Preserves zero-downtime guarantee.
  • Users can still manually apply Post-deployment migrations cleanly.

Cons:

  • Restricts flexibility: Post-deployment migrations must always be last, limiting the ability to introduce urgent Pre-deployment fixes after a Post-deployment migration has been released.
  • If an incident resolution or critical fix requires a Pre-deployment migration after Post-deployment migrations have shipped in a milestone, we must wait for the next GitLab version to address it or risk breaking zero-downtime guarantees.

Option B: Decouple Pre-deployment and Post-deployment Migrations with Explicit Dependencies

Instead of relying on a sequential versioning system, we would:

  1. Separate Pre-deployment and Post-deployment migration versioning, ensuring they do not interfere unless explicitly dependent.
  2. Introduce a dependency graph that enforces when a Pre-deployment migration relies on a prior Post-deployment migration.
  3. Support a --post-deployment flag that runs only Post-deployment migrations that have no dependencies on unapplied Pre-deployment migrations #1516 (comment 2364077865)
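
A minimal sketch of how explicit dependencies could drive planning is shown below. Everything here is hypothetical and for illustration only: the `Migration` class, the `depends_on` field, the `plan` function, and the migration names are assumptions, and the migration list is assumed to be declared in an order where dependencies come first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Migration:
    name: str
    post_deployment: bool = False
    depends_on: tuple = ()  # names of migrations that must be applied first


def plan(migrations, applied, only_post=False, skip_post=False):
    """Return the names of migrations to run, in declared order.

    only_post models --post-deployment; skip_post models --skip-post-deployment.
    """
    done = set(applied)
    to_run = []
    for m in migrations:
        if m.name in done:
            continue
        selected = m.post_deployment if only_post else not (skip_post and m.post_deployment)
        if not selected:
            continue
        missing = [d for d in m.depends_on if d not in done]
        if missing:
            if only_post:
                # Leave it pending until its dependency has been applied.
                continue
            raise RuntimeError(f"{m.name} requires unapplied migrations: {missing}")
        done.add(m.name)
        to_run.append(m.name)
    return to_run


migrations = [
    Migration("pre_a"),
    Migration("post_b", post_deployment=True),
    Migration("pre_c"),
    Migration("post_d", post_deployment=True, depends_on=("pre_c",)),
]

# Independent Post-deployment migrations no longer block later Pre-deployment ones.
pre_only = plan(migrations, set(), skip_post=True)

# post_d waits for pre_c; post_b has no dependencies and can run right away.
post_early = plan(migrations, {"pre_a"}, only_post=True)
post_later = plan(migrations, {"pre_a", "pre_c"}, only_post=True)

# A Pre-deployment migration that depends on a skipped Post-deployment one must fail.
dependent = migrations + [Migration("pre_e", depends_on=("post_b",))]
try:
    plan(dependent, set(), skip_post=True)
    skip_rejected = False
except RuntimeError:
    skip_rejected = True
```

In this sketch, skipping is rejected only when an explicit dependency would be violated, which is the difference from Option A, where any Post-deployment migration in the middle of the sequence blocks skipping.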

Pros:

  • Maintains zero-downtime guarantees.
  • Allows greater flexibility—non-dependent Post-deployment and Pre-deployment migrations can be introduced at any time.

Cons:

  • Increased implementation complexity: requires a dependency tracking mechanism and stronger enforcement rules.

Conclusion

I recommend Option B because it decouples Pre-deployment and Post-deployment migrations, allowing greater flexibility in introducing schema changes without being constrained by strict version sequencing. This prevents migration bottlenecks, enables faster emergency fixes, and ensures smoother upgrades for self-managed users by enforcing explicit dependencies instead of relying on sequential ordering. However, this comes at the cost of increased implementation complexity: it requires a dependency tracking mechanism and stronger enforcement rules.

Edited by SAhmed