Organization Data Migration: Rollback Options

Problem

As currently planned (as of 17 Nov 2025), the migration of organizations from the Legacy Cell to the Protocell is a one-way door. A critical concern raised by Product and Engineering is the lack of an automated, reliable rollback mechanism should a migrated organization experience:

  1. Lack of Feature Parity: The user needs a feature on the Legacy Cell that is not yet available on the Protocell, even though they had never used that feature before.
  2. Migration Failure: Something went wrong during the data transfer process.
  3. Operational Issues: Features are broken, or performance is unacceptable on the Protocell.

We plan to migrate dormant or inactive users before feature parity is reached, in order to mitigate scaling risk as soon as possible. However, any user who becomes active after migration is likely to have a bad experience. We must balance development effort and timeline impact against customer experience and data integrity.

Proposals

Option 1: Topology Flip (Immediate Reversion)

This approach is focused on minimizing delay to the overall Protocell migration timeline by accepting a known user experience risk for a time-boxed period.

| Aspect | Detail |
| --- | --- |
| Mechanism | The organization's data on the Legacy Cell is kept read-only/frozen after the forward migration to the Protocell. If a rollback is triggered, the Topology Service is flipped to re-route all traffic back to the original data on the Legacy Cell. |
| Effort/Timeline | Low Effort / Minimal Delay. This approach leverages existing Topology Service functionality and requires minimal new work, significantly reducing the risk of delaying the Cohort 1 migration timeline. |
| User Journey / Data Impact | High Data Loss Risk. The user is instantly reverted to the state of their organization before the migration. Any work or data created or modified on the Protocell since migration will be permanently lost. |
| Viability | Short TTL (Time-to-Live). This option is only viable for a very short duration (e.g., 1-2 weeks) after migration. Beyond this, the data loss becomes unacceptable, and the integrity of the frozen data on the Legacy Cell is at risk due to uncoordinated background jobs. |
| Applicability | Primarily suited for Cohort 1 (inactive/dormant users) or immediate Migration Failure scenarios. In those cases, the effort to roll back is low and the risks are low. |
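
The mechanism and its TTL constraint can be sketched as a small model. This is an illustration only, not the Topology Service API: `OrgRoute`, `flip_back`, and the `writes_since_migration` counter are hypothetical names introduced here, and the two-week TTL is taken from the "1-2 weeks" example above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

LEGACY = "legacy-cell"
PROTOCELL = "protocell"


@dataclass
class OrgRoute:
    """Hypothetical routing record; not the real Topology Service schema."""
    org_id: int
    cell: str
    migrated_at: datetime
    writes_since_migration: int = 0  # changes made on the Protocell


def flip_back(route: OrgRoute, now: datetime, ttl: timedelta) -> int:
    """Re-route an organization to its frozen Legacy Cell copy.

    Returns the number of Protocell writes discarded by the flip.
    Raises if the organization is not on the Protocell or the window has closed.
    """
    if route.cell != PROTOCELL:
        raise ValueError("organization is not on the Protocell")
    if now - route.migrated_at > ttl:
        raise RuntimeError("rollback window expired; frozen data may be stale")
    lost = route.writes_since_migration
    route.cell = LEGACY
    route.writes_since_migration = 0
    return lost


route = OrgRoute(org_id=42, cell=PROTOCELL,
                 migrated_at=datetime(2025, 11, 17), writes_since_migration=3)
lost = flip_back(route, now=datetime(2025, 11, 20), ttl=timedelta(weeks=2))
print(route.cell, lost)  # legacy-cell 3
```

The returned count makes the trade-off explicit: the flip itself is cheap, but every write made on the Protocell since migration is discarded, which is why the option suits inactive organizations and immediate failures.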

Option 2: Full Migration Back

This approach is the required long-term solution for handling active customers but requires unknown engineering effort. If required for Cohort 1, then it will impact the Cohort 1 timeline.

| Aspect | Detail |
| --- | --- |
| Mechanism | Implement a full data migration process that moves an organization's data back from the Protocell to the Legacy Cell. This requires the same level of complexity and validation as the forward migration process (Legacy → Protocell). |
| Effort/Timeline | Unknown Effort / Some Delay. Nearly all of the Rails work is equally applicable in both directions. None of the SRE work has been done yet, so we still have the opportunity to ensure that it considers both directions during design and implementation. The bulk of the delay will likely come from testing and iterating on rollback. This adds an unknown amount of work and will inevitably extend the time before the first organization can be migrated to the Protocell. |
| User Journey / Data Impact | Data Preserved. The user's most recent changes made on the Protocell are preserved. The rollback is an explicit data migration with a planned downtime/cutover, similar to the initial migration. |
| Viability | Long TTL / Permanent. This is a permanent and robust solution for handling active, paying customers in subsequent cohorts where data loss is non-negotiable. |
| Applicability | Required for Cohorts 2+ (active users) and as a robust solution for Cohort 1 beyond the initial 1-2 week window. It is the only option that addresses feature parity without requiring the user to lose data. It is usually not useful for Migration Failure scenarios. |

Option 3: Direct Transfer

This approach leverages Direct Transfer, GitLab's native bulk import mechanism, to perform a reverse migration from the Protocell back to the Legacy Cell.

| Aspect | Detail |
| --- | --- |
| Mechanism | Use GitLab's Direct Transfer (bulk import API) to migrate an organization's data back from the Protocell to the Legacy Cell. This is the same mechanism used for the forward migration, but executed in reverse. |
| Effort/Timeline | Low-to-Moderate Effort / Moderate Delay. Direct Transfer is a well-established, native GitLab feature with existing infrastructure and tooling. However, it still requires validation and testing. Less effort than Option 2 (Full Migration Back) but more than Option 1 (Topology Flip). |
| User Journey / Data Impact | Most Data Preserved, IDs Change. Direct Transfer excludes certain data (such as CI/CD variables, deploy tokens, webhooks, custom fields). If present, they must be manually restored or reconfigured. Direct Transfer mutates resource IDs (groups, projects, etc.). Users will need to update any external references, integrations, or bookmarks that depend on the old IDs. |
| Viability | Moderate TTL / Conditional. This option is not yet viable on Cells, but it is achievable. It can serve as a middle-ground solution with its own set of tradeoffs. |
| Applicability | Suitable for Cohort 1 (inactive/dormant users) as a fallback beyond the Topology Flip window, or for emergency scenarios where data preservation is critical but ID mutation is acceptable. Not recommended for active users without careful communication about ID changes. Not recommended for users with data that Direct Transfer does not copy. Avoids the complexity of Geo-based solutions. Does not verify data integrity. |
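
To make the ID-mutation cost concrete, the sketch below builds an old-to-new ID mapping that could accompany the communication to affected users. It assumes, as a simplification not confirmed by this document, that group and project full paths are unchanged by the reverse migration; `build_id_map`, `unmatched_paths`, and the sample inventories are hypothetical.

```python
def build_id_map(before: dict[str, int], after: dict[str, int]) -> dict[int, int]:
    """Map pre-rollback IDs to post-rollback IDs by matching full paths.

    Assumption: full paths survive the reverse migration unchanged.
    """
    return {old_id: after[path] for path, old_id in before.items() if path in after}


def unmatched_paths(before: dict[str, int], after: dict[str, int]) -> list[str]:
    """Paths that existed before the rollback but are missing afterwards."""
    return sorted(path for path in before if path not in after)


# Hypothetical inventories: full path -> resource ID on each side.
before = {"acme/api": 101, "acme/web": 102, "acme/docs": 103}
after = {"acme/api": 9001, "acme/web": 9002}

id_map = build_id_map(before, after)
print(id_map)                          # {101: 9001, 102: 9002}
print(unmatched_paths(before, after))  # ['acme/docs']
```

Because neither tool verifies data integrity, the list of unmatched paths is also a cheap first check that nothing was silently dropped.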

Option 4: Congregate

This approach uses Congregate, a tool created by professional services that extends GitLab's native bulk import capabilities to handle additional data types and provide enhanced migration orchestration.

| Aspect | Detail |
| --- | --- |
| Mechanism | Use Congregate to migrate from the Protocell to the Legacy Cell. Congregate wraps GitLab's APIs and extends Direct Transfer with support for additional data types (CI/CD variables, webhooks, container/package registries, environments, etc.). Requires running a container with network access to both source and destination instances. |
| Effort/Timeline | Moderate Effort / Moderate Delay. Congregate requires running a container and is more complex than Direct Transfer. However, it provides better coverage of data types and enhanced automation for complex scenarios. The effort is less than Option 2 (Full Migration Back) but more than Option 3 (Direct Transfer). |
| User Journey / Data Impact | Most Data Preserved, IDs Change. Congregate preserves more data than Direct Transfer, including CI/CD variables, webhooks, container/package registries, environments, and other items. However, like Direct Transfer, Congregate mutates resource IDs (groups, projects, etc.). Users will need to update any external references, integrations, or bookmarks that depend on the old IDs. Some items remain unsupported (deploy tokens, audit events, services/integrations, runners). |
| Viability | Moderate TTL / Conditional. This option is not yet viable on Cells, but it is achievable. It can serve as a middle-ground solution with better data coverage than Direct Transfer. |
| Applicability | Suitable for organizations with data that Direct Transfer does not handle. Better for Cohort 1 (inactive/dormant users) beyond the Topology Flip window if they have data that Direct Transfer excludes. Not recommended for active users without careful communication about ID changes. Avoids the complexity of Geo-based solutions. Does not verify data integrity. |

Recommendation

The options are not mutually exclusive.

Option 1 (Topology Flip) is a requirement for Migration Failure scenarios, and therefore is a requirement before migrating the first organization.

The limitations of Option 1 (Topology Flip) are largely not applicable to inactive users, so we can use it with little risk for as long as the users remain inactive.

Option 2 (Full Migration Back) is a hard requirement for Paying Users. The set of organizations that we need to migrate in order to see benefits on the Legacy DBs is likely to include at least one loud, extremely dissatisfied customer with a legitimate severity 1 or severity 2 ticket. We would spend a great deal of effort to support their needs. That effort should be spent upfront on testing and iterating on full migration back, rather than on emergency actions. It should also be a requirement for Active Free Users for similar reasons: migrating back in an ad-hoc manner could be very costly.

Options 3 (Direct Transfer) and 4 (Congregate) provide intermediate solutions for Cohort 1 users beyond the Topology Flip window. Each can be used only on organizations whose data the chosen tool fully handles. Both mutate resource IDs, requiring users to update external references. Option 3 (Direct Transfer) is simpler and requires no additional infrastructure, making it suitable for organizations with straightforward configurations. Option 4 (Congregate) is better suited for organizations with complex configurations (CI/CD variables, webhooks, registries, environments) that Direct Transfer cannot handle. Both options avoid the complexity of Geo-based solutions and can serve as fallbacks for Option 2. However, neither option verifies data integrity, and both are constrained by feature parity: only compatible groups/projects can be migrated back.
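
The selection rules in this section can be summarized as a rough decision helper. This is a sketch of the reasoning only, not a tool that exists: the function name, parameters, and data-type keys are hypothetical, and the two sets encode only the exclusions named in this document (data types not named here are not modeled).

```python
# Data types this document says each tool does not carry over.
DIRECT_TRANSFER_EXCLUDED = {"ci_cd_variables", "deploy_tokens", "webhooks", "custom_fields"}
CONGREGATE_UNSUPPORTED = {"deploy_tokens", "audit_events", "integrations", "runners"}


def viable_rollback_options(
    data_types: set[str],
    active: bool,
    legacy_copy_retained: bool,
    tolerates_id_change: bool,
) -> list[str]:
    """Return the rollback options that fit an organization, in option order."""
    options = []
    # Option 1 discards Protocell writes, so it is low risk only while the
    # organization is inactive and the frozen Legacy copy still exists.
    if legacy_copy_retained and not active:
        options.append("Option 1 (Topology Flip)")
    # Option 2 preserves data and IDs, so it applies once it has been built.
    options.append("Option 2 (Full Migration Back)")
    # Options 3 and 4 change IDs and skip some data types.
    if tolerates_id_change and not (data_types & DIRECT_TRANSFER_EXCLUDED):
        options.append("Option 3 (Direct Transfer)")
    if tolerates_id_change and not (data_types & CONGREGATE_UNSUPPORTED):
        options.append("Option 4 (Congregate)")
    return options


print(viable_rollback_options(
    data_types={"webhooks", "ci_cd_variables"},
    active=False,
    legacy_copy_retained=False,
    tolerates_id_change=True,
))
# ['Option 2 (Full Migration Back)', 'Option 4 (Congregate)']
```

In this example the organization has webhooks and CI/CD variables and its Legacy copy has already been deleted, so only Full Migration Back and Congregate remain.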

Action: SREs in the Tenant Services group should estimate the additional effort for Option 2 (Full Migration Back) and present their findings for a formal decision on its timeline and resource allocation.

Questions

  1. How long should we retain old data on the Legacy Cell? GitLab reaps no benefits on the Legacy DBs until the old data is deleted. I propose 2-4 weeks.
  2. Should we keep Cohort 1 read-only/archived for a long time, and offer rollback if they become active? If Feature Parity is insufficient even for low-activity users, then yes. However, we cannot delete old data on the Legacy Cell until we have sufficient Feature Parity to disable read-only on the Protocell, so insufficient Feature Parity for a Cohort blocks any benefits to GitLab.
Edited by Michael Kozono