Perform an analysis of GitLab.com users to define Cell Migration Cohorts

Data Validation & Cohort Definitions for Protocells Migration

Summary

As part of the Protocells initiative to address GitLab.com database scaling challenges, we need to identify and define user cohorts for phased migration to isolated cells. This work will establish a risk-based framework for selecting users/namespaces to migrate, starting with the lowest risk cohorts and progressively moving to higher risk groups.

Background

GitLab.com is facing critical database capacity constraints that threaten platform viability. The Protocells initiative aims to relieve pressure on the main database ("Dumbo") by migrating specific user cohorts to smaller, isolated cells. To ensure a successful migration with minimal disruption, we need to:

  1. Identify users/namespaces that can be safely migrated
  2. Assess risk levels for different user segments
  3. Define a phased migration approach based on risk tolerance

Objectives

  • Primary Goal: Define data-driven user cohorts for Protocells migration within 2 weeks
  • Risk Assessment: Establish clear risk criteria and categorization for each cohort
  • Migration Waves: Design a phased approach starting with lowest-risk users

Cohort Risk Framework

Low Risk Users (Wave 1)

Users meeting ALL of the following criteria:

  • Dormant/Inactive: No activity in the last [X] months
  • Namespace Isolation: Only interact with a single root namespace
  • Free Tier: No paid subscriptions or add-ons
  • Limited Feature Usage: Use only basic features (to be defined)

Medium Risk Users (Wave 2)

  • Active users who interact with only a single root namespace
  • Free tier with regular activity
  • May use additional features beyond basic set

Higher Risk Users (Future Waves)

  • Paid tier users
  • Cross-namespace interactions
  • Heavy feature usage
  • Integration dependencies
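
The framework above can be read as a decision rule, which may help when reviewing edge cases with stakeholders. The following is a minimal sketch only, not the agreed methodology: `UserProfile`, its fields, and `classify` are hypothetical names, the dormancy window stays a caller-supplied parameter because [X] is not yet decided, and the basic feature set is passed in because it is still to be defined. "Heavy feature usage" has no threshold yet, so it is not modelled here. Treating a dormant free user who has used non-basic features as Wave 2 is also an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class UserProfile:
    """Hypothetical per-user summary; field names are illustrative only."""
    user_id: int
    last_activity_at: Optional[datetime]  # None means no recorded activity
    root_namespace_ids: FrozenSet[int]    # root namespaces the user interacts with
    is_paid: bool                         # any paid subscription or add-on
    has_integrations: bool                # any integration dependency
    features_used: FrozenSet[str]


def classify(
    user: UserProfile,
    now: datetime,
    dormancy_window: timedelta,       # stands in for the undecided "[X] months"
    basic_features: FrozenSet[str],   # stands in for the undefined basic set
) -> str:
    """Return 'wave_1', 'wave_2', or 'higher' for one user."""
    single_namespace = len(user.root_namespace_ids) <= 1

    # Any one higher-risk signal moves the user out of the first two waves.
    if user.is_paid or user.has_integrations or not single_namespace:
        return "higher"

    dormant = (
        user.last_activity_at is None
        or now - user.last_activity_at > dormancy_window
    )
    basic_only = user.features_used <= basic_features  # subset check

    # Wave 1 requires ALL low-risk criteria to hold at once.
    if dormant and basic_only:
        return "wave_1"

    # Assumption: every remaining free, single-namespace user is Wave 2.
    return "wave_2"
```

Keeping the window and the feature set as arguments means the same rule can be re-run with different candidate values while those two definitions are still open.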

Deliverables

  1. Data Analysis Report

    • SQL queries to identify user cohorts
    • Cohort size estimates
    • Feature usage analysis per cohort
    • Risk assessment methodology
  2. Cohort Definition Document

    • Clear criteria for each cohort
    • Migration wave assignments
    • Risk mitigation strategies
    • Success metrics
  3. Feature Compatibility Matrix

    • List of features available on Protocells
    • Feature usage by cohort
    • Gap analysis for required features

Data Points to Analyze

  • User activity patterns (last login, last activity)
  • Namespace interaction analysis (single vs. multiple)
  • Subscription status and tier
  • Feature usage metrics:
    • CI/CD pipeline usage
    • Container Registry usage
    • Git operations frequency
    • Issue/MR activity
    • Wiki/Pages usage
    • Integration usage
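
As an illustration of the namespace interaction analysis (single vs. multiple), the sketch below groups interaction records by user and counts distinct root namespaces. It assumes the input has already been reduced to `(user_id, root_namespace_id)` pairs; that shape, and the function names, are hypothetical and do not reflect the actual schema or the SQL queries listed under Deliverables.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def root_namespaces_per_user(
    interactions: Iterable[Tuple[int, int]],
) -> Dict[int, Set[int]]:
    """Map each user_id to the set of root namespaces it has touched."""
    seen: Dict[int, Set[int]] = defaultdict(set)
    for user_id, root_namespace_id in interactions:
        seen[user_id].add(root_namespace_id)
    return dict(seen)


def split_single_vs_multiple(
    interactions: Iterable[Tuple[int, int]],
) -> Tuple[Set[int], Set[int]]:
    """Return (single-namespace users, multi-namespace users)."""
    per_user = root_namespaces_per_user(interactions)
    single = {u for u, ns in per_user.items() if len(ns) == 1}
    multiple = {u for u, ns in per_user.items() if len(ns) > 1}
    return single, multiple
```

Users with no interaction records at all never appear in either set, so they would need to be picked up separately from the user activity data.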

Success Criteria

  • Cohorts identified whose migration offers a significant potential reduction in database load
  • Clear, data-driven criteria for cohort selection
  • Risk assessment validated with stakeholders
  • Migration waves prioritized by risk/benefit ratio
  • Feature gaps documented and addressed

Timeline

  • Due Date: July 10, 2025 (2 weeks from offsite)
  • Week 1: Data extraction and initial analysis
  • Week 2: Cohort refinement, risk assessment, and documentation

Related References

Stakeholders

/cc @glopezfernandez @nhxnguyen @andrewn @tkuah @sxuereb @marin
