Perform an analysis of GitLab.com users to define Cell Migration Cohorts
Data Validation & Cohort Definitions for Protocells Migration
Summary
As part of the Protocells initiative to address GitLab.com database scaling challenges, we need to identify and define user cohorts for phased migration to isolated cells. This work will establish a risk-based framework for selecting users/namespaces to migrate, starting with the lowest-risk cohorts and progressively moving to higher-risk groups.
Background
GitLab.com is facing critical database capacity constraints that threaten platform viability. The Protocells initiative aims to relieve pressure on the main database ("Dumbo") by migrating specific user cohorts to smaller, isolated cells. To ensure a successful migration with minimal disruption, we need to:
- Identify users/namespaces that can be safely migrated
- Assess risk levels for different user segments
- Define a phased migration approach based on risk tolerance
Objectives
- Primary Goal: Define data-driven user cohorts for Protocells migration within 2 weeks
- Risk Assessment: Establish clear risk criteria and categorization for each cohort
- Migration Waves: Design a phased approach starting with lowest-risk users
Cohort Risk Framework
Low Risk Users (Wave 1)
Users meeting ALL of the following criteria:
- Dormant/Inactive: No activity in the last [X] months
- Namespace Isolation: Only interact with a single root namespace
- Free Tier: No paid subscriptions or add-ons
- Limited Feature Usage: Use only basic features (to be defined)
Medium Risk Users (Wave 2)
- Active users with single namespace interaction
- Free tier with regular activity
- May use additional features beyond basic set
Higher Risk Users (Future Waves)
- Paid tier users
- Cross-namespace interactions
- Heavy feature usage
- Integration dependencies
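The criteria above can be sketched as a small classification function. This is a minimal illustration, not the agreed methodology: the `UserProfile` fields, the `assign_wave` name, and the wave labels are hypothetical. The dormancy threshold ([X] months) and the basic feature set are passed in as parameters because both are still to be defined. "Heavy feature usage" is left out because no threshold for it exists yet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserProfile:
    """Hypothetical per-user summary; field names are assumptions."""
    months_since_last_activity: float
    root_namespace_count: int            # distinct root namespaces interacted with
    is_paid: bool                        # any paid subscription or add-on
    has_integrations: bool = False       # any integration dependency
    features_used: frozenset = frozenset()


def assign_wave(user: UserProfile, dormancy_months: float, basic_features: frozenset) -> str:
    """Map a user to a migration wave using the draft risk framework.

    dormancy_months and basic_features are inputs because the issue
    leaves both undefined.
    """
    # Higher risk: paid tier, cross-namespace interaction, or integrations.
    if user.is_paid or user.root_namespace_count > 1 or user.has_integrations:
        return "future_waves"

    dormant = user.months_since_last_activity >= dormancy_months
    basic_only = user.features_used <= basic_features  # subset check

    # Low risk requires ALL criteria: dormant, single namespace, free, basic features.
    if dormant and basic_only:
        return "wave_1"

    # Remaining free-tier, single-namespace users: active and/or using extra features.
    return "wave_2"
```

Keeping the threshold and feature set as arguments makes it possible to re-run the same classification with different candidate values and compare the resulting cohort sizes.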
Deliverables
- Data Analysis Report
  - SQL queries to identify user cohorts
  - Cohort size estimates
  - Feature usage analysis per cohort
  - Risk assessment methodology
- Cohort Definition Document
  - Clear criteria for each cohort
  - Migration wave assignments
  - Risk mitigation strategies
  - Success metrics
- Feature Compatibility Matrix
  - List of features available on Protocells
  - Feature usage by cohort
  - Gap analysis for required features
Data Points to Analyze
- User activity patterns (last login, last activity)
- Namespace interaction analysis (single vs. multiple)
- Subscription status and tier
- Feature usage metrics:
  - CI/CD pipeline usage
  - Container Registry usage
  - Git operations frequency
  - Issue/MR activity
  - Wiki/Pages usage
  - Integration usage
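As a rough sketch of the activity and namespace-interaction analysis, the snippet below reduces a stream of events to a per-user summary. The event shape `(user_id, root_namespace_id, occurred_at)` and the function name `summarize_activity` are assumptions for illustration; the real analysis would likely run as SQL against the actual data sources rather than in Python.

```python
from collections import defaultdict
from datetime import datetime


def summarize_activity(events):
    """Summarize events into per-user namespace counts and last activity.

    events: iterable of (user_id, root_namespace_id, occurred_at) tuples,
    a hypothetical shape chosen for this sketch.
    """
    namespaces = defaultdict(set)  # user_id -> distinct root namespaces touched
    last_seen = {}                 # user_id -> most recent event timestamp

    for user_id, root_namespace_id, occurred_at in events:
        namespaces[user_id].add(root_namespace_id)
        if user_id not in last_seen or occurred_at > last_seen[user_id]:
            last_seen[user_id] = occurred_at

    return {
        user_id: {
            "root_namespace_count": len(ns),
            "last_activity": last_seen[user_id],
        }
        for user_id, ns in namespaces.items()
    }
```

A user whose `root_namespace_count` is 1 is a candidate for the single-namespace cohorts, and `last_activity` feeds the dormancy check once the [X]-month threshold is chosen.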
Success Criteria
- Identified cohorts representing significant database load reduction potential
- Clear, data-driven criteria for cohort selection
- Risk assessment validated with stakeholders
- Migration waves prioritized by risk/benefit ratio
- Feature gaps documented and addressed
Timeline
- Due Date: July 10, 2025 (2 weeks from offsite)
- Week 1: Data extraction and initial analysis
- Week 2: Cohort refinement, risk assessment, and documentation
Related References
- GTS Offsite Madrid Document
- Protocells Design Document
- Previous analysis: "3 million users haven't touched anything outside 1 namespace over 6-7 years"
Stakeholders
/cc @glopezfernandez @nhxnguyen @andrewn @tkuah @sxuereb @marin