Establish Ownership, Process, and SLOs for DB Migration Squashing
Context
Follow-up to issue #81 (comment 2475486069)
Problem Statement
The runtime of RSpec background migration jobs and related jobs (e.g., db:check:schema) is increasing linearly with the growing number of DB migrations in our codebase and affects pipeline runtime across all tiers. This indicates a maintenance problem that requires regular squashing of DB migrations. However, we currently lack a sustainable approach to DB migration squashing that addresses the challenges of context, ownership, process, and scope.
Current Situation
- The latest DB migration squashing was performed in gitlab-org/quality/engineering-productivity/team#564 (closed) and some MRs for this squash contained over 1,000 file changes!
- Key challenges identified:
  - Context: DB migrations involve changes specific to multiple stages and groups; no single team can gather all context.
  - Ownership: DevEx teams are not positioned as appropriate owners as they lack the necessary context.
  - Process: Incomplete documentation, tooling, and processes mean migrations must be squashed manually.
  - Scope: Squashing requires updating code references and rewriting RSpec tests to match the new schema, crossing multiple team boundaries.
Impact
This initiative addresses several critical pain points affecting team performance and our core values of
- Blocked iteration cycles: Unrelated test failures and long-running RSpec migrations delay MR approvals.
  - Engineers spend valuable time waiting rather than delivering incremental value.
- Delayed Incident Response
- Capacity issues and context switching: When DB squashing does occur, it forces team members to:
  - Create and review overly complex, large MRs
  - Context-switch away from their planned work
  - Divert capacity from value-adding tasks to maintenance work
- System performance degradation: Our current reactive approach results in:
  - Steadily increasing pipeline runtimes across all tiers
  - Accumulating technical debt
  - Growing long-term maintenance burden and performance issues
Solution
Short term
Squash migrations manually, following https://docs.gitlab.com/development/database/migration_squashing/
Long term
We need to establish:
- Shared Ownership Model:
  - DevEx teams to build necessary tooling and processes
  - Stage/group teams to handle migrations related to their areas
  - Clear responsibility for cross-stage migrations
- Standardized Process:
  - Documented procedure for DB migration squashing
  - Defined SLOs for migration squashing frequency (e.g., monthly, quarterly)
  - Automated tooling to assist with migration squashing (e.g., related discussion)
- Proactive Approach:
  - Regularly scheduled migration squashing instead of a reactive approach
  - Monitoring system to track growth of migrations and alert when approaching thresholds
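As a rough illustration of the monitoring idea above, here is a minimal sketch that counts migration files and compares the count against thresholds. It assumes migrations live in `db/migrate` and `db/post_migrate` and follow the Rails `<timestamp>_<name>.rb` naming convention. The function names and the threshold values are hypothetical and would need to be agreed as part of the SLO work.

```python
from pathlib import Path
import re

# Rails migration files are named "<14-digit timestamp>_<name>.rb".
MIGRATION_NAME = re.compile(r"^\d{14}_.+\.rb$")


def count_migrations(repo_root, dirs=("db/migrate", "db/post_migrate")):
    """Count migration files under the given directories.

    The directory list is an assumption; missing directories count as 0.
    """
    total = 0
    for rel in dirs:
        path = Path(repo_root) / rel
        if path.is_dir():
            total += sum(
                1
                for f in path.iterdir()
                if f.is_file() and MIGRATION_NAME.match(f.name)
            )
    return total


def check_threshold(count, warn_at, fail_at):
    """Classify a migration count as "ok", "warn", or "fail".

    warn_at and fail_at are hypothetical thresholds, not agreed values.
    """
    if count >= fail_at:
        return "fail"
    if count >= warn_at:
        return "warn"
    return "ok"


if __name__ == "__main__":
    # Placeholder thresholds for illustration only.
    n = count_migrations(".")
    print(f"{n} migrations: {check_threshold(n, warn_at=500, fail_at=1000)}")
```

A scheduled pipeline job could run a check like this and open an issue or post an alert when the status changes to "warn", so that squashing is planned before pipeline runtimes degrade.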
Goals
- Identify stakeholders from each stage/group to participate in defining the ownership model
- Document the current process based on recent migration squashing work
- Define SLOs for DB migration squashing (frequency, timeline, etc.)
- Explore automation opportunities to reduce manual effort
- Create a proposal for a shared ownership model
Related work
- gitlab-org/quality/engineering-productivity/team#564 (closed) - Previous DB migration squashing work
- Active discussion about start-as-if-foss pipelines - Optimize background migrations specs (gitlab-org/gitlab#519578)
Edited by Abhinaba Ghosh