Backfill ci_finished_builds with stage_name and other required fields
Summary
- Backfill the ci_finished_builds ClickHouse table with stage_name and the other columns listed under Scope for all records from the past 180 days (6 months).
- This will enable users to analyze historical job metrics grouped by stage immediately upon feature release, rather than waiting for data to accumulate.
Context
The stage_name column has been added to the ci_finished_builds ClickHouse table and is now being synced for new builds. However, historical records (prior to the sync implementation) lack this field, creating a data gap that would negatively impact user experience and dashboard adoption.
Scope
Columns Included
- stage_name
- namespace_path
- group_name
- failure_reason
- manual
- allow_failure
- user_id
- artifact_filename
- artifact_size
- retries_count
- runner_tags
- job_definition_id
Columns Excluded (Deferred)
- tags: Excluded to reduce complexity and ensure the backfill can complete within the milestone timeline. May be considered in a future iteration.
Rollout Strategy
GitLab.com
- Target: Start running migration in 18.9
- The backfill will be initiated on .com and run until completion
- Estimated duration: TBD (will be determined via database-testing CI job)
Self-Managed
- Migration to be added by 18.9 at the latest (before the 18.11 required stop)
- Finalization in 19.0 after .com backfill completes
- Reference: Batched Background Migrations documentation
Feature Enablement
The "group by stage name" feature will be enabled only after the batched background migration completes:
- GitLab.com: Feature flag enabled once migration finishes
- Self-managed: Follows the standard finalization process at a required stop
This approach ensures users don't see incomplete data for older jobs.
Note: The feature flag for "group by stage name" has not been implemented yet.
Implementation Approach (Draft)
Use BatchedBackgroundMigration to backfill stage_name for all ci_finished_builds records with finished_at in the last 180 days. The migration should:
- Join ci_builds with ci_stages to fetch the stage_name
- Update records using the ReplacingMergeTree versioning mechanism (version, deleted columns)
- Run on a replica to minimize impact on production Postgres
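The per-batch logic described above can be sketched in plain Ruby. This is only an illustration under stated assumptions, not the actual migration: `Build`, `Stage`, `backfill_rows`, and `merged_view` are hypothetical names, the Postgres rows are replaced by in-memory structs, and `merged_view` merely simulates the assumed ReplacingMergeTree behavior, where the row with the highest version wins for each key.

```ruby
# Hypothetical in-memory stand-ins for rows read from ci_builds and ci_stages.
Build = Struct.new(:id, :stage_id, :finished_at, keyword_init: true)
Stage = Struct.new(:id, :name, keyword_init: true)

BACKFILL_WINDOW_DAYS = 180
SECONDS_PER_DAY = 24 * 60 * 60

# Builds the rows to re-insert into ci_finished_builds for one batch.
# Each row carries a fresh `version` so that it supersedes the older row
# with the same id once ClickHouse merges the parts.
def backfill_rows(builds, stages, now:)
  cutoff = now - (BACKFILL_WINDOW_DAYS * SECONDS_PER_DAY)
  stage_names = stages.to_h { |stage| [stage.id, stage.name] }

  builds
    .select { |build| build.finished_at && build.finished_at >= cutoff }
    .map do |build|
      {
        id: build.id,
        # Assumption: a missing stage maps to an empty string.
        stage_name: stage_names.fetch(build.stage_id, ''),
        version: now,
        deleted: false
      }
    end
end

# Simulates the assumed merge semantics: per id, keep the row with the
# highest version, then drop rows flagged as deleted.
def merged_view(rows)
  rows
    .group_by { |row| row[:id] }
    .map { |_id, versions| versions.max_by { |row| row[:version] } }
    .reject { |row| row[:deleted] }
end

# Usage: one old row without stage_name, then the backfilled replacement.
now = Time.utc(2026, 1, 1)
stages = [Stage.new(id: 1, name: 'test')]
builds = [Build.new(id: 10, stage_id: 1, finished_at: now - SECONDS_PER_DAY)]
existing = [{ id: 10, stage_name: '', version: now - 3600, deleted: false }]

puts merged_view(existing + backfill_rows(builds, stages, now: now)).inspect
```

Because the backfill inserts new row versions instead of mutating existing rows, the old and new rows can coexist until ClickHouse merges them, so queries should read the table in a way that respects the version column.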
Related Issues and MRs
- Parent: #580441 (closed) - Sync stage_name to ClickHouse for job analytics grouping
- Related: #464713 - Backfill root_namespace_id in ci_finished_builds
- Related: Support `stage_name` in `CiJobAnalytics` GraphQ... (!217156 - merged)
- Related: Sync `stage_name` to `ci_finished_builds` click... (!217043 - merged)
- Related: Add `stage_name` to `ci_finished_builds` ClickH... (!216825 - merged)
- DevAnalytics Observer (CH importer): https://gitlab.com/gitlab-org/quality/observer/-/blob/main/app/services/transformers/build_event.rb?ref_type=heads
- Observability team OpenTelemetry exporter: https://gitlab.com/gitlab-org/gitlab/blob/608553b90d1fbe443da0c585785c211762cefa83/app/services/ci/observability/export_service.rb#L83-86
Edited by Narendran