Add ClickHouse migrations to rebuild CI materialized views
What does this MR do and why?
This MR adds two ClickHouse migrations to rebuild the ci_finished_pipelines_daily and ci_finished_pipelines_hourly aggregate tables that were inflated by duplicate rows in ci_finished_pipelines (see #586319 / #588891 (closed)). The rebuild deletes existing aggregates month-by-month and reinserts deduplicated data using GROUP BY id with any() to avoid the cost of FINAL on large datasets.
Note: This MR only addresses ci_finished_pipelines MVs.
Related issues
- Related to #586319
- Related to #588891 (closed) (cross-partition duplicate sync events)
- Follow-up: #588909 (add version column to ci_finished_pipelines)
Solution
For each target table (ci_finished_pipelines_daily, ci_finished_pipelines_hourly) the migration:
- Fetches a reference date from ClickHouse to avoid clock skew
- Iterates month-by-month for the last 12 months including current month
- Deletes existing data for the month (or from current month onwards)
- Waits for the delete mutation to finish
- Inserts corrected aggregates by scanning the source table in
idranges
Why GROUP BY id with any() instead of FINAL?
The ci_finished_pipelines table uses ReplacingMergeTree without a version column. This means:
| Approach | Performance (270M rows/month) | Notes |
|---|---|---|
FINAL |
~10s (with version column) | Fast, but requires version column |
FINAL |
Unknown (without version column) | May be slower without optimization |
GROUP BY id + any()
|
~68s | Consistent, backport-safe to 18.6 |
GROUP BY id + argMax()
|
~99s | More deterministic but slower |
We chose GROUP BY id with any() because:
- Backport-safe: Works on 18.6 which lacks a version column
- Correct for this use case: Duplicate rows have identical values (same pipeline, just inserted twice per #588891 (closed))
- Reasonable performance: ~68s per month is acceptable for a one-time migration
Edge Cases Handled
| Issue | Fix |
|---|---|
| Day boundary crossing | Capture reference_date at migration start; use calendar month boundaries |
| Clock skew between Rails and ClickHouse | Query ClickHouse server for today() instead of Rails' Time.zone.today
|
| Mutation race condition | Poll system.mutations until DELETE mutations complete before INSERT |
| No source data | Skip inserts and log when source table is empty |
MVs Rebuilt
| # | Target Table | Source Table |
|---|---|---|
| 1 | ci_finished_pipelines_daily |
ci_finished_pipelines |
| 2 | ci_finished_pipelines_hourly |
ci_finished_pipelines |
How to set up and validate locally
Prerequisites
-
Ensure ClickHouse is running in your GDK:
gdk start clickhouse -
Verify ClickHouse connection:
clickhouse client --port 9001 -d gitlab_clickhouse_development -u default --query "SELECT 1"
Run the migrations
bundle exec rake gitlab:clickhouse:migrate:main
Run specs
bundle exec rspec spec/db/click_house/migrate/main/20260204165152_backfill_ci_finished_pipelines_daily_spec.rb --format documentation
bundle exec rspec spec/db/click_house/migrate/main/20260204165153_backfill_ci_finished_pipelines_hourly_spec.rb --format documentation
MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.