Add ClickHouse migrations to rebuild CI materialized views

What does this MR do and why?

This MR adds two ClickHouse migrations to rebuild the ci_finished_pipelines_daily and ci_finished_pipelines_hourly aggregate tables that were inflated by duplicate rows in ci_finished_pipelines (see #586319 / #588891 (closed)). The rebuild deletes existing aggregates month-by-month and reinserts deduplicated data using GROUP BY id with any() to avoid the cost of FINAL on large datasets.

Note: This MR only addresses ci_finished_pipelines MVs.

  • Related to #586319
  • Related to #588891 (closed) (cross-partition duplicate sync events)
  • Follow-up: #588909 (add version column to ci_finished_pipelines)

Solution

For each target table (ci_finished_pipelines_daily, ci_finished_pipelines_hourly) the migration:

  1. Fetches a reference date from ClickHouse to avoid clock skew
  2. Iterates month-by-month for the last 12 months including current month
  3. Deletes existing data for the month (or from current month onwards)
  4. Waits for the delete mutation to finish
  5. Inserts corrected aggregates by scanning the source table in id ranges

Why GROUP BY id with any() instead of FINAL?

The ci_finished_pipelines table uses ReplacingMergeTree without a version column. This means:

Approach Performance (270M rows/month) Notes
FINAL ~10s (with version column) Fast, but requires version column
FINAL Unknown (without version column) May be slower without optimization
GROUP BY id + any() ~68s Consistent, backport-safe to 18.6
GROUP BY id + argMax() ~99s More deterministic but slower

We chose GROUP BY id with any() because:

  • Backport-safe: Works on 18.6 which lacks a version column
  • Correct for this use case: Duplicate rows have identical values (same pipeline, just inserted twice per #588891 (closed))
  • Reasonable performance: ~68s per month is acceptable for a one-time migration

Edge Cases Handled

Issue Fix
Day boundary crossing Capture reference_date at migration start; use calendar month boundaries
Clock skew between Rails and ClickHouse Query ClickHouse server for today() instead of Rails' Time.zone.today
Mutation race condition Poll system.mutations until DELETE mutations complete before INSERT
No source data Skip inserts and log when source table is empty

MVs Rebuilt

# Target Table Source Table
1 ci_finished_pipelines_daily ci_finished_pipelines
2 ci_finished_pipelines_hourly ci_finished_pipelines

How to set up and validate locally

Prerequisites

  1. Ensure ClickHouse is running in your GDK:

    gdk start clickhouse
  2. Verify ClickHouse connection:

    clickhouse client --port 9001 -d gitlab_clickhouse_development -u default --query "SELECT 1"

Run the migrations

bundle exec rake gitlab:clickhouse:migrate:main

Run specs

bundle exec rspec spec/db/click_house/migrate/main/20260204165152_backfill_ci_finished_pipelines_daily_spec.rb --format documentation
bundle exec rspec spec/db/click_house/migrate/main/20260204165153_backfill_ci_finished_pipelines_hourly_spec.rb --format documentation

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Pedro Pombeiro

Merge request reports

Loading