Sign in or sign up before continuing. Don't have an account yet? Register now to get started.
Register now

Fix cross-partition duplicate sync events in ClickHouse pipeline sync

Summary

Duplicate sync events can be created across different partitions in p_ci_finished_pipeline_ch_sync_events when a pipeline transitions to an unlockable status multiple times (e.g., manual → success, or success → success on retry). The ClickHouse MVs using countState() count these duplicates at INSERT time, inflating pipeline counts by ~0.37%.

Root Cause

  1. The pipeline state machine allows multiple transitions to UNLOCKABLE_STATUSES (success, failed, canceled, skipped)
  2. Each transition triggers Ci::PipelineFinishedWorker, which upserts a sync event
  3. The p_ci_finished_pipeline_ch_sync_events table uses sliding list partitions with uniqueness constraint scoped to each partition
  4. If partition rotation occurs between transitions, duplicates can exist across partitions
  5. The ClickHouse MVs use countState() which accumulates counts immediately - once a duplicate is counted, it cannot be "uncounted"

Solution: uniqCombinedState(12)(id) for Duplicate-Resistant Counting

Replace countState() with uniqCombinedState(12)(id) in the MVs to make them inherently idempotent to duplicates. Inserting the same pipeline ID twice will only count it once.

Why uniqCombined(12)

Aspect countState() (current) uniqCombinedState(12)(id) (proposed)
Duplicate handling ❌ Counts all rows ✅ Counts unique pipeline IDs
Idempotent inserts ❌ No ✅ Yes
PG-side checks needed Required (complex) Not required
Accuracy Exact (but wrong due to dupes) ~1-2% error (acceptable for analytics)

Storage Impact (Measured on Production)

Metric Current After Change Impact
count_pipelines column (hourly) 91 MiB compressed ~730 MiB +8x
Total hourly table 6.38 GiB compressed ~7.0 GiB +10%

The storage increase is acceptable because count_pipelines is only 0.81% of the total table size.

Implementation Required

  1. Query layer (lib/click_house/models/ci/finished_pipelines_base.rb):

    # Change from:
    Arel::Nodes::NamedFunction.new('countMerge', [...])
    # To:
    Arel::Nodes::NamedFunction.new('uniqCombinedMerge(12)', [...])
  2. Table schemas (ci_finished_pipelines_hourly + ci_finished_pipelines_daily):

    -- Change from:
    count_pipelines AggregateFunction(count)
    -- To:
    count_pipelines AggregateFunction(uniqCombined(12), UInt64)
  3. MV definitions (ci_finished_pipelines_hourly_mv + ci_finished_pipelines_daily_mv):

    -- Change from:
    countState() AS count_pipelines
    -- To:
    uniqCombinedState(12)(id) AS count_pipelines
  4. Coordinate with !222153 (closed) - Update backfill migrations to use uniqCombinedState(12)(id)

  5. Close !222300 (closed) - PostgreSQL-side cross-partition check is no longer needed

References

  • ClickHouse uniqCombined docs: https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/uniqcombined
  • Parent issue: #586319
  • MR to close: !222300 (closed) (superseded)
  • MR to coordinate with: !222153 (closed) (MV rebuild/backfill)
Edited Feb 06, 2026 by Pedro Pombeiro
Assignee Loading
Time tracking Loading