Fix cross-partition duplicate sync events in ClickHouse pipeline sync
Summary
Duplicate sync events can be created across different partitions in p_ci_finished_pipeline_ch_sync_events when a pipeline transitions to an unlockable status multiple times (e.g., manual → success, or success → success on retry). The ClickHouse MVs using countState() count these duplicates at INSERT time, inflating pipeline counts by ~0.37%.
Root Cause
- The pipeline state machine allows multiple transitions to
UNLOCKABLE_STATUSES(success, failed, canceled, skipped) - Each transition triggers
Ci::PipelineFinishedWorker, which upserts a sync event - The
p_ci_finished_pipeline_ch_sync_eventstable uses sliding list partitions with uniqueness constraint scoped to each partition - If partition rotation occurs between transitions, duplicates can exist across partitions
- The ClickHouse MVs use
countState()which accumulates counts immediately - once a duplicate is counted, it cannot be "uncounted"
Solution: uniqCombinedState(12)(id) for Duplicate-Resistant Counting
Replace countState() with uniqCombinedState(12)(id) in the MVs to make them inherently idempotent to duplicates. Inserting the same pipeline ID twice will only count it once.
Why uniqCombined(12)
| Aspect |
countState() (current) |
uniqCombinedState(12)(id) (proposed) |
|---|---|---|
| Duplicate handling |
|
|
| Idempotent inserts |
|
|
| PG-side checks needed | Required (complex) | Not required |
| Accuracy | Exact (but wrong due to dupes) | ~1-2% error (acceptable for analytics) |
Storage Impact (Measured on Production)
| Metric | Current | After Change | Impact |
|---|---|---|---|
count_pipelines column (hourly) |
91 MiB compressed | ~730 MiB | +8x |
| Total hourly table | 6.38 GiB compressed | ~7.0 GiB | +10% |
The storage increase is acceptable because count_pipelines is only 0.81% of the total table size.
Implementation Required
-
Query layer (
lib/click_house/models/ci/finished_pipelines_base.rb):# Change from: Arel::Nodes::NamedFunction.new('countMerge', [...]) # To: Arel::Nodes::NamedFunction.new('uniqCombinedMerge(12)', [...]) -
Table schemas (
ci_finished_pipelines_hourly+ci_finished_pipelines_daily):-- Change from: count_pipelines AggregateFunction(count) -- To: count_pipelines AggregateFunction(uniqCombined(12), UInt64) -
MV definitions (
ci_finished_pipelines_hourly_mv+ci_finished_pipelines_daily_mv):-- Change from: countState() AS count_pipelines -- To: uniqCombinedState(12)(id) AS count_pipelines -
Coordinate with !222153 (closed) - Update backfill migrations to use
uniqCombinedState(12)(id) -
Close !222300 (closed) - PostgreSQL-side cross-partition check is no longer needed
References
- ClickHouse
uniqCombineddocs: https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/uniqcombined - Parent issue: #586319
- MR to close: !222300 (closed) (superseded)
- MR to coordinate with: !222153 (closed) (MV rebuild/backfill)