Fix cross-partition duplicate sync events in ClickHouse pipeline sync

What does this MR do and why?

Fixes cross-partition duplicate sync events in p_ci_finished_pipeline_ch_sync_events that cause materialized view (MV) drift in ClickHouse.

Related to #588891 (closed)

Problem

The p_ci_finished_pipeline_ch_sync_events table uses a sliding list partition strategy with daily rotation. Pipelines can transition to UNLOCKABLE_STATUSES multiple times (e.g., manual → success, or success → success on retry), each triggering PipelineFinishedWorker.

When these transitions occur across a partition boundary:

Time T1 (partition=7910): Pipeline 123 → manual → sync event (123, 7910)
Time T2: Partition rotates, default becomes 7911  
Time T3 (partition=7911): Pipeline 123 → success → sync event (123, 7911) ← No conflict!

The upsert's unique_by: [:pipeline_id, :partition] doesn't prevent cross-partition duplicates. PostgreSQL requires unique constraints on a partitioned table to include the partition key, so uniqueness is enforced only within each partition, never across partitions.

Both sync events are then processed, causing duplicate inserts into ClickHouse, which inflate counts in materialized views.
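
To make the failure mode concrete, here is a minimal plain-Ruby sketch that replays the T1 to T3 timeline against an in-memory stand-in for the table. The class PartitionedSyncEvents and its methods are hypothetical and are not part of the GitLab codebase. The hash key [pipeline_id, partition] only imitates the unique index, and the sketch assumes that rotation increments the partition value by one.

```ruby
# Toy in-memory model of a partitioned table whose unique key is
# (pipeline_id, partition). All names here are hypothetical.
class PartitionedSyncEvents
  attr_reader :rows, :current_partition

  def initialize(current_partition)
    @current_partition = current_partition
    @rows = {} # keyed by [pipeline_id, partition], like the unique index
  end

  # Assumption: rotation simply moves the default partition to the next value.
  def rotate!
    @current_partition += 1
  end

  # Imitates an upsert with unique_by: [:pipeline_id, :partition]:
  # a conflict is only possible within the current partition.
  def upsert(pipeline_id, status)
    @rows[[pipeline_id, current_partition]] = status
  end

  # Partitions that hold a sync event for the given pipeline.
  def partitions_for(pipeline_id)
    @rows.keys.select { |(id, _)| id == pipeline_id }.map(&:last)
  end
end

events = PartitionedSyncEvents.new(7910)
events.upsert(123, "manual")   # T1: row (123, 7910)
events.rotate!                 # T2: default partition becomes 7911
events.upsert(123, "success")  # T3: key (123, 7911) differs, so no conflict
p events.partitions_for(123)   # => [7910, 7911]
```

A second upsert within the same partition would overwrite the existing row, which is why the duplicates only appear when a rotation falls between two transitions.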

Solution

Add an exists_for_pipeline? method to FinishedPipelineChSyncEvent and use it in PipelineFinishedWorker to check across all partitions before upserting. This prevents duplicate sync events while keeping the first event's values (acceptable trade-off documented in #588891 (closed)).
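
The check-before-upsert flow can be sketched in plain Ruby as follows. Only the method name exists_for_pipeline? comes from this MR. FakeSyncEventStore, record_sync_event, and the hash-based storage are assumptions made for illustration, and the sketch ignores concurrency between the check and the write.

```ruby
# Plain-Ruby stand-in for FinishedPipelineChSyncEvent. The storage and
# method bodies are assumptions; only the name exists_for_pipeline? is from the MR.
class FakeSyncEventStore
  attr_accessor :current_partition
  attr_reader :rows

  def initialize(current_partition)
    @current_partition = current_partition
    @rows = {} # keyed by [pipeline_id, partition]
  end

  # Looks at every partition, not only the current one.
  def exists_for_pipeline?(pipeline_id)
    @rows.each_key.any? { |(id, _)| id == pipeline_id }
  end

  def upsert(pipeline_id, status)
    @rows[[pipeline_id, current_partition]] = status
  end
end

# Hypothetical worker step: skip the upsert when any partition
# already holds a sync event for the pipeline.
def record_sync_event(store, pipeline_id, status)
  return false if store.exists_for_pipeline?(pipeline_id)

  store.upsert(pipeline_id, status)
  true
end

store = FakeSyncEventStore.new(7910)
record_sync_event(store, 123, "manual")   # inserted into partition 7910
store.current_partition = 7911            # partition rotates
record_sync_event(store, 123, "success")  # skipped, first event is kept
p store.rows.keys # => [[123, 7910]]
```

The surviving row keeps the values of the first transition, which matches the trade-off described above.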

Evidence from production

  • Post-fix drift was 1.77% (higher than pre-fix average of 0.87%)
  • ~0.37% of sync events are duplicates across partitions
  • All sampled duplicates show events in adjacent partitions (e.g., [7910, 7911])
  • Time between duplicate events ranges from 18 minutes to 6+ hours (time between state transitions)

Future optimization

The current implementation queries all partitions via the existing index_p_ci_finished_pipeline_ch_sync_events_on_pipeline_id index. A future optimization could query only the preceding partition, though this adds complexity and the performance gain may be minimal given the index.
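
A narrower lookup might look like the sketch below. It is illustrative only: exists_in_adjacent_partitions? is a hypothetical helper, the rows hash stands in for the table, and the code assumes partition values are consecutive integers so that the preceding partition is current_partition - 1.

```ruby
# Hypothetical narrower check: look only at the current and preceding partition.
# `rows` is a Hash keyed by [pipeline_id, partition], standing in for the table.
# Assumption: partition values are consecutive integers.
def exists_in_adjacent_partitions?(rows, pipeline_id, current_partition)
  [current_partition, current_partition - 1].any? do |partition|
    rows.key?([pipeline_id, partition])
  end
end

rows = { [123, 7910] => "manual", [456, 7905] => "success" }
p exists_in_adjacent_partitions?(rows, 123, 7911) # => true
p exists_in_adjacent_partitions?(rows, 456, 7911) # => false
```

The second call shows the cost of the narrower scope: an event in an older partition goes unnoticed, so this variant is only safe if duplicates never span more than one rotation.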

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Pedro Pombeiro
