Disable async_insert in build and pipeline sync operations
What does this MR do and why?
Disables async_insert in ClickHouse build and pipeline sync operations to prevent data duplication in materialized views.
Problem: The ci_finished_pipelines_daily materialized view shows significant data inflation (up to 80% for some months) compared to the source ci_finished_pipelines table. Investigation revealed that HTTP read timeouts during async inserts cause the client to believe insertions failed, triggering retries that result in duplicate data.
Root cause: With async_insert=1 and wait_for_async_insert=1, ClickHouse holds the HTTP response until the buffered data is flushed, for up to wait_for_async_insert_timeout=120s. That wait can exceed the Ruby HTTP client's read timeout. When this happens:
- The HTTP client times out waiting for a response
- The insertion may have actually succeeded in ClickHouse
- Sidekiq retries the job, inserting the same data again
- The source table (ReplacingMergeTree) deduplicates, but the MV (AggregatingMergeTree) accumulates duplicates
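The sequence above can be illustrated with a small in-memory model. This is a toy sketch, not the sync code touched by this MR: `SourceTable`, `DailyView`, `SimulatedReadTimeout`, and `insert_with_retry` are hypothetical names, and the retry loop stands in for a Sidekiq retry. It assumes the source table keeps one row per key while the MV aggregates every inserted row.

```ruby
# Toy model of the failure mode; all names here are hypothetical.
class SimulatedReadTimeout < StandardError; end

# Stands in for the ReplacingMergeTree source table: one row per key survives.
# (Real ReplacingMergeTree deduplicates at merge time, not on insert.)
class SourceTable
  def initialize
    @rows = {}
  end

  def insert(batch)
    batch.each { |row| @rows[row[:id]] = row }
  end

  def count
    @rows.size
  end
end

# Stands in for the AggregatingMergeTree MV: every inserted row is counted.
class DailyView
  def initialize
    @counts = Hash.new(0)
  end

  def insert(batch)
    batch.each { |row| @counts[row[:day]] += 1 }
  end

  def total
    @counts.values.sum
  end
end

# The write lands on the server first; the client then times out and retries.
def insert_with_retry(batch, source, view, timeout_on_first_attempt:)
  attempts = 0
  begin
    attempts += 1
    source.insert(batch)
    view.insert(batch)
    raise SimulatedReadTimeout if timeout_on_first_attempt && attempts == 1
  rescue SimulatedReadTimeout
    retry
  end
  attempts
end

batch = [{ id: 1, day: "day_a" }, { id: 2, day: "day_a" }]
source = SourceTable.new
view = DailyView.new
insert_with_retry(batch, source, view, timeout_on_first_attempt: true)

puts source.count # 2: the retried rows replaced the originals
puts view.total   # 4: the view counted both attempts
```

In this model the source table ends up correct while the view is inflated, which matches the discrepancy described in the problem statement.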
Solution: Remove async_insert settings entirely. Since we already batch data before insertion, async insert provides minimal benefit while introducing timeout-related reliability issues.
Changelog: fixed
References
- Original discussion: https://gitlab.com/gitlab-org/gitlab/-/work_items/586319#note_3013011851
- Parent issue: https://gitlab.com/gitlab-org/gitlab/-/work_items/586319
- Related MR (disable retries): !219149 (merged)
How to set up and validate locally
- Connect GDK to a ClickHouse Cloud instance
- Trigger pipeline/build sync operations
- Verify insertions complete without timeout errors
- Confirm no duplicate data in materialized views
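For the last step, one possible check is to compare per-day pipeline counts from the source table against the counts reported by the materialized view. The helper below is a minimal sketch under that assumption: `inflation_by_day` is a hypothetical name, and both hashes are expected to be filled from whatever queries you run against ci_finished_pipelines and ci_finished_pipelines_daily. The day keys and numbers in the usage lines are illustrative only.

```ruby
# Hypothetical helper: both arguments map a day to a pipeline count.
# Returns the percentage difference for each day where the view disagrees
# with the source. Days with a zero source count are skipped.
def inflation_by_day(source_counts, view_counts)
  source_counts.each_with_object({}) do |(day, expected), report|
    actual = view_counts.fetch(day, 0)
    next if actual == expected || expected.zero?

    report[day] = ((actual - expected) * 100.0 / expected).round(1)
  end
end

# Illustrative data only, not measured values.
source_counts = { "day_a" => 100, "day_b" => 50 }
view_counts   = { "day_a" => 180, "day_b" => 50 }

p inflation_by_day(source_counts, view_counts)
```

An empty result means the view matches the source for every day that was checked.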
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.