Skip to content

[Fleet Visibility] Backfill CI job performance materialized view

Everyone can contribute. Help move this issue forward while earning points, leveling up and collecting rewards.

Summary

Backfill the ci_job_performance_daily_mv materialized view with historical CI job performance data.

Background

Part of the implementation for &18548. After creating the materialized view in #555974 (closed), we need to backfill it with historical data.

Blocked by #555974 (closed) - The materialized view must be created before we can backfill it.

Implementation Details

Backfill Query

The backfill should be executed using the following SQL:

INSERT INTO gitlab_clickhouse_development.ci_job_performance_daily_mv
SELECT
    toDate(b.finished_at) AS date,
    b.project_id,
    b.stage_id,
    b.name,
    p.source,
    p.ref,
    quantilesState(0.5, 0.95)(b.duration) AS duration_quantiles,
    countState() AS total_builds,
    countStateIf(b.status = 'failed') AS failed_builds
FROM gitlab_clickhouse_development.ci_finished_builds b
INNER JOIN gitlab_clickhouse_development.ci_finished_pipelines p 
    ON b.pipeline_id = p.id
WHERE b.finished_at > 0
    AND b.finished_at >= today() - INTERVAL 180 DAY
GROUP BY date, b.project_id, b.stage_id, b.name, p.source, p.ref;

Backfill Strategy

For large-scale deployments, consider backfilling in chunks by month to avoid overwhelming the system:

-- Example: Backfill one month at a time
INSERT INTO gitlab_clickhouse_development.ci_job_performance_daily_mv
SELECT
    toDate(b.finished_at) AS date,
    b.project_id,
    b.stage_id,
    b.name,
    p.source,
    p.ref,
    quantilesState(0.5, 0.95)(b.duration) AS duration_quantiles,
    countState() AS total_builds,
    countStateIf(b.status = 'failed') AS failed_builds
FROM gitlab_clickhouse_development.ci_finished_builds b
INNER JOIN gitlab_clickhouse_development.ci_finished_pipelines p 
    ON b.pipeline_id = p.id
WHERE b.finished_at > 0
    AND b.finished_at >= '2025-01-01'
    AND b.finished_at < '2025-02-01'
GROUP BY date, b.project_id, b.stage_id, b.name, p.source, p.ref;

Considerations

  1. Time Range: Default to backfilling the last 180 days to support the largest time window in the UI
  2. Performance: Monitor ClickHouse cluster performance during backfill
  3. Chunking: For GitLab.com, may need to backfill in smaller chunks (e.g., by week or by project ranges)
  4. Verification: After backfill, verify data integrity with sample queries

Acceptance Criteria

  • Backfill script is created and tested
  • Historical data for the last 180 days is loaded into the MV
  • Performance impact during backfill is acceptable
  • Data integrity is verified post-backfill

Verification Query

After backfill, verify the data with:

-- Check data distribution by date
SELECT 
    toStartOfMonth(date) AS month,
    count() AS num_days,
    sum(countMerge(total_builds)) AS total_builds
FROM gitlab_clickhouse_development.ci_job_performance_daily_mv
GROUP BY month
ORDER BY month DESC;

Note: This issue was created by Claude based on implementation discussion in epic &18548

Edited by 🤖 GitLab Bot 🤖