Fix statement timeouts on p_ci_job_artifacts during pipeline deletion

What does this MR do and why?

Fixes statement timeouts when deleting pipelines with many builds/job artifacts.

The Ci::DestroyPipelineService path calls pipeline.destroy!, which triggers use_fast_destroy :job_artifacts. This calls each_batch on pipeline.job_artifacts — a has_many :job_artifacts, through: :builds association — generating a JOIN between p_ci_job_artifacts and p_ci_builds. The ORDER BY p_ci_job_artifacts.id ... LIMIT 1 pattern that each_batch emits leads the PostgreSQL planner to scan via p_ci_job_artifacts_pkey instead of using p_ci_builds_commit_id_status_type_idx, causing full table scans and statement timeouts.

A Dedicated customer experiences a 100% failure rate on pipeline deletion. Pipelines can have 44K+ builds (based on data queried in ClickHouse production).

Why FastDestroyAll / use_fast_destroy doesn't work here

The use_fast_destroy :job_artifacts pattern calls perform_fast_destroy(pipeline.job_artifacts) which delegates to Ci::JobArtifact.begin_fast_destroy. Inside begin_fast_destroy, the DestroyAssociationsService calls each_batch on the has_many :through relation. This is where the problematic JOIN query is generated — the planner picks the wrong index because each_batch adds ORDER BY id ASC LIMIT 1, which biases toward scanning the primary key index.

The fundamental issue is that FastDestroyAll operates on a single flat relation and has no concept of batching through an intermediate association. It receives pipeline.job_artifacts (the through-association) and calls each_batch directly on it, which always generates the JOIN.
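The flat-relation constraint can be sketched in plain, self-contained Ruby. Everything below is an illustrative stand-in for the real FastDestroyAll / begin_fast_destroy machinery, not the actual GitLab code:

```ruby
# Minimal sketch (no Rails) of why the FastDestroyAll pattern can only
# batch over one flat relation. All names here are illustrative.
module FastDestroySketch
  BATCH_SIZE = 100

  # Stand-in for each_batch: yields id slices in ascending order.
  # On the real has_many :through relation, every slice boundary is
  # computed with an "ORDER BY id LIMIT 1 OFFSET n" query over the JOIN,
  # which is exactly where the planner picks the wrong index.
  def self.each_batch(ids)
    ids.sort.each_slice(BATCH_SIZE) { |batch| yield batch }
  end

  # Stand-in for begin_fast_destroy: collects file paths to delete later.
  # It receives one flat collection and batches it directly; it has no
  # way to batch through an intermediate (builds) association first.
  def self.begin_fast_destroy(artifact_ids)
    paths = []
    each_batch(artifact_ids) do |batch|
      paths.concat(batch.map { |id| "artifacts/#{id}" })
    end
    paths
  end
end
```

The sketch makes the limitation visible: whatever relation `begin_fast_destroy` receives is the only thing it can batch over.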

Why two-level batching is needed

A single flat CTE-based relation (wrapping all build IDs in one CTE) doesn't scale either. For a pipeline with 44K builds, the CTE materializes all build IDs and does a nested loop join (44K loops), taking ~18.5s — still timing out.

The solution requires two-level batching:

  1. Level 1 (builds): Batch pipeline builds in groups of 100 using Pipeline#builds_with_cte — a CTE-wrapped relation that forces the planner to use p_ci_builds_commit_id_status_type_idx
  2. Level 2 (artifacts): For each build batch, query Ci::JobArtifact.where(job_id: [batch_ids], partition_id: ...) and batch-destroy artifacts using the existing DestroyAssociationsService

This keeps each query operating on a small, bounded set.
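The two levels above can be sketched in memory with plain Ruby, arrays and hashes standing in for the SQL relations (all names are illustrative, not the real service code):

```ruby
BUILD_BATCH_SIZE    = 100
ARTIFACT_BATCH_SIZE = 100

# builds:           array of build ids belonging to the pipeline
# artifacts_by_job: { job_id => [artifact_id, ...] }
def destroy_in_two_levels(builds, artifacts_by_job)
  destroyed = []

  # Level 1: batch the pipeline's builds (the real code batches the
  # CTE-wrapped builds relation from Pipeline#builds_with_cte).
  builds.sort.each_slice(BUILD_BATCH_SIZE) do |build_batch|
    # Level 2: look up artifacts only for this bounded set of job ids
    # (the real code queries Ci::JobArtifact.where(job_id: build_batch, ...)).
    artifact_ids = build_batch.flat_map { |id| artifacts_by_job.fetch(id, []) }

    artifact_ids.sort.each_slice(ARTIFACT_BATCH_SIZE) do |artifact_batch|
      destroyed.concat(artifact_batch) # stand-in for a batch destroy
    end
  end

  destroyed
end
```

No single step ever touches more than one build batch's worth of rows, which is the property that keeps each SQL statement bounded.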

Approach

  1. Introduce Ci::Pipelines::DestroyAssociationsService — a new service that orchestrates two-level batched destruction. It iterates Pipeline#builds_with_cte in batches of 100, and for each build batch delegates artifact destruction to Ci::JobArtifacts::DestroyAssociationsService. A single DestroyAssociationsService instance accumulates statistics across all build batches, so finalize_fast_destroy can be called once with a single service object.

  2. Replace use_fast_destroy :job_artifacts with a before_destroy callback: Pipeline#destroy_job_artifact_associations calls the new service and defers statistics finalization via run_after_commit, preserving the same transactional safety as FastDestroyAll.

  3. Refactor Ci::JobArtifacts::DestroyAssociationsService — move the relation from the constructor to destroy_records(relation) as a parameter, enabling the same service instance to be reused across multiple build batches while accumulating statistics.

  4. Add Pipeline#builds_with_cte — wraps the builds query in a materialized CTE that forces the planner to use p_ci_builds_commit_id_status_type_idx. This is also reused by Ci::UnlockPipelineService, removing its duplicated CTE construction.
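The reuse-one-instance idea from step 3 can be sketched like this. The class and method names mirror the MR description, but the bodies are simplified stand-ins, not the real implementation:

```ruby
# Sketch of the refactored Ci::JobArtifacts::DestroyAssociationsService
# shape: the relation moves from the constructor to destroy_records, so
# one instance survives across build batches and accumulates statistics.
class DestroyAssociationsSketch
  def initialize
    @destroyed_count = 0 # statistics accumulated across all batches
  end

  attr_reader :destroyed_count

  # Called once per build batch with that batch's artifact collection
  # (an array here, an ActiveRecord relation in the real service).
  def destroy_records(relation)
    @destroyed_count += relation.size
  end

  # Called exactly once after all batches, deferred via run_after_commit
  # in the real code so statistics update only if the destroy commits.
  def update_statistics
    { artifacts_destroyed: @destroyed_count }
  end
end
```

Because the caller holds a single instance, finalize-time work sees the totals from every batch instead of just the last one.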

What stays unchanged

  • Ci::JobArtifact.begin_fast_destroy / finalize_fast_destroy — the FastDestroyAll contract on the model is preserved
  • Projects::DestroyService#destroy_orphaned_ci_job_artifacts! — uses begin_fast_destroy on a plain Ci::JobArtifact.for_project(project) scope (no through-association, no planner issue), so it continues to work unchanged
  • FastDestroyAll concern — not modified

References

Queries

OLD query (problematic — from pipeline.job_artifacts.each_batch)

SELECT "p_ci_job_artifacts"."id" FROM "p_ci_job_artifacts"
INNER JOIN "p_ci_builds" ON "p_ci_job_artifacts"."job_id" = "p_ci_builds"."id"
WHERE "p_ci_builds"."type" = 'Ci::Build'
  AND "p_ci_builds"."commit_id" = :pipeline_id
  AND "p_ci_builds"."partition_id" = :partition_id
  AND "p_ci_job_artifacts"."partition_id" = :partition_id
ORDER BY "p_ci_job_artifacts"."id" ASC
LIMIT 1 OFFSET 100;

The planner picks p_ci_job_artifacts_pkey to satisfy ORDER BY id LIMIT 1, then walks the artifacts table in id order, checking every row against the JOIN condition until it has skipped past 100 matches.

NEW queries (two-level batching)

Level 1: Batch builds via CTE — find start

WITH "cte_builds" AS MATERIALIZED (
  SELECT "p_ci_builds"."id" FROM "p_ci_builds"
  WHERE "p_ci_builds"."type" = 'Ci::Build'
    AND "p_ci_builds"."commit_id" = :pipeline_id
    AND "p_ci_builds"."partition_id" = :partition_id
    AND "p_ci_builds"."status" IN (
      'created', 'waiting_for_resource', 'preparing', 'waiting_for_callback',
      'pending', 'running', 'failed', 'success', 'canceled', 'canceling',
      'skipped', 'manual', 'scheduled'
    )
)
SELECT "p_ci_builds"."id" FROM "cte_builds" AS "p_ci_builds"
ORDER BY "p_ci_builds"."id" ASC
LIMIT 1;

Level 1: Batch builds via CTE — find boundary

WITH "cte_builds" AS MATERIALIZED (
  SELECT "p_ci_builds"."id" FROM "p_ci_builds"
  WHERE "p_ci_builds"."type" = 'Ci::Build'
    AND "p_ci_builds"."commit_id" = :pipeline_id
    AND "p_ci_builds"."partition_id" = :partition_id
    AND "p_ci_builds"."status" IN (
      'created', 'waiting_for_resource', 'preparing', 'waiting_for_callback',
      'pending', 'running', 'failed', 'success', 'canceled', 'canceling',
      'skipped', 'manual', 'scheduled'
    )
)
SELECT "p_ci_builds"."id" FROM "cte_builds" AS "p_ci_builds"
WHERE "p_ci_builds"."id" >= :start_id
ORDER BY "p_ci_builds"."id" ASC
LIMIT 1 OFFSET 100;

Level 1: Fetch build batch IDs (pluck)

WITH "cte_builds" AS MATERIALIZED (
  SELECT "p_ci_builds"."id" FROM "p_ci_builds"
  WHERE "p_ci_builds"."type" = 'Ci::Build'
    AND "p_ci_builds"."commit_id" = :pipeline_id
    AND "p_ci_builds"."partition_id" = :partition_id
    AND "p_ci_builds"."status" IN (
      'created', 'waiting_for_resource', 'preparing', 'waiting_for_callback',
      'pending', 'running', 'failed', 'success', 'canceled', 'canceling',
      'skipped', 'manual', 'scheduled'
    )
)
SELECT "p_ci_builds"."id" FROM "cte_builds" AS "p_ci_builds"
WHERE "p_ci_builds"."id" >= :start_id
  AND "p_ci_builds"."id" < :stop_id;

Level 2: Batch artifacts per build batch — find start

SELECT "p_ci_job_artifacts"."id" FROM "p_ci_job_artifacts"
WHERE "p_ci_job_artifacts"."job_id" IN (:build_id_1, :build_id_2, ..., :build_id_100)
  AND "p_ci_job_artifacts"."partition_id" = :partition_id
ORDER BY "p_ci_job_artifacts"."id" ASC
LIMIT 1;

Level 2: Batch artifacts per build batch — find boundary

SELECT "p_ci_job_artifacts"."id" FROM "p_ci_job_artifacts"
WHERE "p_ci_job_artifacts"."job_id" IN (:build_id_1, :build_id_2, ..., :build_id_100)
  AND "p_ci_job_artifacts"."partition_id" = :partition_id
  AND "p_ci_job_artifacts"."id" >= :start_id
ORDER BY "p_ci_job_artifacts"."id" ASC
LIMIT 1 OFFSET 100;

Level 2: Fetch artifact batch records

SELECT "p_ci_job_artifacts".* FROM "p_ci_job_artifacts"
WHERE "p_ci_job_artifacts"."job_id" IN (:build_id_1, :build_id_2, ..., :build_id_100)
  AND "p_ci_job_artifacts"."partition_id" = :partition_id
  AND "p_ci_job_artifacts"."id" >= :start_id
  AND "p_ci_job_artifacts"."id" < :stop_id;

Query Plans

OLD Approach (JOIN via has_many :job_artifacts, through: :builds)

Find first batch start ID

SELECT "p_ci_job_artifacts"."id" FROM "p_ci_job_artifacts"
INNER JOIN "p_ci_builds" ON "p_ci_job_artifacts"."job_id" = "p_ci_builds"."id"
WHERE "p_ci_builds"."type" = 'Ci::Build'
  AND "p_ci_builds"."commit_id" = :pipeline_id
  AND "p_ci_builds"."partition_id" = :partition_id
  AND "p_ci_job_artifacts"."partition_id" = :partition_id
ORDER BY "p_ci_job_artifacts"."id" ASC
LIMIT 1;

Query Plan: https://console.postgres.ai/gitlab/gitlab-production-ci/sessions/49328/commands/147202

Find batch boundary (stop ID)

SELECT "p_ci_job_artifacts"."id" FROM "p_ci_job_artifacts"
INNER JOIN "p_ci_builds" ON "p_ci_job_artifacts"."job_id" = "p_ci_builds"."id"
WHERE "p_ci_builds"."type" = 'Ci::Build'
  AND "p_ci_builds"."commit_id" = :pipeline_id
  AND "p_ci_builds"."partition_id" = :partition_id
  AND "p_ci_job_artifacts"."partition_id" = :partition_id
  AND "p_ci_job_artifacts"."id" >= :start_id
ORDER BY "p_ci_job_artifacts"."id" ASC
LIMIT 1 OFFSET 100;

Query Plan: https://console.postgres.ai/gitlab/gitlab-production-ci/sessions/49328/commands/147203

Fetch batch records

SELECT "p_ci_job_artifacts".* FROM "p_ci_job_artifacts"
INNER JOIN "p_ci_builds" ON "p_ci_job_artifacts"."job_id" = "p_ci_builds"."id"
WHERE "p_ci_builds"."type" = 'Ci::Build'
  AND "p_ci_builds"."commit_id" = :pipeline_id
  AND "p_ci_builds"."partition_id" = :partition_id
  AND "p_ci_job_artifacts"."partition_id" = :partition_id
  AND "p_ci_job_artifacts"."id" >= :start_id
  AND "p_ci_job_artifacts"."id" < :stop_id;

Query Plan: https://console.postgres.ai/gitlab/gitlab-production-ci/sessions/49328/commands/147205

NEW Approach (two-level batching: CTE builds → artifacts)

Level 1: Batch builds — find start

WITH "cte_builds" AS MATERIALIZED (
  SELECT "p_ci_builds"."id" FROM "p_ci_builds"
  WHERE "p_ci_builds"."type" = 'Ci::Build'
    AND "p_ci_builds"."commit_id" = :pipeline_id
    AND "p_ci_builds"."partition_id" = :partition_id
    AND "p_ci_builds"."status" IN (
      'created', 'waiting_for_resource', 'preparing', 'waiting_for_callback',
      'pending', 'running', 'failed', 'success', 'canceled', 'canceling',
      'skipped', 'manual', 'scheduled'
    )
)
SELECT "p_ci_builds"."id" FROM "cte_builds" AS "p_ci_builds"
ORDER BY "p_ci_builds"."id" ASC
LIMIT 1;

Query Plan: https://console.postgres.ai/gitlab/gitlab-production-ci/sessions/49578/commands/147792

Level 1: Batch builds — find boundary

WITH "cte_builds" AS MATERIALIZED (
  SELECT "p_ci_builds"."id" FROM "p_ci_builds"
  WHERE "p_ci_builds"."type" = 'Ci::Build'
    AND "p_ci_builds"."commit_id" = :pipeline_id
    AND "p_ci_builds"."partition_id" = :partition_id
    AND "p_ci_builds"."status" IN (
      'created', 'waiting_for_resource', 'preparing', 'waiting_for_callback',
      'pending', 'running', 'failed', 'success', 'canceled', 'canceling',
      'skipped', 'manual', 'scheduled'
    )
)
SELECT "p_ci_builds"."id" FROM "cte_builds" AS "p_ci_builds"
WHERE "p_ci_builds"."id" >= :start_id
ORDER BY "p_ci_builds"."id" ASC
LIMIT 1 OFFSET 100;

Query Plan: https://console.postgres.ai/gitlab/gitlab-production-ci/sessions/49578/commands/147826

Level 1: Batch builds — query batch

WITH "cte_builds" AS MATERIALIZED (
  SELECT "p_ci_builds"."id" FROM "p_ci_builds"
  WHERE "p_ci_builds"."type" = 'Ci::Build'
    AND "p_ci_builds"."commit_id" = :pipeline_id
    AND "p_ci_builds"."partition_id" = :partition_id
    AND "p_ci_builds"."status" IN (
      'created', 'waiting_for_resource', 'preparing', 'waiting_for_callback',
      'pending', 'running', 'failed', 'success', 'canceled', 'canceling',
      'skipped', 'manual', 'scheduled'
    )
)
SELECT "p_ci_builds"."id" FROM "cte_builds" AS "p_ci_builds"
WHERE "p_ci_builds"."id" >= :start_id
ORDER BY "p_ci_builds"."id" ASC
LIMIT 100 OFFSET 100;

Query Plan: https://console.postgres.ai/gitlab/gitlab-production-ci/sessions/49578/commands/147827

Level 2: Batch artifacts — find start

SELECT "p_ci_job_artifacts"."id" FROM "p_ci_job_artifacts"
WHERE "p_ci_job_artifacts"."job_id" IN (:build_id_1, :build_id_2, ..., :build_id_100)
  AND "p_ci_job_artifacts"."partition_id" = :partition_id
ORDER BY "p_ci_job_artifacts"."id" ASC
LIMIT 1;

Query Plan: https://console.postgres.ai/gitlab/gitlab-production-ci/sessions/49578/commands/147828

Level 2: Batch artifacts — find boundary

SELECT "p_ci_job_artifacts"."id" FROM "p_ci_job_artifacts"
WHERE "p_ci_job_artifacts"."job_id" IN (:build_id_1, :build_id_2, ..., :build_id_100)
  AND "p_ci_job_artifacts"."partition_id" = :partition_id
  AND "p_ci_job_artifacts"."id" >= :start_id
ORDER BY "p_ci_job_artifacts"."id" ASC
LIMIT 1 OFFSET 100;

Query Plan: https://console.postgres.ai/gitlab/gitlab-production-ci/sessions/49578/commands/147830

Level 2: Fetch artifact batch records

SELECT "p_ci_job_artifacts".* FROM "p_ci_job_artifacts"
WHERE "p_ci_job_artifacts"."job_id" IN (:build_id_1, :build_id_2, ..., :build_id_100)
  AND "p_ci_job_artifacts"."partition_id" = :partition_id
  AND "p_ci_job_artifacts"."id" >= :start_id
  AND "p_ci_job_artifacts"."id" < :stop_id;

Query Plan: https://console.postgres.ai/gitlab/gitlab-production-ci/sessions/49578/commands/147832

How to set up and validate locally

Run the specs

# New pipeline destroy associations service
bundle exec rspec spec/services/ci/pipelines/destroy_associations_service_spec.rb

# DestroyAssociationsService (refactored API)
bundle exec rspec spec/services/ci/job_artifacts/destroy_associations_service_spec.rb

# Pipeline model specs (builds_with_cte + destroy_job_artifact_associations)
bundle exec rspec spec/models/ci/pipeline_spec.rb -e 'builds_with_cte' -e 'destroy_job_artifact_associations'

# Full pipeline destruction integration test
bundle exec rspec spec/services/ci/destroy_pipeline_service_spec.rb

# UnlockPipelineService (uses shared builds_with_cte)
bundle exec rspec spec/services/ci/unlock_pipeline_service_spec.rb

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Narendran
