Skip to content

Use new temp index for backfilling artifact expiry

What does this MR do?

  • Uses the new artifact expiration temp index for updating artifacts without an expiry date
  • Reorganizes the background migration to be tidier

Related !55093 (merged)

The index was originally added in !51822 (merged)


When we interpret the scopes this query ends up being:

Ci::JobArtifact.where(expire_at: nil).where("date(created_at AT TIME ZONE 'UTC') < ?::date", Date.new(2020, 06, 22)).where(id: 1_000_000..1_200_000).each_batch(of: 1_000) do |batch|
  batch.where(Ci::JobArtifact.arel_table[:created_at].lt(15.months.ago)).update_all(expire_at: old_artifact_expiry_date)
  batch.where(Ci::JobArtifact.arel_table[:created_at].gt(15.months.ago)).update_all(expire_at: recent_artifact_expiry_date)
end

In terms of raw SQL, the batching looks like this:

SELECT "ci_job_artifacts".* FROM "ci_job_artifacts" WHERE "ci_job_artifacts"."expire_at" IS NULL AND (date(created_at AT TIME ZONE 'UTC') < '2020-06-22'::date) AND "ci_job_artifacts"."id" BETWEEN 1000000 AND 1200000 AND "ci_job_artifacts"."id" >= 100000 AND "ci_job_artifacts"."id" < 1001000

This is what the execution plan looks for that query

Index Scan using expired_artifacts_temp_index on public.ci_job_artifacts  (cost=0.56..37.17 rows=31 width=130) (actual time=6.427..6.428 rows=0 loops=1)
   Index Cond: ((ci_job_artifacts.id >= 1000000) AND (ci_job_artifacts.id <= 1200000) AND (ci_job_artifacts.id >= 100000) AND (ci_job_artifacts.id < 1001000))
   Buffers: shared read=4
   I/O Timings: read=6.383
Time: 9.642 ms
  - planning: 3.170 ms
  - execution: 6.472 ms
    - I/O read: 6.383 ms
    - I/O write: N/A

Shared buffers:
  - hits: 0 from the buffer pool
  - reads: 4 (~32.00 KiB) from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0

https://postgres.ai/console/gitlab/gitlab-production-tunnel/sessions/2393/commands/7335


The actual old artifact update look like this:

UPDATE "ci_job_artifacts" SET "expire_at" = '4713-01-01 00:00:00 BC' WHERE "ci_job_artifacts"."expire_at" IS NULL AND (date(created_at AT TIME ZONE 'UTC') < '2020-06-22'::date) AND "ci_job_artifacts"."id" BETWEEN 1000000 AND 1200000 AND "ci_job_artifacts"."id" >= 1000001 AND "ci_job_artifacts"."id" < 1001000 AND "ci_job_artifacts"."created_at" < '2019-11-24 14:57:34.360379'

Execution plan

ModifyTable on public.ci_job_artifacts  (cost=0.56..15.18 rows=10 width=140) (actual time=0.027..0.027 rows=0 loops=1)
   Buffers: shared hit=4
   ->  Index Scan using expired_artifacts_temp_index on public.ci_job_artifacts  (cost=0.56..15.18 rows=10 width=140) (actual time=0.025..0.025 rows=0 loops=1)
         Index Cond: ((ci_job_artifacts.id >= 1000000) AND (ci_job_artifacts.id <= 1200000) AND (ci_job_artifacts.id >= 1000001) AND (ci_job_artifacts.id < 1001000) AND (ci_job_artifacts.created_at < '2019-11-24 14:57:34.360379+00'::timestamp with time zone))
         Filter: ((ci_job_artifacts.expire_at IS NULL) AND (date(timezone('UTC'::text, ci_job_artifacts.created_at)) < '2020-06-22'::date))
         Rows Removed by Filter: 0
         Buffers: shared hit=4
Time: 0.547 ms
  - planning: 0.412 ms
  - execution: 0.135 ms
    - I/O read: N/A
    - I/O write: N/A

Shared buffers:
  - hits: 4 (~32.00 KiB) from the buffer pool
  - reads: 0 from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0

https://postgres.ai/console/gitlab/gitlab-production-tunnel/sessions/2393/commands/7336


The actual recent artifact update look like this:

UPDATE "ci_job_artifacts" SET "expire_at" = '4713-01-01 00:00:00 BC' WHERE "ci_job_artifacts"."expire_at" IS NULL AND (date(created_at AT TIME ZONE 'UTC') < '2020-06-22'::date) AND "ci_job_artifacts"."id" BETWEEN 1000000 AND 1200000 AND "ci_job_artifacts"."id" >= 1000001 AND "ci_job_artifacts"."id" < 1001000 AND "ci_job_artifacts"."created_at" > '2019-11-24 14:57:34.360379'

Execution plan:

ModifyTable on public.ci_job_artifacts  (cost=0.56..27.92 rows=22 width=140) (actual time=0.025..0.026 rows=0 loops=1)
   Buffers: shared hit=4
   ->  Index Scan using expired_artifacts_temp_index on public.ci_job_artifacts  (cost=0.56..27.92 rows=22 width=140) (actual time=0.023..0.023 rows=0 loops=1)
         Index Cond: ((ci_job_artifacts.id >= 1000000) AND (ci_job_artifacts.id <= 1200000) AND (ci_job_artifacts.id >= 1000001) AND (ci_job_artifacts.id < 1001000) AND (ci_job_artifacts.created_at > '2019-11-24 14:57:34.360379+00'::timestamp with time zone))
         Filter: ((ci_job_artifacts.expire_at IS NULL) AND (date(timezone('UTC'::text, ci_job_artifacts.created_at)) < '2020-06-22'::date))
         Rows Removed by Filter: 0
         Buffers: shared hit=4
Time: 0.507 ms
  - planning: 0.362 ms
  - execution: 0.145 ms
    - I/O read: N/A
    - I/O write: N/A

Shared buffers:
  - hits: 4 (~32.00 KiB) from the buffer pool
  - reads: 0 from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0

https://postgres.ai/console/gitlab/gitlab-production-tunnel/sessions/2393/commands/7337

Related https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10177

Does this MR meet the acceptance criteria?

Conformity

Edited by Matija Čupić

Merge request reports