Use new temp index for backfilling artifact expiry
What does this MR do?
- Uses the new artifact expiration temp index for updating artifacts without an expiry date
- Reorganizes the background migration to be tidier
Related !55093 (merged)
The index was originally added in !51822 (merged)
When we interpret the scopes this query ends up being:
Ci::JobArtifact.where(expire_at: nil).where("date(created_at AT TIME ZONE 'UTC') < ?::date", Date.new(2020, 06, 22)).where(id: 1_000_000..1_200_000).each_batch(of: 1_000) do |batch|
batch.where(Ci::JobArtifact.arel_table[:created_at].lt(15.months.ago)).update_all(expire_at: old_artifact_expiry_date)
batch.where(Ci::JobArtifact.arel_table[:created_at].gt(15.months.ago)).update_all(expire_at: recent_artifact_expiry_date)
end
In terms of raw SQL, the batching looks like this:
SELECT "ci_job_artifacts".* FROM "ci_job_artifacts" WHERE "ci_job_artifacts"."expire_at" IS NULL AND (date(created_at AT TIME ZONE 'UTC') < '2020-06-22'::date) AND "ci_job_artifacts"."id" BETWEEN 1000000 AND 1200000 AND "ci_job_artifacts"."id" >= 100000 AND "ci_job_artifacts"."id" < 1001000
This is what the execution plan looks for that query
Index Scan using expired_artifacts_temp_index on public.ci_job_artifacts (cost=0.56..37.17 rows=31 width=130) (actual time=6.427..6.428 rows=0 loops=1)
Index Cond: ((ci_job_artifacts.id >= 1000000) AND (ci_job_artifacts.id <= 1200000) AND (ci_job_artifacts.id >= 100000) AND (ci_job_artifacts.id < 1001000))
Buffers: shared read=4
I/O Timings: read=6.383
Time: 9.642 ms
- planning: 3.170 ms
- execution: 6.472 ms
- I/O read: 6.383 ms
- I/O write: N/A
Shared buffers:
- hits: 0 from the buffer pool
- reads: 4 (~32.00 KiB) from the OS file cache, including disk I/O
- dirtied: 0
- writes: 0
https://postgres.ai/console/gitlab/gitlab-production-tunnel/sessions/2393/commands/7335
The actual old artifact update look like this:
UPDATE "ci_job_artifacts" SET "expire_at" = '4713-01-01 00:00:00 BC' WHERE "ci_job_artifacts"."expire_at" IS NULL AND (date(created_at AT TIME ZONE 'UTC') < '2020-06-22'::date) AND "ci_job_artifacts"."id" BETWEEN 1000000 AND 1200000 AND "ci_job_artifacts"."id" >= 1000001 AND "ci_job_artifacts"."id" < 1001000 AND "ci_job_artifacts"."created_at" < '2019-11-24 14:57:34.360379'
Execution plan
ModifyTable on public.ci_job_artifacts (cost=0.56..15.18 rows=10 width=140) (actual time=0.027..0.027 rows=0 loops=1)
Buffers: shared hit=4
-> Index Scan using expired_artifacts_temp_index on public.ci_job_artifacts (cost=0.56..15.18 rows=10 width=140) (actual time=0.025..0.025 rows=0 loops=1)
Index Cond: ((ci_job_artifacts.id >= 1000000) AND (ci_job_artifacts.id <= 1200000) AND (ci_job_artifacts.id >= 1000001) AND (ci_job_artifacts.id < 1001000) AND (ci_job_artifacts.created_at < '2019-11-24 14:57:34.360379+00'::timestamp with time zone))
Filter: ((ci_job_artifacts.expire_at IS NULL) AND (date(timezone('UTC'::text, ci_job_artifacts.created_at)) < '2020-06-22'::date))
Rows Removed by Filter: 0
Buffers: shared hit=4
Time: 0.547 ms
- planning: 0.412 ms
- execution: 0.135 ms
- I/O read: N/A
- I/O write: N/A
Shared buffers:
- hits: 4 (~32.00 KiB) from the buffer pool
- reads: 0 from the OS file cache, including disk I/O
- dirtied: 0
- writes: 0
https://postgres.ai/console/gitlab/gitlab-production-tunnel/sessions/2393/commands/7336
The actual recent artifact update look like this:
UPDATE "ci_job_artifacts" SET "expire_at" = '4713-01-01 00:00:00 BC' WHERE "ci_job_artifacts"."expire_at" IS NULL AND (date(created_at AT TIME ZONE 'UTC') < '2020-06-22'::date) AND "ci_job_artifacts"."id" BETWEEN 1000000 AND 1200000 AND "ci_job_artifacts"."id" >= 1000001 AND "ci_job_artifacts"."id" < 1001000 AND "ci_job_artifacts"."created_at" > '2019-11-24 14:57:34.360379'
Execution plan:
ModifyTable on public.ci_job_artifacts (cost=0.56..27.92 rows=22 width=140) (actual time=0.025..0.026 rows=0 loops=1)
Buffers: shared hit=4
-> Index Scan using expired_artifacts_temp_index on public.ci_job_artifacts (cost=0.56..27.92 rows=22 width=140) (actual time=0.023..0.023 rows=0 loops=1)
Index Cond: ((ci_job_artifacts.id >= 1000000) AND (ci_job_artifacts.id <= 1200000) AND (ci_job_artifacts.id >= 1000001) AND (ci_job_artifacts.id < 1001000) AND (ci_job_artifacts.created_at > '2019-11-24 14:57:34.360379+00'::timestamp with time zone))
Filter: ((ci_job_artifacts.expire_at IS NULL) AND (date(timezone('UTC'::text, ci_job_artifacts.created_at)) < '2020-06-22'::date))
Rows Removed by Filter: 0
Buffers: shared hit=4
Time: 0.507 ms
- planning: 0.362 ms
- execution: 0.145 ms
- I/O read: N/A
- I/O write: N/A
Shared buffers:
- hits: 4 (~32.00 KiB) from the buffer pool
- reads: 0 from the OS file cache, including disk I/O
- dirtied: 0
- writes: 0
https://postgres.ai/console/gitlab/gitlab-production-tunnel/sessions/2393/commands/7337
Related https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10177
Does this MR meet the acceptance criteria?
Conformity
Edited by Matija Čupić