
Add migration to prevent trace expiration on .com

drew stachon requested to merge clear-expire-at-migrations-dot-com into master

What does this MR do and why?

This MR adds two migrations: a background migration and another migration to schedule it. This is the main migration to address #322817 (closed).

We backfilled the expire_at column in ci_job_artifacts, but did so without any regard for file_type. The backfill therefore included job traces, which we have never had a policy of removing, and which we should not start removing without telling anyone in advance.

To remedy this, we'll erase the expire_at timestamp for every job trace on gitlab.com where it isn't already null. This corrects all the traces on gitlab.com whose expire_at was written in !47723 (merged), which originally set these timestamps.

Using a partial index on ci_job_artifacts (added in this MR), we spend a few hours building the index and then schedule about 20% fewer jobs, which should ultimately save a couple of days of execution time.
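For context, the partial index is conceptually equivalent to the DDL below. This is only a sketch: the index name is made up, and in the actual migration it would be created through the usual Rails migration helpers rather than raw SQL. The WHERE clause mirrors the filters used by the scheduling and batch queries further down.

-- Sketch only: index name is a placeholder; the real definition lives in the migration in this MR.
CREATE INDEX CONCURRENTLY tmp_index_ci_job_artifacts_on_id_where_trace_expire_at
  ON ci_job_artifacts (id)
  WHERE file_type = 3
    AND expire_at IN ('2021-04-22 00:00:00', '2021-05-22 00:00:00', '2021-06-22 00:00:00', '2022-01-22 00:00:00', '2022-02-22 00:00:00', '2022-03-22 00:00:00', '2022-04-22 00:00:00')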

SELECT count(*) FROM legacy.gitlab_dotcom_ci_job_artifacts
  WHERE expire_at IS NOT NULL
  AND file_type = 3

> 379198476

SELECT count(*)
  FROM legacy.gitlab_dotcom_ci_job_artifacts
  WHERE expire_at IN ('2021-04-22 00:00:00', '2021-05-22 00:00:00', '2021-06-22 00:00:00', '2022-01-22 00:00:00', '2022-02-22 00:00:00', '2022-03-22 00:00:00', '2022-04-22 00:00:00')
  AND file_type = 3
  AND (created_at < '2020-06-22'::date) 
  AND EXTRACT(SECOND FROM expire_at) = 0

> 379198476

SELECT count(*)
  FROM legacy.gitlab_dotcom_ci_job_artifacts
  WHERE expire_at IN ('2021-04-22 00:00:00', '2021-05-22 00:00:00', '2021-06-22 00:00:00', '2022-01-22 00:00:00', '2022-02-22 00:00:00', '2022-03-22 00:00:00', '2022-04-22 00:00:00')

> 474348785

I've also discovered that there are exactly 7 distinct expire_at values that we need to erase:

SELECT DATE(expire_at), locked, count(*)
  FROM legacy.gitlab_dotcom_ci_job_artifacts
  WHERE expire_at IS NOT NULL
    AND file_type = 3
    AND created_at < DATE('2020-06-22')
  GROUP BY DATE(expire_at), locked
  ORDER BY DATE(expire_at) ASC
DATE(expire_at)   locked   count(*)
2021-04-22        2        142,872,968
2021-05-22        2         50,621,292
2021-06-22        2          3,709,294
2022-01-22        2        104,867,260
2022-02-22        2         71,916,299
2022-03-22        2          4,899,297
2022-04-22        2            312,066
Total                      379,198,476

Query Plans

Scheduling Migration

Since we can relatively quickly (2-3 hours in Database Lab, expected to be faster in production) create an index on records with file_type = 3 AND expire_at set to one of the targeted timestamps, doing that first will reduce the number of batched migrations we need to run and save us roughly 3-4 days of background migration time.

SELECT id FROM ci_job_artifacts WHERE file_type = 3 AND expire_at IN ('2021-04-22 00:00:00', '2021-05-22 00:00:00', '2021-06-22 00:00:00', '2022-01-22 00:00:00', '2022-02-22 00:00:00', '2022-03-22 00:00:00', '2022-04-22 00:00:00') AND id >= 11 ORDER BY id ASC LIMIT 1 OFFSET 100000
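As a sketch of how the scheduling walks the table: the id returned by the query above marks the end of the first scheduled range, and the next scheduling query picks up from it, so each scheduled job covers about 100,000 matching artifacts. The concrete id below is hypothetical, chosen to match the id range used in the batch query plans further down.

-- Hypothetical follow-up iteration; 2962748 stands in for whatever id the previous query returned.
SELECT id FROM ci_job_artifacts WHERE file_type = 3 AND expire_at IN ('2021-04-22 00:00:00', '2021-05-22 00:00:00', '2021-06-22 00:00:00', '2022-01-22 00:00:00', '2022-02-22 00:00:00', '2022-03-22 00:00:00', '2022-04-22 00:00:00') AND id >= 2962748 ORDER BY id ASC LIMIT 1 OFFSET 100000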

Background Migration to update expire_at

There are two queries that fire for every batch of 1,000 artifacts, of which there are 100 in each scheduled migration interval (100,000 ids per interval in the scheduling migration).

First query (part of each batch):

The first is a pretty fast index scan with the timestamp filter to select a single ID, similar to the scheduling query. I don't see any issues with this.

SELECT "ci_job_artifacts"."id" FROM "ci_job_artifacts" WHERE "ci_job_artifacts"."file_type" = 3 AND "ci_job_artifacts"."id" BETWEEN 2962748 AND 3062748 AND "ci_job_artifacts"."expire_at" IN ('2021-04-22 00:00:00', '2021-05-22 00:00:00', '2021-06-22 00:00:00', '2022-01-22 00:00:00', '2022-02-22 00:00:00', '2022-03-22 00:00:00', '2022-04-22 00:00:00') AND "ci_job_artifacts"."id" >= 2962748 ORDER BY "ci_job_artifacts"."id" ASC LIMIT 1 OFFSET 1000

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/8503/commands/30162

Second query, updating the batch

The second query takes the longest (~2s) because it does the heavy lifting of actually writing null to the expire_at column of up to 1000 rows at a time. While this is not what I'd consider a fast query, I'm not sure that there's a better way to do it. Adding the temporary index has not helped.

Here are the query plans for the first three batch updates. They use the index and still take a few seconds each. Note that even though the filter removes 0 rows, we can't drop the conditions, because they are what convinces the query planner to use our temporary index in the first place.

UPDATE "ci_job_artifacts" SET "expire_at" = NULL WHERE "ci_job_artifacts"."file_type" = 3 AND "ci_job_artifacts"."id" BETWEEN 2962748 AND 3062748 AND "ci_job_artifacts"."expire_at" IN ('2021-04-22 00:00:00', '2021-05-22 00:00:00', '2021-06-22 00:00:00', '2022-01-22 00:00:00', '2022-02-22 00:00:00', '2022-03-22 00:00:00', '2022-04-22 00:00:00') AND "ci_job_artifacts"."id" >= 2962748 AND "ci_job_artifacts"."id" < 2963748

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/8662/commands/30788

UPDATE "ci_job_artifacts" SET "expire_at" = NULL WHERE "ci_job_artifacts"."file_type" = 3 AND "ci_job_artifacts"."id" BETWEEN 2962748 AND 3062748 AND "ci_job_artifacts"."expire_at" IN ('2021-04-22 00:00:00', '2021-05-22 00:00:00', '2021-06-22 00:00:00', '2022-01-22 00:00:00', '2022-02-22 00:00:00', '2022-03-22 00:00:00', '2022-04-22 00:00:00') AND "ci_job_artifacts"."id" >= 2963748 AND "ci_job_artifacts"."id" < 2964748

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/8662/commands/30789

UPDATE "ci_job_artifacts" SET "expire_at" = NULL WHERE "ci_job_artifacts"."file_type" = 3 AND "ci_job_artifacts"."id" BETWEEN 2962748 AND 3062748 AND "ci_job_artifacts"."expire_at" IN ('2021-04-22 00:00:00', '2021-05-22 00:00:00', '2021-06-22 00:00:00', '2022-01-22 00:00:00', '2022-02-22 00:00:00', '2022-03-22 00:00:00', '2022-04-22 00:00:00') AND "ci_job_artifacts"."id" >= 2964748 AND "ci_job_artifacts"."id" < 2965748

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/8662/commands/30790

They all take a couple of seconds, varying from 1.5s to 4.5s. I'm not sure this is going to get any better, so I'll watch the background migration and pause it if batches start stacking up.
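Once the background migration has drained, one way to sanity-check the result (just a sketch to run against a replica; not part of this MR) is to re-run the targeted count and confirm it approaches zero:

SELECT count(*)
  FROM ci_job_artifacts
  WHERE file_type = 3
  AND expire_at IN ('2021-04-22 00:00:00', '2021-05-22 00:00:00', '2021-06-22 00:00:00', '2022-01-22 00:00:00', '2022-02-22 00:00:00', '2022-03-22 00:00:00', '2022-04-22 00:00:00')

-- Should trend toward 0 as the scheduled jobs complete.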

Screenshots or screen recordings

These are strongly recommended to assist reviewers and reduce the time to merge your change.

How to set up and validate locally

Numbered steps to set up and validate the change are strongly suggested.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
