Skip to content

Prune old RelationExport and RelationExportUpload records

What does this MR do and why?

This MR prunes expired ProjectExportUploads, including upload files. Previously, just ProjectExportUploads records were pruned along with Projects::ImportExport::RelationExports and Projects::ImportExport::RelationExportUploads via ON DELETE CASCADE. However, export files remained. This MR addresses those remaining files.

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Screenshots or screen recordings

No visible changes

Queries

1. ProjectExportJob.prunable delete_all

This query was covered in the original MR to create this job, but the query is now executed in the new PruneExpiredExportJobsService instead of existing in a model method. Here's the query explanation anyway just so it's fresh:

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/26038/commands/81976

Generates SQL queries like this (IDs not real) from .each_batch

SELECT "project_export_jobs".* FROM "project_export_jobs" WHERE (updated_at < '2024-01-30 20:16:30.838492') AND "project_export_jobs"."id" >= 2592430 AND "project_export_jobs"."id" < 3693430;

Explanation:

 Index Scan using project_export_jobs_pkey on public.project_export_jobs  (cost=0.42..3.44 rows=1 width=59) (actual time=0.012..0.013 rows=0 loops=1)
   Index Cond: ((project_export_jobs.id >= 2592430) AND (project_export_jobs.id < 3693430))
   Filter: (project_export_jobs.updated_at < '2024-01-30 20:16:30.838492+00'::timestamp with time zone)
   Rows Removed by Filter: 0
   Buffers: shared hit=6
   I/O Timings: read=0.000 write=0.000

2. prunable_uploads delete_all

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/26038/commands/81996

Generated SQL (without each_batch):

SELECT "uploads".* FROM "uploads" WHERE "uploads"."model_type" = 'Projects::ImportExport::RelationExportUpload' AND "uploads"."model_id" IN (SELECT "project_relation_export_uploads"."id" FROM "project_relation_export_uploads" INNER JOIN "project_relation_exports" "relation_export" ON "relation_export"."id" = "project_relation_export_uploads"."project_relation_export_id" WHERE "relation_export"."project_export_job_id" IN (SELECT "project_export_jobs"."id" FROM "project_export_jobs" WHERE (updated_at < '2024-01-30 21:18:11.186764')));

Explanation:

 Nested Loop  (cost=4.02..7.06 rows=1 width=281) (actual time=0.008..0.010 rows=0 loops=1)
   I/O Timings: read=0.000 write=0.000
   ->  HashAggregate  (cost=3.46..3.47 rows=1 width=8) (actual time=0.008..0.009 rows=0 loops=1)
         Group Key: project_relation_export_uploads.id
         I/O Timings: read=0.000 write=0.000
         ->  Nested Loop  (cost=0.42..3.45 rows=1 width=8) (actual time=0.007..0.007 rows=0 loops=1)
               I/O Timings: read=0.000 write=0.000
               ->  Nested Loop  (cost=0.00..0.01 rows=1 width=16) (actual time=0.006..0.007 rows=0 loops=1)
                     I/O Timings: read=0.000 write=0.000
                     ->  Seq Scan on public.project_relation_export_uploads  (cost=0.00..0.00 rows=1 width=16) (actual time=0.006..0.006 rows=0 loops=1)
                           I/O Timings: read=0.000 write=0.000
                     ->  Seq Scan on public.project_relation_exports relation_export  (cost=0.00..0.00 rows=1 width=16) (actual time=0.000..0.000 rows=0 loops=0)
                           I/O Timings: read=0.000 write=0.000
               ->  Index Scan using project_export_jobs_pkey on public.project_export_jobs  (cost=0.42..3.44 rows=1 width=8) (actual time=0.000..0.000 rows=0 loops=0)
                     Index Cond: (project_export_jobs.id = relation_export.project_export_job_id)
                     Filter: (project_export_jobs.updated_at < '2024-01-30 21:18:11.186764+00'::timestamp with time zone)
                     Rows Removed by Filter: 0
                     I/O Timings: read=0.000 write=0.000
   ->  Index Scan using index_uploads_on_model_id_and_model_type on public.uploads  (cost=0.56..3.58 rows=1 width=281) (actual time=0.000..0.000 rows=0 loops=0)
         Index Cond: ((uploads.model_id = project_relation_export_uploads.id) AND ((uploads.model_type)::text = 'Projects::ImportExport::RelationExportUpload'::text))
         I/O Timings: read=0.000 write=0.000

How to set up and validate locally

  1. Make it so that the current time is at more than a week in the past and start your local environment. I find it easiest to manually set your system's time

  2. Trigger a parallel project export in the Rails console and let it complete. Do this multiple times with multiple projects if you'd like:

    project_id = # ID of Project you want to export
    user_id = # ID of the user that performs the export
    Projects::ImportExport::CreateRelationExportsWorker.perform_async(user_id, project_id)
  3. Verify the export completed, and there's a ProjectExportJob with associated RelationExports and RelationExportUploads

  4. Reset your local environment to use the actual time and date. If you'd like, run a fresh parallel project export.

  5. Run the PruneProjectExportJobsWorker:

    Gitlab::Export::PruneProjectExportJobsWorker.perform_async
  6. Verify that the old ProjectExportJobs and their associations were pruned, and any fresh ProjectExportJobs were not pruned

Related to #382885 (closed)

Edited by Sam Word

Merge request reports