Skip to content

Replace each_batch PruneExpiredExportJobsService with scalable queries

What does this MR do and why?

This MR removes the plain .each_batch queries added in Prune old RelationExport and RelationExportUplo... (!142246 - merged) after a database review noted that it wouldn't scale well for high volumes of data. Instead, a new index on project_export_uploads for updated_at and id was added to efficiently delete batches of ProjectExportJobs in a loop. Then, keyset-based loop was implemented to delete Uploads associated with the expired export jobs.

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

How to set up and validate locally

Same as Prune old RelationExport and RelationExportUplo... (!142246 - merged). No actual functionality was changed.

Queries

deleted_count = ProjectExportJob.prunable.limit(1000).delete_all

Raw SQL:

DELETE FROM project_export_jobs WHERE project_export_jobs.id IN (SELECT project_export_jobs.id FROM project_export_jobs WHERE (updated_at < '2024-02-19 17:15:14.211858') LIMIT 1000)

Query plan without new index.

prunable_uploads.delete_all

Raw SQL without keyset iteration:

SELECT "uploads".* FROM "uploads" WHERE "uploads"."model_type" = 'Projects::ImportExport::RelationExportUpload' AND "uploads"."model_id" IN (SELECT "project_relation_export_uploads"."id" FROM "project_relation_export_uploads" INNER JOIN "project_relation_exports" "relation_export" ON "relation_export"."id" = "project_relation_export_uploads"."project_relation_export_id" WHERE "relation_export"."project_export_job_id" IN (SELECT "project_export_jobs"."id" FROM "project_export_jobs" WHERE (updated_at < '2024-02-19 17:15:14.211858')));

Query execution plan. This plan should be the same as the MR that introduced it, however, this time it uses keyset pagination instead of EachBatch.

Related to #441698 (closed)

Edited by Sam Word

Merge request reports