# Backfill the `namespace_id` of the existing `vulnerability_reads` records
## What does this MR do and why?
Related to [Proposal] Add `namespace_id` to `vulnerability... (#367300 - closed).
## Database review
This MR introduces a post-deployment migration to schedule the background jobs.
### Output of the post-deployment migration

`rake db:migrate:up`

```
main: == 20220722145845 ScheduleBackfillingTheNamespaceIdForVulnerabilityReads: migrating
main: == 20220722145845 ScheduleBackfillingTheNamespaceIdForVulnerabilityReads: migrated (0.1035s)
```

`rake db:migrate:down`

```
main: == 20220722145845 ScheduleBackfillingTheNamespaceIdForVulnerabilityReads: reverting
main: == 20220722145845 ScheduleBackfillingTheNamespaceIdForVulnerabilityReads: reverted (0.0374s)
```
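For reviewers' context, the scheduling migration behind this output could look roughly like the sketch below. It assumes GitLab's batched background migration helpers (`queue_batched_background_migration` / `delete_batched_background_migration`); the job class name and the batching constants are illustrative, not copied verbatim from this MR.

```ruby
# Sketch only: assumes GitLab's batched background migration helpers; the
# job class name and constants are illustrative, not copied from the MR.
class ScheduleBackfillingTheNamespaceIdForVulnerabilityReads < Gitlab::Database::Migration[2.0]
  MIGRATION = 'BackfillNamespaceIdOfVulnerabilityReads' # hypothetical job class name
  DELAY_INTERVAL = 2.minutes
  BATCH_SIZE = 10_000
  SUB_BATCH_SIZE = 200

  restrict_gitlab_migration gitlab_schema: :gitlab_main

  def up
    queue_batched_background_migration(
      MIGRATION,
      :vulnerability_reads,
      :vulnerability_id,
      job_interval: DELAY_INTERVAL,
      batch_size: BATCH_SIZE,
      sub_batch_size: SUB_BATCH_SIZE
    )
  end

  def down
    delete_batched_background_migration(MIGRATION, :vulnerability_reads, :vulnerability_id, [])
  end
end
```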
### Update query used in background migration

```sql
UPDATE
  vulnerability_reads
SET
  namespace_id = sub_query.namespace_id
FROM (
  SELECT
    vulnerability_reads.vulnerability_id,
    projects.namespace_id
  FROM
    "vulnerability_reads"
    INNER JOIN projects ON projects.id = vulnerability_reads.project_id
  WHERE
    "vulnerability_reads"."vulnerability_id" BETWEEN 1 AND 11513
    AND "vulnerability_reads"."vulnerability_id" >= 1
    AND "vulnerability_reads"."vulnerability_id" < 812) AS sub_query
WHERE
  vulnerability_reads.vulnerability_id = sub_query.vulnerability_id
```
```
 ModifyTable on public.vulnerability_reads  (cost=1.57..5889.54 rows=773 width=173) (actual time=20.233..20.236 rows=0 loops=1)
   Buffers: shared hit=4820 read=19 dirtied=19
   I/O Timings: read=12.146 write=0.000
   ->  Nested Loop  (cost=1.57..5889.54 rows=773 width=173) (actual time=0.152..2.133 rows=200 loops=1)
         Buffers: shared hit=2137
         I/O Timings: read=0.000 write=0.000
         ->  Nested Loop  (cost=1.13..3219.89 rows=773 width=167) (actual time=0.116..1.331 rows=200 loops=1)
               Buffers: shared hit=1334
               I/O Timings: read=0.000 write=0.000
               ->  Index Scan using index_vulnerability_reads_on_vulnerability_id on public.vulnerability_reads vulnerability_reads_1  (cost=0.56..452.12 rows=773 width=22) (actual time=0.110..0.550 rows=200 loops=1)
                     Index Cond: ((vulnerability_reads_1.vulnerability_id >= 1) AND (vulnerability_reads_1.vulnerability_id <= 11513) AND (vulnerability_reads_1.vulnerability_id >= 1) AND (vulnerability_reads_1.vulnerability_id < 812))
                     Buffers: shared hit=210
                     I/O Timings: read=0.000 write=0.000
               ->  Index Scan using index_vulnerability_reads_on_vulnerability_id on public.vulnerability_reads  (cost=0.56..3.58 rows=1 width=153) (actual time=0.003..0.003 rows=1 loops=200)
                     Index Cond: (vulnerability_reads.vulnerability_id = vulnerability_reads_1.vulnerability_id)
                     Buffers: shared hit=1062
                     I/O Timings: read=0.000 write=0.000
         ->  Index Scan using projects_pkey on public.projects  (cost=0.44..3.45 rows=1 width=14) (actual time=0.003..0.003 rows=1 loops=200)
               Index Cond: (projects.id = vulnerability_reads_1.project_id)
               Buffers: shared hit=803
               I/O Timings: read=0.000 write=0.000
```
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/11159/commands/39930
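For context, a batched background migration job issuing an update like this could be sketched as below. It assumes GitLab's `BatchedMigrationJob` base class and Rails' `pick`; the class name and the exact way the sub-batch bounds are obtained are illustrative, not taken from this MR.

```ruby
# Sketch only: assumes GitLab's BatchedMigrationJob API; the class name and
# the way sub-batch bounds are derived are illustrative.
module Gitlab
  module BackgroundMigration
    class BackfillNamespaceIdOfVulnerabilityReads < BatchedMigrationJob
      operation_name :update_all

      def perform
        # each_sub_batch yields relations of `sub_batch_size` (200) rows keyed
        # on vulnerability_id, producing the ranges seen in the query above.
        each_sub_batch do |sub_batch|
          first, last = sub_batch.pick('MIN(vulnerability_id), MAX(vulnerability_id)')

          connection.execute(<<~SQL)
            UPDATE vulnerability_reads
            SET namespace_id = sub_query.namespace_id
            FROM (
              SELECT vulnerability_reads.vulnerability_id, projects.namespace_id
              FROM vulnerability_reads
              INNER JOIN projects ON projects.id = vulnerability_reads.project_id
              WHERE vulnerability_reads.vulnerability_id BETWEEN #{first} AND #{last}
            ) AS sub_query
            WHERE vulnerability_reads.vulnerability_id = sub_query.vulnerability_id
          SQL
        end
      end
    end
  end
end
```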
### Time estimations

As of writing, there are 47_776_552 records in the `vulnerability_reads` table. With a batch size of 10_000, this means we will create 4_778 jobs with a gap of 2 minutes between each: `4778 * 2 / 60 / 24 = ~7 days`.

Since the update query runs in around 30ms and the sub-batch size is 200, each background job will run 50 update queries, so `50 * 30ms = 1.5 seconds`. Even with the additional overhead of each iteration, every job is very likely to finish well within its 2-minute window.
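The estimate above can be double-checked with a quick back-of-the-envelope calculation (all numbers are taken from this section; the 10_000 batch size is derived from the record and job counts):

```ruby
# Back-of-the-envelope check of the scheduling estimate above.
total_records   = 47_776_552
batch_size      = 10_000   # derived: total_records / number of jobs
sub_batch_size  = 200
job_gap_minutes = 2
query_ms        = 30

jobs = (total_records / batch_size.to_f).ceil         # => 4778
total_days = jobs * job_gap_minutes / 60.0 / 24.0     # => ~6.6, i.e. ~7 days

queries_per_job = batch_size / sub_batch_size         # => 50
seconds_per_job = queries_per_job * query_ms / 1000.0 # => 1.5

puts "jobs=#{jobs} days=#{total_days.round(1)} seconds_per_job=#{seconds_per_job}"
```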
## MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- [x] I have evaluated the MR acceptance checklist for this MR.