Backfill the `namespace_id` of the existing `vulnerability_reads` records

What does this MR do and why?

Related to [Proposal] Add `namespace_id` to `vulnerability... (#367300 - closed).

Database review

This MR introduces a post-deployment migration to schedule the background jobs.

Output of the post-deployment migration

`rake db:migrate:up`

```
main: == 20220722145845 ScheduleBackfillingTheNamespaceIdForVulnerabilityReads: migrating
main: == 20220722145845 ScheduleBackfillingTheNamespaceIdForVulnerabilityReads: migrated (0.1035s)
```

`rake db:migrate:down`

```
main: == 20220722145845 ScheduleBackfillingTheNamespaceIdForVulnerabilityReads: reverting
main: == 20220722145845 ScheduleBackfillingTheNamespaceIdForVulnerabilityReads: reverted (0.0374s)
```

Update query used in background migration

```sql
UPDATE
    vulnerability_reads
SET
    namespace_id = sub_query.namespace_id
FROM (
    SELECT
        vulnerability_reads.vulnerability_id,
        projects.namespace_id
    FROM
        "vulnerability_reads"
        INNER JOIN projects ON projects.id = vulnerability_reads.project_id
    WHERE
        "vulnerability_reads"."vulnerability_id" BETWEEN 1 AND 11513
        AND "vulnerability_reads"."vulnerability_id" >= 1
        AND "vulnerability_reads"."vulnerability_id" < 812) AS sub_query
WHERE
    vulnerability_reads.vulnerability_id = sub_query.vulnerability_id
```
```
 ModifyTable on public.vulnerability_reads  (cost=1.57..5889.54 rows=773 width=173) (actual time=20.233..20.236 rows=0 loops=1)
   Buffers: shared hit=4820 read=19 dirtied=19
   I/O Timings: read=12.146 write=0.000
   ->  Nested Loop  (cost=1.57..5889.54 rows=773 width=173) (actual time=0.152..2.133 rows=200 loops=1)
         Buffers: shared hit=2137
         I/O Timings: read=0.000 write=0.000
         ->  Nested Loop  (cost=1.13..3219.89 rows=773 width=167) (actual time=0.116..1.331 rows=200 loops=1)
               Buffers: shared hit=1334
               I/O Timings: read=0.000 write=0.000
               ->  Index Scan using index_vulnerability_reads_on_vulnerability_id on public.vulnerability_reads vulnerability_reads_1  (cost=0.56..452.12 rows=773 width=22) (actual time=0.110..0.550 rows=200 loops=1)
                     Index Cond: ((vulnerability_reads_1.vulnerability_id >= 1) AND (vulnerability_reads_1.vulnerability_id <= 11513) AND (vulnerability_reads_1.vulnerability_id >= 1) AND (vulnerability_reads_1.vulnerability_id < 812))
                     Buffers: shared hit=210
                     I/O Timings: read=0.000 write=0.000
               ->  Index Scan using index_vulnerability_reads_on_vulnerability_id on public.vulnerability_reads  (cost=0.56..3.58 rows=1 width=153) (actual time=0.003..0.003 rows=1 loops=200)
                     Index Cond: (vulnerability_reads.vulnerability_id = vulnerability_reads_1.vulnerability_id)
                     Buffers: shared hit=1062
                     I/O Timings: read=0.000 write=0.000
         ->  Index Scan using projects_pkey on public.projects  (cost=0.44..3.45 rows=1 width=14) (actual time=0.003..0.003 rows=1 loops=200)
               Index Cond: (projects.id = vulnerability_reads_1.project_id)
               Buffers: shared hit=803
               I/O Timings: read=0.000 write=0.000
```

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/11159/commands/39930
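The sub-batch bounds in the plan (`vulnerability_id >= 1 AND vulnerability_id < 812` selecting 200 rows) reflect the fact that the ids are sparse: a 200-row sub-batch can span a much wider id range. A minimal sketch in plain Ruby, assuming sorted ids (`sub_batch_bounds` is a hypothetical helper for illustration, not GitLab code; the real batching uses `EachBatch`, with an exclusive upper bound taken from the row after the window):

```ruby
# Illustrative sketch of how sparse ids yield wide sub-batch ranges.
# Splits a sorted list of ids into fixed-size sub-batches and returns
# the inclusive [min_id, max_id] pair each UPDATE would be scoped to.
def sub_batch_bounds(sorted_ids, sub_batch_size)
  sorted_ids.each_slice(sub_batch_size).map { |slice| [slice.first, slice.last] }
end

# With only every 5th id present, one 200-row sub-batch spans ids
# 1..996, just as 200 rows span ids 1..811 in the plan above.
sparse_ids = (1..1000).step(5).to_a
sub_batch_bounds(sparse_ids, 200) # => [[1, 996]]
```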

Time estimations

As of writing, there are 47_776_552 records in the vulnerability_reads table. With a batch size of 10_000, this means we will create 4_778 jobs with a gap of 2 minutes between each one: 4778 * 2 / 60 / 24 = ~6.6, so roughly 7 days in total.

Since the update query runs in around 30 ms and the sub-batch size is 200, each background job will run 50 update queries, so 50 * 30 ms = 1.5 seconds of query time per job. Even with the additional overhead of each iteration, each job is very likely to finish well within its 2-minute interval.
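The estimates above can be reproduced with a few lines of Ruby (the batch size of 10_000 is inferred from the row and job counts; the other figures come from the text):

```ruby
# Timing estimate for the backfill, using the figures from this MR.
# The batch size of 10_000 is inferred (47_776_552 rows / 4_778 jobs);
# the 2-minute delay, 200-row sub-batches, and ~30 ms per UPDATE are
# taken from the description above.
total_rows     = 47_776_552
batch_size     = 10_000
sub_batch_size = 200
delay_minutes  = 2
update_ms      = 30

jobs = (total_rows / batch_size.to_f).ceil            # jobs to enqueue
total_days = jobs * delay_minutes / 60.0 / 24         # wall-clock runtime
sub_batches_per_job = batch_size / sub_batch_size     # updates per job
busy_seconds_per_job = sub_batches_per_job * update_ms / 1000.0

puts jobs                   # 4778
puts total_days.round(1)    # 6.6
puts busy_seconds_per_job   # 1.5
```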

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Mehmet Emin INAC
