Backfill the `namespace_id` of the existing `vulnerability_reads` records

What does this MR do and why?

Related to [Proposal] Add `namespace_id` to `vulnerability... (#367300 - closed).

Database review

This MR introduces a post-deployment migration to schedule the background jobs.

Output of the post-deployment migration

`rake db:migrate:up`

```
main: == 20220722145845 ScheduleBackfillingTheNamespaceIdForVulnerabilityReads: migrating
main: == 20220722145845 ScheduleBackfillingTheNamespaceIdForVulnerabilityReads: migrated (0.1035s)
```

`rake db:migrate:down`

```
main: == 20220722145845 ScheduleBackfillingTheNamespaceIdForVulnerabilityReads: reverting
main: == 20220722145845 ScheduleBackfillingTheNamespaceIdForVulnerabilityReads: reverted (0.0374s)
```

Update query used in background migration

```sql
UPDATE
    vulnerability_reads
SET
    namespace_id = sub_query.namespace_id
FROM (
    SELECT
        vulnerability_reads.vulnerability_id,
        projects.namespace_id
    FROM
        "vulnerability_reads"
        INNER JOIN projects ON projects.id = vulnerability_reads.project_id
    WHERE
        "vulnerability_reads"."vulnerability_id" BETWEEN 1 AND 11513
        AND "vulnerability_reads"."vulnerability_id" >= 1
        AND "vulnerability_reads"."vulnerability_id" < 812) AS sub_query
WHERE
    vulnerability_reads.vulnerability_id = sub_query.vulnerability_id
```
```
 ModifyTable on public.vulnerability_reads  (cost=1.57..5889.54 rows=773 width=173) (actual time=20.233..20.236 rows=0 loops=1)
   Buffers: shared hit=4820 read=19 dirtied=19
   I/O Timings: read=12.146 write=0.000
   ->  Nested Loop  (cost=1.57..5889.54 rows=773 width=173) (actual time=0.152..2.133 rows=200 loops=1)
         Buffers: shared hit=2137
         I/O Timings: read=0.000 write=0.000
         ->  Nested Loop  (cost=1.13..3219.89 rows=773 width=167) (actual time=0.116..1.331 rows=200 loops=1)
               Buffers: shared hit=1334
               I/O Timings: read=0.000 write=0.000
               ->  Index Scan using index_vulnerability_reads_on_vulnerability_id on public.vulnerability_reads vulnerability_reads_1  (cost=0.56..452.12 rows=773 width=22) (actual time=0.110..0.550 rows=200 loops=1)
                     Index Cond: ((vulnerability_reads_1.vulnerability_id >= 1) AND (vulnerability_reads_1.vulnerability_id <= 11513) AND (vulnerability_reads_1.vulnerability_id >= 1) AND (vulnerability_reads_1.vulnerability_id < 812))
                     Buffers: shared hit=210
                     I/O Timings: read=0.000 write=0.000
               ->  Index Scan using index_vulnerability_reads_on_vulnerability_id on public.vulnerability_reads  (cost=0.56..3.58 rows=1 width=153) (actual time=0.003..0.003 rows=1 loops=200)
                     Index Cond: (vulnerability_reads.vulnerability_id = vulnerability_reads_1.vulnerability_id)
                     Buffers: shared hit=1062
                     I/O Timings: read=0.000 write=0.000
         ->  Index Scan using projects_pkey on public.projects  (cost=0.44..3.45 rows=1 width=14) (actual time=0.003..0.003 rows=1 loops=200)
               Index Cond: (projects.id = vulnerability_reads_1.project_id)
               Buffers: shared hit=803
               I/O Timings: read=0.000 write=0.000
```

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/11159/commands/39930
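The sub-batch bounds in the plan (`vulnerability_id >= 1 AND vulnerability_id < 812` selecting 200 rows) reflect the fact that the ids are sparse: a 200-row sub-batch can span a much wider id range. A minimal sketch in plain Ruby, assuming sorted ids (`sub_batch_bounds` is a hypothetical helper for illustration, not GitLab code; the real batching uses `EachBatch`, with an exclusive upper bound taken from the row after the window):

```ruby
# Illustrative sketch of how sparse ids yield wide sub-batch ranges.
# Splits a sorted list of ids into fixed-size sub-batches and returns
# the inclusive [min_id, max_id] pair each UPDATE would be scoped to.
def sub_batch_bounds(sorted_ids, sub_batch_size)
  sorted_ids.each_slice(sub_batch_size).map { |slice| [slice.first, slice.last] }
end

# With only every 5th id present, one 200-row sub-batch spans ids
# 1..996, just as 200 rows span ids 1..811 in the plan above.
sparse_ids = (1..1000).step(5).to_a
sub_batch_bounds(sparse_ids, 200) # => [[1, 996]]
```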

Time estimations

As of writing, there are 47_776_552 records in the vulnerability_reads table. With a batch size of 10_000, this means we will create 4_778 jobs with a gap of 2 minutes between each one: 4778 * 2 / 60 / 24 = ~6.6, so roughly 7 days in total.

Since the update query runs in around 30 ms and the sub-batch size is 200, each background job will run 50 update queries, so 50 * 30 ms = 1.5 seconds of query time per job. Even with the additional overhead of each iteration, each job is very likely to finish well within its 2-minute interval.
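The estimates above can be reproduced with a few lines of Ruby (the batch size of 10_000 is inferred from the row and job counts; the other figures come from the text):

```ruby
# Timing estimate for the backfill, using the figures from this MR.
# The batch size of 10_000 is inferred (47_776_552 rows / 4_778 jobs);
# the 2-minute delay, 200-row sub-batches, and ~30 ms per UPDATE are
# taken from the description above.
total_rows     = 47_776_552
batch_size     = 10_000
sub_batch_size = 200
delay_minutes  = 2
update_ms      = 30

jobs = (total_rows / batch_size.to_f).ceil            # jobs to enqueue
total_days = jobs * delay_minutes / 60.0 / 24         # wall-clock runtime
sub_batches_per_job = batch_size / sub_batch_size     # updates per job
busy_seconds_per_job = sub_batches_per_job * update_ms / 1000.0

puts jobs                   # 4778
puts total_days.round(1)    # 6.6
puts busy_seconds_per_job   # 1.5
```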

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Mehmet Emin INAC
