Fix data integrity on issue_feedback, issue_links

This is a follow-up from https://gitlab.com/gitlab-org/gitlab/-/issues/223770#note_366994677.

Description

We need to create an Vulnerabilities::IssueLink object for every Vulnerability with associated Vulnerability::Finding that has an Vulnerability::Feedback of type issue associated with it.

Find out how many Vulnerabilities do not have an IssueLink

SQL
SELECT vulnerability.id
FROM vulnerability_issue_links vil
RIGHT JOIN vulnerabilities vulnerability
ON vil.vulnerability_id = vulnerability.id
WHERE vil.vulnerability_id IS NULL;
Plan details ``` Merge Anti Join (cost=0.57..81838.72 rows=2685650 width=8) (actual time=5.015..12359.306 rows=2682941 loops=1) Merge Cond: (vulnerability.id = vil.vulnerability_id) Buffers: shared hit=266236 read=19353 dirtied=4090 I/O Timings: read=10386.929 -> Index Only Scan using vulnerabilities_pkey on public.vulnerabilities vulnerability (cost=0.43..75114.91 rows=2685868 width=8) (actual time=4.974..11686.093 rows=2683160 loops=1) Heap Fetches: 257912 Buffers: shared hit=266233 read=19353 dirtied=4090 I/O Timings: read=10386.929 -> Index Only Scan using idx_vulnerability_issue_links_on_vulnerability_id_and_issue_id on public.vulnerability_issue_links vil (cost=0.14..6.41 rows=218 width=8) (actual time=0.036..0.292 rows=219 loops=1) Heap Fetches: 107 Buffers: shared hit=3 ```

Find out how many of those have a finding associated

SQL
SELECT vo.project_id, vo.report_type, vo.project_fingerprint
FROM vulnerability_occurrences vo
JOIN vulnerabilities vulnerability
ON vo.vulnerability_id = vulnerability.id
WHERE vulnerability_id IN (
  SELECT vulnerability.id
  FROM vulnerability_issue_links vil
  RIGHT JOIN vulnerabilities vulnerability
  ON vil.vulnerability_id = vulnerability.id
  WHERE vil.vulnerability_id IS NULL
);

Get those with an Feedback of type 'issue'

SQL
SELECT vfb.id, vfb.feedback_type, vfb.issue_id
FROM vulnerability_feedback vfb
WHERE (vfb.project_id, vfb.category, vfb.project_fingerprint) IN (
  SELECT vo.project_id, vo.report_type, encode(vo.project_fingerprint, 'hex')
  FROM vulnerability_occurrences vo
  JOIN vulnerabilities vulnerability
  ON vo.vulnerability_id = vulnerability.id
  WHERE vulnerability_id IN (
    SELECT vulnerability.id
    FROM vulnerability_issue_links vil
    RIGHT JOIN vulnerabilities vulnerability
    ON vil.vulnerability_id = vulnerability.id
    WHERE vil.vulnerability_id IS NULL
  )
) AND vfb.feedback_type = 1 AND vfb.issue_id IS NOT NULL;
Plan details
 Nested Loop Semi Join  (cost=1.85..62759.89 rows=40 width=10) (actual time=612.653..132348.826 rows=568 loops=1)
   Buffers: shared hit=606990 read=116720 dirtied=2000
   I/O Timings: read=130175.512
   ->  Index Scan using index_vulnerability_feedback_on_issue_id on public.vulnerability_feedback vfb  (cost=0.29..194.36 rows=80 width=56) (actual time=2.858..270.883 rows=1401 loops=1)
         Index Cond: (vfb.issue_id IS NOT NULL)
         Filter: (vfb.feedback_type = 1)
         Rows Removed by Filter: 0
         Buffers: shared hit=612 read=301 dirtied=15
         I/O Timings: read=256.589
   ->  Nested Loop  (cost=1.56..782.06 rows=1 width=27) (actual time=94.271..94.271 rows=0 loops=1401)
         Buffers: shared hit=606378 read=116419 dirtied=1985
         I/O Timings: read=129918.923
         ->  Nested Loop Semi Join  (cost=1.13..781.60 rows=1 width=43) (actual time=94.269..94.269 rows=0 loops=1401)
               Buffers: shared hit=604365 read=116419 dirtied=1985
               I/O Timings: read=129918.923
               ->  Index Scan using index_vulnerability_occurrences_on_unique_keys on public.vulnerability_occurrences vo  (cost=0.56..780.96 rows=1 width=35) (actual time=53.340..94.192 rows=1 loops=1401)
                     Index Cond: (vo.project_id = vfb.project_id)
                     Filter: ((vfb.category = vo.report_type) AND ((vfb.project_fingerprint)::text = encode(vo.project_fingerprint, 'hex'::text)))
                     Rows Removed by Filter: 522
                     Buffers: shared hit=600815 read=116374 dirtied=1984
                     I/O Timings: read=129843.232
               ->  Nested Loop Anti Join  (cost=0.57..0.63 rows=1 width=8) (actual time=0.071..0.071 rows=0 loops=1428)
                     Buffers: shared hit=3549 read=45
                     I/O Timings: read=75.691
                     ->  Index Only Scan using vulnerabilities_pkey on public.vulnerabilities vulnerability_1  (cost=0.43..0.45 rows=1 width=8) (actual time=0.066..0.066 rows=1 loops=1428)
                           Index Cond: (vulnerability_1.id = vo.vulnerability_id)
                           Heap Fetches: 115
                           Buffers: shared hit=2681 read=45
                           I/O Timings: read=75.691
                     ->  Index Only Scan using idx_vulnerability_issue_links_on_vulnerability_id_and_issue_id on public.vulnerability_issue_links vil  (cost=0.14..0.16 rows=1 width=8) (actual time=0.004..0.004 rows=0 loops=770)
                           Index Cond: (vil.vulnerability_id = vulnerability_1.id)
                           Heap Fetches: 97
                           Buffers: shared hit=868
         ->  Index Only Scan using vulnerabilities_pkey on public.vulnerabilities vulnerability  (cost=0.43..0.45 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=568)
               Index Cond: (vulnerability.id = vo.vulnerability_id)
               Heap Fetches: 78
               Buffers: shared hit=2013
Execution summary from #database-lab
Time: 2.206 min
  - planning: 16.345 ms
  - execution: 2.206 min
    - I/O read: 2.170 min
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 606990 (~4.60 GiB) from the buffer pool
  - reads: 116720 (~911.90 MiB) from the OS file cache, including disk I/O
  - dirtied: 2000 (~15.60 MiB)
  - writes: 0

These queries are intended to be readable, not fast so some (or much) tweaking might be required

Implementation Plan

Discussed with Adam Hegyi on Slack

  1. Create an index for (project_id, report_type, encode(project_fingerprint, 'hex'::text))
  2. We have only 24k Vulnerabilities::Record so we should start from there and use EachBatch
Edited by Michał Zając