Fix data integrity on issue_feedback, issue_links
This is a follow-up from https://gitlab.com/gitlab-org/gitlab/-/issues/223770#note_366994677.
Description
We need to create an Vulnerabilities::IssueLink
object for every Vulnerability
with associated Vulnerability::Finding
that has an Vulnerability::Feedback
of type issue
associated with it.
Find out how many Vulnerabilities do not have an IssueLink
SQL
SELECT vulnerability.id
FROM vulnerability_issue_links vil
RIGHT JOIN vulnerabilities vulnerability
ON vil.vulnerability_id = vulnerability.id
WHERE vil.vulnerability_id IS NULL;
Plan details
``` Merge Anti Join (cost=0.57..81838.72 rows=2685650 width=8) (actual time=5.015..12359.306 rows=2682941 loops=1) Merge Cond: (vulnerability.id = vil.vulnerability_id) Buffers: shared hit=266236 read=19353 dirtied=4090 I/O Timings: read=10386.929 -> Index Only Scan using vulnerabilities_pkey on public.vulnerabilities vulnerability (cost=0.43..75114.91 rows=2685868 width=8) (actual time=4.974..11686.093 rows=2683160 loops=1) Heap Fetches: 257912 Buffers: shared hit=266233 read=19353 dirtied=4090 I/O Timings: read=10386.929 -> Index Only Scan using idx_vulnerability_issue_links_on_vulnerability_id_and_issue_id on public.vulnerability_issue_links vil (cost=0.14..6.41 rows=218 width=8) (actual time=0.036..0.292 rows=219 loops=1) Heap Fetches: 107 Buffers: shared hit=3 ```Find out how many of those have a finding associated
SQL
SELECT vo.project_id, vo.report_type, vo.project_fingerprint
FROM vulnerability_occurrences vo
JOIN vulnerabilities vulnerability
ON vo.vulnerability_id = vulnerability.id
WHERE vulnerability_id IN (
SELECT vulnerability.id
FROM vulnerability_issue_links vil
RIGHT JOIN vulnerabilities vulnerability
ON vil.vulnerability_id = vulnerability.id
WHERE vil.vulnerability_id IS NULL
);
Get those with an Feedback of type 'issue'
SQL
SELECT vfb.id, vfb.feedback_type, vfb.issue_id
FROM vulnerability_feedback vfb
WHERE (vfb.project_id, vfb.category, vfb.project_fingerprint) IN (
SELECT vo.project_id, vo.report_type, encode(vo.project_fingerprint, 'hex')
FROM vulnerability_occurrences vo
JOIN vulnerabilities vulnerability
ON vo.vulnerability_id = vulnerability.id
WHERE vulnerability_id IN (
SELECT vulnerability.id
FROM vulnerability_issue_links vil
RIGHT JOIN vulnerabilities vulnerability
ON vil.vulnerability_id = vulnerability.id
WHERE vil.vulnerability_id IS NULL
)
) AND vfb.feedback_type = 1 AND vfb.issue_id IS NOT NULL;
Plan details
Nested Loop Semi Join (cost=1.85..62759.89 rows=40 width=10) (actual time=612.653..132348.826 rows=568 loops=1)
Buffers: shared hit=606990 read=116720 dirtied=2000
I/O Timings: read=130175.512
-> Index Scan using index_vulnerability_feedback_on_issue_id on public.vulnerability_feedback vfb (cost=0.29..194.36 rows=80 width=56) (actual time=2.858..270.883 rows=1401 loops=1)
Index Cond: (vfb.issue_id IS NOT NULL)
Filter: (vfb.feedback_type = 1)
Rows Removed by Filter: 0
Buffers: shared hit=612 read=301 dirtied=15
I/O Timings: read=256.589
-> Nested Loop (cost=1.56..782.06 rows=1 width=27) (actual time=94.271..94.271 rows=0 loops=1401)
Buffers: shared hit=606378 read=116419 dirtied=1985
I/O Timings: read=129918.923
-> Nested Loop Semi Join (cost=1.13..781.60 rows=1 width=43) (actual time=94.269..94.269 rows=0 loops=1401)
Buffers: shared hit=604365 read=116419 dirtied=1985
I/O Timings: read=129918.923
-> Index Scan using index_vulnerability_occurrences_on_unique_keys on public.vulnerability_occurrences vo (cost=0.56..780.96 rows=1 width=35) (actual time=53.340..94.192 rows=1 loops=1401)
Index Cond: (vo.project_id = vfb.project_id)
Filter: ((vfb.category = vo.report_type) AND ((vfb.project_fingerprint)::text = encode(vo.project_fingerprint, 'hex'::text)))
Rows Removed by Filter: 522
Buffers: shared hit=600815 read=116374 dirtied=1984
I/O Timings: read=129843.232
-> Nested Loop Anti Join (cost=0.57..0.63 rows=1 width=8) (actual time=0.071..0.071 rows=0 loops=1428)
Buffers: shared hit=3549 read=45
I/O Timings: read=75.691
-> Index Only Scan using vulnerabilities_pkey on public.vulnerabilities vulnerability_1 (cost=0.43..0.45 rows=1 width=8) (actual time=0.066..0.066 rows=1 loops=1428)
Index Cond: (vulnerability_1.id = vo.vulnerability_id)
Heap Fetches: 115
Buffers: shared hit=2681 read=45
I/O Timings: read=75.691
-> Index Only Scan using idx_vulnerability_issue_links_on_vulnerability_id_and_issue_id on public.vulnerability_issue_links vil (cost=0.14..0.16 rows=1 width=8) (actual time=0.004..0.004 rows=0 loops=770)
Index Cond: (vil.vulnerability_id = vulnerability_1.id)
Heap Fetches: 97
Buffers: shared hit=868
-> Index Only Scan using vulnerabilities_pkey on public.vulnerabilities vulnerability (cost=0.43..0.45 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=568)
Index Cond: (vulnerability.id = vo.vulnerability_id)
Heap Fetches: 78
Buffers: shared hit=2013
Execution summary from #database-lab
Time: 2.206 min
- planning: 16.345 ms
- execution: 2.206 min
- I/O read: 2.170 min
- I/O write: 0.000 ms
Shared buffers:
- hits: 606990 (~4.60 GiB) from the buffer pool
- reads: 116720 (~911.90 MiB) from the OS file cache, including disk I/O
- dirtied: 2000 (~15.60 MiB)
- writes: 0
These queries are intended to be readable, not fast so some (or much) tweaking might be required
Implementation Plan
Discussed with Adam Hegyi on Slack
- Create an index for
(project_id, report_type, encode(project_fingerprint, 'hex'::text))
- We have only 24k
Vulnerabilities::Record
so we should start from there and useEachBatch
Edited by Michał Zając