Optimise Vulnerability Auto-Resolving with In-Operator Optimisation
requested to merge schmil-Optimise-SAST-Vulnerability-Auto-Resolving-with-In-Operator-Optimisation into master
-
Please check this box if this contribution uses AI-generated content (including content generated by GitLab Duo features) as outlined in the GitLab DCO & CLA
What does this MR do and why?
Problem
The current implementation of the vulnerability_ids_for
method uses a standard active record query, which can be computationally intensive
Solution
Use Batch Iterator of the InOperatorOptimization
module
How to set up and validate locally
- import Security Reports repository
- Remove a vulnerability from one of the security reports and merge the change
DB
Before the change
Link to postgres.ai
Query
SELECT
*,
vulnerability_occurrences.vulnerability_id AS vulnerability_id
FROM
(
WITH recursive "array_cte" AS materialized (
SELECT
"vulnerability_identifiers"."id"
FROM
"vulnerability_identifiers"
WHERE
"vulnerability_identifiers"."id" IN (
391443761, 579797979, 4008437, 521825430,
1170120, 1556418, 411142512, 3444484,
370052924, 857033, 2332695, 391443762,
3444500, 997628, 1224924, 2316204,
504345345, 16781253, 3971429, 1737876,
6994063, 3444497, 1360628451, 4764,
45949332, 922077050, 120051246, 1212213299,
34172902, 370052925, 158639479, 3093338,
109987259, 45141142, 125204793, 51296492,
62216501, 563060, 1432291701, 3305200,
1403804268, 903711, 1224922, 5475553,
1469903, 1102751, 2239821, 776213981,
499539, 1867682, 1030292425, 60737826,
1098381, 117401895, 77338273, 579797980,
1377423776, 997620, 1693846, 19186429,
1212214255, 464943, 3305208, 155954740,
4626888, 3093341, 1158290344, 4766,
1170122, 1115905169, 136013002, 2316205,
158618668, 857030, 4008443, 4117916,
391443767, 25552728, 483798999, 2239810,
579797981, 3001697, 4008430, 85688157,
1158290346, 480228523, 1211583931,
195434428, 1186811285, 3589711, 1211575,
180760, 511462153, 2916, 2912, 4008446,
1451111, 320639382, 1451108, 4772
)
),
"recursive_keyset_cte" AS (
(
SELECT
NULL :: vulnerability_identifiers AS records,
array_cte_id_array,
vulnerability_identifiers_id_array,
0 :: bigint AS count
FROM
(
SELECT
array_agg("array_cte"."id") AS array_cte_id_array,
array_agg(
"vulnerability_identifiers"."id"
) AS vulnerability_identifiers_id_array
FROM
(
SELECT
"array_cte"."id"
FROM
array_cte
) array_cte
LEFT JOIN lateral (
SELECT
"vulnerability_identifiers"."id" AS id
FROM
"vulnerability_identifiers"
WHERE
"vulnerability_identifiers"."id" = "array_cte"."id"
ORDER BY
"vulnerability_identifiers"."id" ASC
limit
1
) vulnerability_identifiers ON true
WHERE
"vulnerability_identifiers"."id" IS NOT NULL
) array_scope_lateral_query
limit
1
)
UNION ALL
(
SELECT
(
SELECT
vulnerability_identifiers
FROM
"vulnerability_identifiers"
WHERE
"vulnerability_identifiers"."id" = recursive_keyset_cte.vulnerability_identifiers_id_array[position]
limit
1
),
array_cte_id_array,
recursive_keyset_cte.vulnerability_identifiers_id_array[ : position_query.position - 1] || next_cursor_values.id || recursive_keyset_cte.vulnerability_identifiers_id_array[position_query.position + 1 : ],
recursive_keyset_cte.count + 1
FROM
recursive_keyset_cte,
lateral (
SELECT
id,
position
FROM
unnest(
vulnerability_identifiers_id_array
) WITH ordinality AS u(id, position)
WHERE
id IS NOT NULL
ORDER BY
1 ASC
limit
1
) AS position_query,
lateral (
SELECT
"record"."id"
FROM
(
VALUES
(NULL)
) AS nulls
LEFT JOIN (
SELECT
"vulnerability_identifiers"."id" AS id
FROM
"vulnerability_identifiers"
WHERE
"vulnerability_identifiers"."id" = recursive_keyset_cte.array_cte_id_array[position]
AND (
"vulnerability_identifiers"."id" > recursive_keyset_cte.vulnerability_identifiers_id_array[position]
)
ORDER BY
"vulnerability_identifiers"."id" ASC
limit
1
) record ON true
limit
1
) AS next_cursor_values
)
)
SELECT
(records).*
FROM
"recursive_keyset_cte" AS "vulnerability_identifiers"
WHERE
(count <> 0)
) vulnerability_identifiers
INNER JOIN "vulnerability_occurrences" ON "vulnerability_occurrences"."primary_identifier_id" = "vulnerability_identifiers"."id"
limit
25
Plan
Nested Loop (cost=1.00..1999.89 rows=659 width=16)
-> Index Only Scan using vulnerability_identifiers_pkey on vulnerability_identifiers (cost=0.43..210.12 rows=100 width=8)
Index Cond: (id = ANY ('{391443761,579797979,4008437,521825430,1170120,1556418,411142512,3444484,370052924,857033,2332695,391443762,3444500,997628,1224924,2316204,504345345,16781253,3971429,1737876,6994063,3444497,1360628451,4764,45949332,922077050,120051246,1212213299,34172902,370052925,158639479,3093338,109987259,45141142,125204793,51296492,62216501,563060,1432291701,3305200,1403804268,903711,1224922,5475553,1469903,1102751,2239821,776213981,499539,1867682,1030292425,60737826,1098381,117401895,77338273,579797980,1377423776,997620,1693846,19186429,1212214255,464943,3305208,155954740,4626888,3093341,1158290344,4766,1170122,1115905169,136013002,2316205,158618668,857030,4008443,4117916,391443767,25552728,483798999,2239810,579797981,3001697,4008430,85688157,1158290346,480228523,1211583931,195434428,1186811285,3589711,1211575,180760,511462153,2916,2912,4008446,1451111,320639382,1451108,4772}'::bigint[]))
-> Index Only Scan using index_vulnerability_occurrences_prim_iden_id_and_vuln_id on vulnerability_occurrences (cost=0.57..13.09 rows=481 width=16)
Index Cond: (primary_identifier_id = vulnerability_identifiers.id)
Statistics
Time: 846.881 ms
- planning: 5.034 ms
- execution: 841.847 ms
- I/O read: 825.560 ms
- I/O write: 0.000 ms
Shared buffers:
- hits: 743 (~5.80 MiB) from the buffer pool
- reads: 658 (~5.10 MiB) from the OS file cache, including disk I/O
- dirtied: 61 (~488.00 KiB)
- writes: 0
After the Change
Query
After trying out some optimization options we chose the simple approach i.e.: iterating over identifiers one by one and processing the vulnerabilities by running two separate queries for each identifier as follows
Using short requests will ensure no timeouts
identifier_ids.each do |identifier_id|
scope = Vulnerabilities::Finding.where(primary_identifier_id: identifier_id).order(order)
Gitlab::Pagination::Keyset::Iterator.new(scope: scope)
.each_batch(of: VULNERABILITY_IDENTIFIERS_BATCH_SIZE) do |records|
vulnerability_ids = records.map(&:vulnerability_id)
yield ::Vulnerability.id_in(vulnerability_ids).with_states(:detected).with_resolution(true)
end
end
SQL query for vulnerability_ids = records.map(&:vulnerability_id)
Link to postgres.ai
SELECT
"vulnerability_occurrences".*
FROM
"vulnerability_occurrences"
WHERE
"vulnerability_occurrences"."primary_identifier_id" = 1212213299
ORDER BY
"vulnerability_occurrences"."vulnerability_id" ASC
LIMIT
250
Plan
Limit (cost=0.57..355.96 rows=250 width=1426) (actual time=96.821..96.823 rows=0 loops=1)
Buffers: shared hit=4 read=3
I/O Timings: read=96.663 write=0.000
-> Index Scan using index_vulnerability_occurrences_prim_iden_id_and_vuln_id on public.vulnerability_occurrences (cost=0.57..500.96 rows=352 width=1426) (actual time=96.818..96.819 rows=0 loops=1)
Index Cond: (vulnerability_occurrences.primary_identifier_id = 1212213299)
Buffers: shared hit=4 read=3
I/O Timings: read=96.663 write=0.000
Statistics
Time: 194.719 ms
- planning: 97.764 ms
- execution: 96.955 ms
- I/O read: 96.663 ms
- I/O write: 0.000 ms
Shared buffers:
- hits: 4 (~32.00 KiB) from the buffer pool
- reads: 3 (~24.00 KiB) from the OS file cache, including disk I/O
- dirtied: 0
- writes: 0
SQL query for yield ::Vulnerability.id_in(vulnerability_ids).with_states(:detected).with_resolution(true)
Link to postgres.ai
SELECT
"vulnerabilities".*
FROM
"vulnerabilities"
WHERE
"vulnerabilities"."id" IN (9, 11)
AND "vulnerabilities"."state" = 1
AND "vulnerabilities"."resolved_on_default_branch" = TRUE
Plan
Index Scan using index_vulnerabilities_on_detected_at_and_id on public.vulnerabilities (cost=0.57..3.59 rows=1 width=453) (actual time=10.930..10.934 rows=1 loops=1)
Index Cond: (vulnerabilities.id = 98820130)
Filter: (vulnerabilities.resolved_on_default_branch AND (vulnerabilities.state = 1))
Rows Removed by Filter: 0
Buffers: shared hit=3 read=5 dirtied=1
I/O Timings: read=10.714 write=0.000
Statistics
Time: 14.205 ms
- planning: 3.186 ms
- execution: 11.019 ms
- I/O read: 10.714 ms
- I/O write: 0.000 ms
Shared buffers:
- hits: 3 (~24.00 KiB) from the buffer pool
- reads: 5 (~40.00 KiB) from the OS file cache, including disk I/O
- dirtied: 1 (~8.00 KiB)
- writes: 0
Related to #428522
Edited by Schmil Monderer