Getting the list of licenses to search for the dependency list is slow
Summary
In order to populate the license search picklist, we have to execute a query that is very slow on large groups.
Query:
EXPLAIN SELECT DISTINCT
sbom_licenses.spdx_identifier,
sbom_licenses.name,
sbom_licenses.url
FROM
sbom_occurrences
LEFT JOIN jsonb_to_recordset( sbom_occurrences.licenses ) AS sbom_licenses (spdx_identifier text, name text, url text) ON true
WHERE
sbom_occurrences.project_id IN (
SELECT
projects.id
FROM
projects
WHERE
projects.namespace_id IN (
SELECT
namespaces.traversal_ids[array_length(
namespaces.traversal_ids,
1
)] AS id
FROM
namespaces
WHERE
namespaces.type = 'Group' AND
traversal_ids @> '{9970}'
) AND
projects.marked_for_deletion_at IS NULL AND
projects.pending_delete = false
) AND
licenses <> '[]' AND
sbom_licenses.spdx_identifier IS NOT NULL
ORDER BY
sbom_licenses.spdx_identifier ASC
LIMIT 100;
DB Lab: https://console.postgres.ai/shared/34f1a946-9524-4143-abae-c22c00fc9537
Execution Plan:
Limit (cost=256934.13..256934.38 rows=100 width=96) (actual time=124116.240..124116.272 rows=96 loops=1)
Buffers: shared hit=17334 read=81355 dirtied=2202
I/O Timings: read=121638.439 write=0.000
-> Sort (cost=256934.13..256934.38 rows=100 width=96) (actual time=124116.236..124116.260 rows=96 loops=1)
Sort Key: sbom_licenses.spdx_identifier
Sort Method: quicksort Memory: 39kB
Buffers: shared hit=17334 read=81355 dirtied=2202
I/O Timings: read=121638.439 write=0.000
-> HashAggregate (cost=256929.81..256930.81 rows=100 width=96) (actual time=124116.052..124116.091 rows=96 loops=1)
Group Key: sbom_licenses.spdx_identifier, sbom_licenses.name, sbom_licenses.url
Buffers: shared hit=17331 read=81355 dirtied=2202
I/O Timings: read=121638.439 write=0.000
-> Nested Loop (cost=20607.03..252763.56 rows=555500 width=96) (actual time=13517.717..123845.886 rows=73172 loops=1)
Buffers: shared hit=17331 read=81355 dirtied=2202
I/O Timings: read=121638.439 write=0.000
-> Nested Loop (cost=20607.02..245495.03 rows=5555 width=75) (actual time=13517.677..123438.121 rows=70205 loops=1)
Buffers: shared hit=17331 read=81355 dirtied=2202
I/O Timings: read=121638.439 write=0.000
-> HashAggregate (cost=20606.46..20731.72 rows=12526 width=4) (actual time=13473.886..13488.122 rows=3662 loops=1)
Group Key: projects.id
Buffers: shared hit=1885 read=5144 dirtied=198
I/O Timings: read=13235.142 write=0.000
-> Nested Loop (cost=1058.01..20575.14 rows=12526 width=4) (actual time=2683.125..13454.540 rows=3662 loops=1)
Buffers: shared hit=1885 read=5144 dirtied=198
I/O Timings: read=13235.142 write=0.000
-> HashAggregate (cost=1057.57..1062.66 rows=509 width=28) (actual time=2653.519..2656.166 rows=811 loops=1)
Group Key: namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)]
Buffers: shared hit=169 read=727 dirtied=34
I/O Timings: read=2604.582 write=0.000
-> Bitmap Heap Scan on public.namespaces (cost=258.50..1056.29 rows=509 width=28) (actual time=12.881..2646.939 rows=811 loops=1)
Buffers: shared hit=169 read=727 dirtied=34
I/O Timings: read=2604.582 write=0.000
-> Bitmap Index Scan using index_namespaces_on_traversal_ids_for_groups (cost=0.00..258.37 rows=509 width=0) (actual time=5.487..5.489 rows=811 loops=1)
Index Cond: (namespaces.traversal_ids @> '{9970}'::integer[])
Buffers: shared hit=169 read=1
I/O Timings: read=1.223 write=0.000
-> Index Scan using index_projects_on_namespace_id_and_repository_size_limit on public.projects (cost=0.44..38.08 rows=25 width=8) (actual time=3.819..13.299 rows=5 loops=811)
Index Cond: (projects.namespace_id = (namespaces.traversal_ids)[array_length(namespaces.traversal_ids, 1)])
Filter: ((projects.marked_for_deletion_at IS NULL) AND (NOT projects.pending_delete))
Rows Removed by Filter: 0
Buffers: shared hit=1716 read=4417 dirtied=164
I/O Timings: read=10630.560 write=0.000
-> Index Scan using idx_sbom_occurrences_on_project_id_and_source_id on public.sbom_occurrences (cost=0.56..13.76 rows=418 width=83) (actual time=3.743..30.001 rows=19 loops=3662)
Index Cond: (sbom_occurrences.project_id = projects.id)
Filter: (sbom_occurrences.licenses <> '[]'::jsonb)
Rows Removed by Filter: 10
Buffers: shared hit=15446 read=76211 dirtied=2004
I/O Timings: read=108403.297 write=0.000
-> Memoize (cost=0.01..1.01 rows=100 width=96) (actual time=0.002..0.003 rows=1 loops=70205)
I/O Timings: read=0.000 write=0.000
-> Function Scan on jsonb_to_recordset sbom_licenses (cost=0.00..1.00 rows=100 width=96) (actual time=0.015..0.016 rows=2 loops=172)
Filter: (sbom_licenses.spdx_identifier IS NOT NULL)
Rows Removed by Filter: 0
I/O Timings: read=0.000 write=0.000
Timings:
Time: 2.069 min
- planning: 10.009 ms
- execution: 2.069 min
- I/O read: 2.027 min
- I/O write: 0.000 ms
Shared buffers:
- hits: 17334 (~135.40 MiB) from the buffer pool
- reads: 81355 (~635.60 MiB) from the OS file cache, including disk I/O
- dirtied: 2202 (~17.20 MiB)
- writes: 0
This query is expensive because we have to read all sbom_occurrences
in the group before we can deduplicate the licenses and narrow down the list to unique values only.
Edited by Brian Williams