Improve performance of `Sbom::DependencyLocationsFinder` filtering
Why are we doing this work
As per this discussion and the performance concern that has been raised during this MR review, an improvement of this query is required.
Current performance:
Link to the query plan: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/20400/commands/66625
SELECT "sbom_occurrences".*
FROM "sbom_occurrences"
WHERE "sbom_occurrences"."source_id" IN (
SELECT "sbom_sources"."id"
FROM "sbom_sources"
WHERE (source->'input_file'->>'path' ILIKE '%yarn%')
)
AND "sbom_occurrences"."component_id" = 216
AND "sbom_occurrences"."project_id" IN (
SELECT "projects"."id"
FROM "projects"
WHERE "projects"."namespace_id" IN (
SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
FROM "namespaces"
WHERE "namespaces"."type" = 'Group' AND (traversal_ids @> ('{6543}'))
)
)
LIMIT 50;
Proposed implementation:
Link to the query plan: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/20632/commands/67704
SELECT "sbom_occurrences".*
FROM "sbom_occurrences"
WHERE "sbom_occurrences"."input_file_path" ILIKE '%yarn%'
AND "sbom_occurrences"."component_id" = 216
AND "sbom_occurrences"."project_id" IN (
SELECT "projects"."id"
FROM "projects"
WHERE "projects"."namespace_id" IN (
SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
FROM "namespaces"
WHERE "namespaces"."type" = 'Group'
AND (traversal_ids @> ('{6543}'))
)
)
LIMIT 50;
required steps to the above:
exec alter table sbom_occurrences add column input_file_path text;
exec CREATE INDEX idx_sbom_occurrences_on_input_file_path ON sbom_occurrences USING btree (input_file_path);
exec CREATE INDEX idx_sbom_occurrences_on_project_component_input_file_path ON sbom_occurrences USING btree (project_id,component_id,input_file_path);
exec update sbom_occurrences
set input_file_path = another.input_file_path
from (
select sbom_occurrences.id as id, sbom_sources.source->'input_file'->'path' as input_file_path
from sbom_occurrences
inner join sbom_sources on sbom_sources.id = sbom_occurrences.source_id
where sbom_occurrences.project_id IN (
SELECT "projects"."id"
FROM "projects"
WHERE "projects"."namespace_id" IN (
SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
FROM "namespaces"
WHERE "namespaces"."type" = 'Group'
AND (traversal_ids @> ('{6543}'))
)
)
) another
where sbom_occurrences.id = another.id
and sbom_occurrences.project_id IN (
SELECT "projects"."id"
FROM "projects"
WHERE "projects"."namespace_id" IN (
SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
FROM "namespaces"
WHERE "namespaces"."type" = 'Group'
AND (traversal_ids @> ('{6543}'))
)
);
Relevant links
Non-functional requirements
-
Documentation: -
Feature flag: -
Performance: -
Testing:
Implementation plan
-
database create a text column input_file_path
as part of sbom_occurrences; -
database create a new index for the input_file_path
column; -
database create also a new index for project_id,component_id,input_file_path
; -
backend update query to Sbom::DependencyLocationsFinder
to use the above column; -
backend create a migration to backfill sbom_occurrences.input_file_path
based onsbom_sources.source->'input_file'->'path'
; -
backend update IngestOccurrences
to includeinput_file_path
;
Verification steps
Edited by Zamir Martins