Improve performance of `Sbom::DependencyLocationsFinder` filtering

Why are we doing this work

As per this discussion and the performance concern that has been raised during this MR review, an improvement of this query is required.

Current performance:

Link to the query plan: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/20400/commands/66625

SELECT "sbom_occurrences".*
FROM "sbom_occurrences"
WHERE "sbom_occurrences"."source_id" IN (
  SELECT "sbom_sources"."id"
  FROM "sbom_sources"
  WHERE (source->'input_file'->>'path' ILIKE '%yarn%')
)
AND "sbom_occurrences"."component_id" = 216
AND "sbom_occurrences"."project_id" IN (
  SELECT "projects"."id"
  FROM "projects"
  WHERE "projects"."namespace_id" IN (
    SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
    FROM "namespaces"
    WHERE "namespaces"."type" = 'Group' AND (traversal_ids @> ('{6543}'))
  )
)
LIMIT 50;

Proposed implementation:

Link to the query plan: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/20632/commands/67704

SELECT "sbom_occurrences".*
FROM "sbom_occurrences"
WHERE "sbom_occurrences"."input_file_path" ILIKE '%yarn%'
AND "sbom_occurrences"."component_id" = 216
AND "sbom_occurrences"."project_id" IN (
  SELECT "projects"."id"
  FROM "projects"
  WHERE "projects"."namespace_id" IN (
    SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
    FROM "namespaces"
    WHERE "namespaces"."type" = 'Group'
    AND (traversal_ids @> ('{6543}'))
  )
)
LIMIT 50;

required steps to the above:

exec alter table sbom_occurrences add column input_file_path text;
exec CREATE INDEX idx_sbom_occurrences_on_input_file_path ON sbom_occurrences USING btree (input_file_path);
exec CREATE INDEX idx_sbom_occurrences_on_project_component_input_file_path ON sbom_occurrences USING btree (project_id,component_id,input_file_path);
exec update sbom_occurrences
set input_file_path = another.input_file_path
from (
  select sbom_occurrences.id as id, sbom_sources.source->'input_file'->'path' as input_file_path
  from sbom_occurrences
  inner join sbom_sources on sbom_sources.id = sbom_occurrences.source_id
  where sbom_occurrences.project_id IN (
    SELECT "projects"."id"
    FROM "projects"
    WHERE "projects"."namespace_id" IN (
      SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
      FROM "namespaces"
      WHERE "namespaces"."type" = 'Group'
      AND (traversal_ids @> ('{6543}'))
    )
    ) 
) another
where sbom_occurrences.id = another.id
and sbom_occurrences.project_id IN (
  SELECT "projects"."id"
  FROM "projects"
  WHERE "projects"."namespace_id" IN (
    SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
    FROM "namespaces"
    WHERE "namespaces"."type" = 'Group'
    AND (traversal_ids @> ('{6543}'))
  )
);

Relevant links

Non-functional requirements

  • Documentation:
  • Feature flag:
  • Performance:
  • Testing:

Implementation plan

  • database create a text column input_file_path as part of sbom_occurrences;
  • database create a new index for the input_file_path column;
  • database create also a new index for project_id,component_id,input_file_path;
  • backend update query to Sbom::DependencyLocationsFinder to use the above column;
  • backend create a migration to backfill sbom_occurrences.input_file_path based on sbom_sources.source->'input_file'->'path';
  • backend update IngestOccurrences to include input_file_path;

Verification steps

Edited by Zamir Martins