Use the denormalized attributes to query SBOM occurrences
What does this MR do and why?
With this change, we will be using the denormalized attributes to sort the SBOM occurrences by component name and search the SBOM occurrences by input_file_path
instead of joining other tables.
Database review
Migration outputs
This MR introduces a post-deployment migration to create an index, which will be utilized in one of the queries.
rake db:migrate:up
main: == [advisory_lock_connection] object_id: 220740, pg_backend_pid: 17284
main: == 20230809133249 IndexSbomOccurrencesOnProjectIdComponentIdAndInputFilePath: migrating
main: -- transaction_open?()
main: -> 0.0000s
main: -- view_exists?(:postgres_partitions)
main: -> 0.0938s
main: -- index_exists?(:sbom_occurrences, [:project_id, :component_id, :input_file_path], {:name=>"index_sbom_occurrences_for_input_file_path_search", :algorithm=>:concurrently})
main: -> 0.0037s
main: -- execute("SET statement_timeout TO 0")
main: -> 0.0003s
main: -- add_index(:sbom_occurrences, [:project_id, :component_id, :input_file_path], {:name=>"index_sbom_occurrences_for_input_file_path_search", :algorithm=>:concurrently})
main: -> 0.0015s
main: -- execute("RESET statement_timeout")
main: -> 0.0003s
main: == 20230809133249 IndexSbomOccurrencesOnProjectIdComponentIdAndInputFilePath: migrated (0.1118s)
rake db:migrate:down
main: == [advisory_lock_connection] object_id: 220760, pg_backend_pid: 18006
main: == 20230809133249 IndexSbomOccurrencesOnProjectIdComponentIdAndInputFilePath: reverting
main: -- transaction_open?()
main: -> 0.0000s
main: -- view_exists?(:postgres_partitions)
main: -> 0.0259s
main: -- indexes(:sbom_occurrences)
main: -> 0.0078s
main: -- execute("SET statement_timeout TO 0")
main: -> 0.0009s
main: -- remove_index(:sbom_occurrences, {:algorithm=>:concurrently, :name=>"index_sbom_occurrences_for_input_file_path_search"})
main: -> 0.0022s
main: -- execute("RESET statement_timeout")
main: -> 0.0006s
main: == 20230809133249 IndexSbomOccurrencesOnProjectIdComponentIdAndInputFilePath: reverted (0.0554s)
main: == [advisory_lock_connection] object_id: 220760, pg_backend_pid: 18006
Queries
There are two queries changed in the scope of this MR.
sbom_occurrences
by component name
1. Sorting the I couldn't find any index to improve this query but eliminating the need of joining the sbom_sources
table already removes the I/O and improves the performance of the query.
Old query
SELECT
sbom_occurrences.*,
agg_occurrences.occurrence_count,
agg_occurrences.project_count
FROM
sbom_occurrences
INNER JOIN (
SELECT
component_id,
COUNT(DISTINCT id) AS occurrence_count,
COUNT(DISTINCT project_id) AS project_count
FROM
sbom_occurrences
WHERE
project_id IN (
SELECT
"projects"."id"
FROM
"projects"
WHERE
"projects"."namespace_id" IN (
SELECT
namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
FROM
"namespaces"
WHERE
"namespaces"."type" = 'Group'
AND (traversal_ids @> ('{68921140}'))))
GROUP BY
component_id) agg_occurrences ON sbom_occurrences.component_id = agg_occurrences.component_id
INNER JOIN "sbom_components" ON "sbom_components"."id" = "sbom_occurrences"."component_id"
WHERE
"sbom_occurrences"."project_id" IN (
SELECT
"projects"."id"
FROM
"projects"
WHERE
"projects"."namespace_id" IN (
SELECT
namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
FROM
"namespaces"
WHERE
"namespaces"."type" = 'Group'
AND (traversal_ids @> ('{68921140}'))))
ORDER BY
sbom_components.name ASC
LIMIT 20 OFFSET 0;
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/21245/commands/69297
New query
SELECT
sbom_occurrences.*,
agg_occurrences.occurrence_count,
agg_occurrences.project_count
FROM
sbom_occurrences
INNER JOIN (
SELECT
component_id,
COUNT(DISTINCT id) AS occurrence_count,
COUNT(DISTINCT project_id) AS project_count
FROM
sbom_occurrences
WHERE
project_id IN (
SELECT
"projects"."id"
FROM
"projects"
WHERE
"projects"."namespace_id" IN (
SELECT
namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
FROM
"namespaces"
WHERE
"namespaces"."type" = 'Group'
AND (traversal_ids @> ('{68921140}'))))
GROUP BY
component_id) agg_occurrences ON sbom_occurrences.component_id = agg_occurrences.component_id
WHERE
"sbom_occurrences"."project_id" IN (
SELECT
"projects"."id"
FROM
"projects"
WHERE
"projects"."namespace_id" IN (
SELECT
namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
FROM
"namespaces"
WHERE
"namespaces"."type" = 'Group'
AND (traversal_ids @> ('{68921140}'))))
ORDER BY
sbom_occurrences.component_name ASC
LIMIT 20 OFFSET 0;
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/21245/commands/69298
sbom_occurrences
by input_file_path
2. Searching the Both the old and new queries are performing similarly but the old query is joining to the sbom_sources
table and the new query is not. Also, we've changed the feature to search sbom_occurrences
by exact input_file_path
instead of using the ILIKE
predicate. This makes it possible to utilize a BTREE
index.
Old query
SELECT
"sbom_occurrences".*
FROM
"sbom_occurrences"
WHERE
"sbom_occurrences"."source_id" IN (
SELECT
"sbom_sources"."id"
FROM
"sbom_sources"
WHERE (source -> 'input_file' ->> 'path' ILIKE '%yarn%'))
AND "sbom_occurrences"."component_id" = 216
AND "sbom_occurrences"."project_id" IN (
SELECT
"projects"."id"
FROM
"projects"
WHERE
"projects"."namespace_id" IN (
SELECT
namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
FROM
"namespaces"
WHERE
"namespaces"."type" = 'Group'
AND (traversal_ids @> ('{6543}'))))
LIMIT 50;
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/21245/commands/69315
New query
SELECT
"sbom_occurrences".*
FROM
"sbom_occurrences"
WHERE
"sbom_occurrences"."input_file_path" = 'yarn.lock'
AND "sbom_occurrences"."component_id" = 216
AND "sbom_occurrences"."project_id" IN (
SELECT
"projects"."id"
FROM
"projects"
WHERE
"projects"."namespace_id" IN (
SELECT
namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
FROM
"namespaces"
WHERE
"namespaces"."type" = 'Group'
AND (traversal_ids @> ('{6543}'))))
LIMIT 50;
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/21245/commands/69300
database maintainer
Important question for theWhen I use the ILIKE
predicate instead of using =
for the second query, I see similar results(both versions execute in ~80ms). I think that the index created in this MR can still be utilized for the ILIKE
predicate as probably the database does not need to read the actual rows from the disk to filter them as the column already exists in the index.
Question: Can we use the ILIKE
predicate here?
An example query
SELECT
"sbom_occurrences".*
FROM
"sbom_occurrences"
WHERE
"sbom_occurrences"."input_file_path" ILIKE '%yarn%'
AND "sbom_occurrences"."component_id" = 216
AND "sbom_occurrences"."project_id" IN (
SELECT
"projects"."id"
FROM
"projects"
WHERE
"projects"."namespace_id" IN (
SELECT
namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
FROM
"namespaces"
WHERE
"namespaces"."type" = 'Group'
AND (traversal_ids @> ('{6543}'))))
LIMIT 50;
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/21245/commands/69324
Related to #419533 (closed), and #419424 (closed).
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.