Skip to content

Use the denormalized attributes to query SBOM occurrences

What does this MR do and why?

With this change, we will be using the denormalized attributes to sort the SBOM occurrences by component name and search the SBOM occurrences by input_file_path instead of joining other tables.

Database review

Migration outputs

This MR introduces a post-deployment migration to create an index, which will be utilized in one of the queries.

rake db:migrate:up
main: == [advisory_lock_connection] object_id: 220740, pg_backend_pid: 17284
main: == 20230809133249 IndexSbomOccurrencesOnProjectIdComponentIdAndInputFilePath: migrating
main: -- transaction_open?()
main:    -> 0.0000s
main: -- view_exists?(:postgres_partitions)
main:    -> 0.0938s
main: -- index_exists?(:sbom_occurrences, [:project_id, :component_id, :input_file_path], {:name=>"index_sbom_occurrences_for_input_file_path_search", :algorithm=>:concurrently})
main:    -> 0.0037s
main: -- execute("SET statement_timeout TO 0")
main:    -> 0.0003s
main: -- add_index(:sbom_occurrences, [:project_id, :component_id, :input_file_path], {:name=>"index_sbom_occurrences_for_input_file_path_search", :algorithm=>:concurrently})
main:    -> 0.0015s
main: -- execute("RESET statement_timeout")
main:    -> 0.0003s
main: == 20230809133249 IndexSbomOccurrencesOnProjectIdComponentIdAndInputFilePath: migrated (0.1118s)
rake db:migrate:down
main: == [advisory_lock_connection] object_id: 220760, pg_backend_pid: 18006
main: == 20230809133249 IndexSbomOccurrencesOnProjectIdComponentIdAndInputFilePath: reverting
main: -- transaction_open?()
main:    -> 0.0000s
main: -- view_exists?(:postgres_partitions)
main:    -> 0.0259s
main: -- indexes(:sbom_occurrences)
main:    -> 0.0078s
main: -- execute("SET statement_timeout TO 0")
main:    -> 0.0009s
main: -- remove_index(:sbom_occurrences, {:algorithm=>:concurrently, :name=>"index_sbom_occurrences_for_input_file_path_search"})
main:    -> 0.0022s
main: -- execute("RESET statement_timeout")
main:    -> 0.0006s
main: == 20230809133249 IndexSbomOccurrencesOnProjectIdComponentIdAndInputFilePath: reverted (0.0554s)

main: == [advisory_lock_connection] object_id: 220760, pg_backend_pid: 18006

Queries

There are two queries changed in the scope of this MR.

1. Sorting the sbom_occurrences by component name

I couldn't find any index to improve this query but eliminating the need of joining the sbom_sources table already removes the I/O and improves the performance of the query.

Old query
SELECT
    sbom_occurrences.*,
    agg_occurrences.occurrence_count,
    agg_occurrences.project_count
FROM
    sbom_occurrences
    INNER JOIN (
        SELECT
            component_id,
            COUNT(DISTINCT id) AS occurrence_count,
            COUNT(DISTINCT project_id) AS project_count
        FROM
            sbom_occurrences
        WHERE
            project_id IN (
                SELECT
                    "projects"."id"
                FROM
                    "projects"
                WHERE
                    "projects"."namespace_id" IN (
                        SELECT
                            namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
                        FROM
                            "namespaces"
                        WHERE
                            "namespaces"."type" = 'Group'
                            AND (traversal_ids @> ('{68921140}'))))
                GROUP BY
                    component_id) agg_occurrences ON sbom_occurrences.component_id = agg_occurrences.component_id
    INNER JOIN "sbom_components" ON "sbom_components"."id" = "sbom_occurrences"."component_id"
WHERE
    "sbom_occurrences"."project_id" IN (
        SELECT
            "projects"."id"
        FROM
            "projects"
        WHERE
            "projects"."namespace_id" IN (
                SELECT
                    namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
                FROM
                    "namespaces"
                WHERE
                    "namespaces"."type" = 'Group'
                    AND (traversal_ids @> ('{68921140}'))))
    ORDER BY
        sbom_components.name ASC
    LIMIT 20 OFFSET 0;

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/21245/commands/69297

New query
SELECT
    sbom_occurrences.*,
    agg_occurrences.occurrence_count,
    agg_occurrences.project_count
FROM
    sbom_occurrences
    INNER JOIN (
        SELECT
            component_id,
            COUNT(DISTINCT id) AS occurrence_count,
            COUNT(DISTINCT project_id) AS project_count
        FROM
            sbom_occurrences
        WHERE
            project_id IN (
                SELECT
                    "projects"."id"
                FROM
                    "projects"
                WHERE
                    "projects"."namespace_id" IN (
                        SELECT
                            namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
                        FROM
                            "namespaces"
                        WHERE
                            "namespaces"."type" = 'Group'
                            AND (traversal_ids @> ('{68921140}'))))
                GROUP BY
                    component_id) agg_occurrences ON sbom_occurrences.component_id = agg_occurrences.component_id
WHERE
    "sbom_occurrences"."project_id" IN (
        SELECT
            "projects"."id"
        FROM
            "projects"
        WHERE
            "projects"."namespace_id" IN (
                SELECT
                    namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
                FROM
                    "namespaces"
                WHERE
                    "namespaces"."type" = 'Group'
                    AND (traversal_ids @> ('{68921140}'))))
    ORDER BY
        sbom_occurrences.component_name ASC
    LIMIT 20 OFFSET 0;

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/21245/commands/69298

2. Searching the sbom_occurrences by input_file_path

Both the old and new queries are performing similarly but the old query is joining to the sbom_sources table and the new query is not. Also, we've changed the feature to search sbom_occurrences by exact input_file_path instead of using the ILIKE predicate. This makes it possible to utilize a BTREE index.

Old query
SELECT
    "sbom_occurrences".*
FROM
    "sbom_occurrences"
WHERE
    "sbom_occurrences"."source_id" IN (
        SELECT
            "sbom_sources"."id"
        FROM
            "sbom_sources"
        WHERE (source -> 'input_file' ->> 'path' ILIKE '%yarn%'))
    AND "sbom_occurrences"."component_id" = 216
    AND "sbom_occurrences"."project_id" IN (
        SELECT
            "projects"."id"
        FROM
            "projects"
        WHERE
            "projects"."namespace_id" IN (
                SELECT
                    namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
                FROM
                    "namespaces"
                WHERE
                    "namespaces"."type" = 'Group'
                    AND (traversal_ids @> ('{6543}'))))
    LIMIT 50;

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/21245/commands/69315

New query
SELECT
    "sbom_occurrences".*
FROM
    "sbom_occurrences"
WHERE
    "sbom_occurrences"."input_file_path" = 'yarn.lock'
    AND "sbom_occurrences"."component_id" = 216
    AND "sbom_occurrences"."project_id" IN (
        SELECT
            "projects"."id"
        FROM
            "projects"
        WHERE
            "projects"."namespace_id" IN (
                SELECT
                    namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
                FROM
                    "namespaces"
                WHERE
                    "namespaces"."type" = 'Group'
                    AND (traversal_ids @> ('{6543}'))))
    LIMIT 50;

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/21245/commands/69300

Important question for the database maintainer

When I use the ILIKE predicate instead of using = for the second query, I see similar results(both versions execute in ~80ms). I think that the index created in this MR can still be utilized for the ILIKE predicate as probably the database does not need to read the actual rows from the disk to filter them as the column already exists in the index.

Question: Can we use the ILIKE predicate here?

An example query
SELECT
    "sbom_occurrences".*
FROM
    "sbom_occurrences"
WHERE
    "sbom_occurrences"."input_file_path" ILIKE '%yarn%'
    AND "sbom_occurrences"."component_id" = 216
    AND "sbom_occurrences"."project_id" IN (
        SELECT
            "projects"."id"
        FROM
            "projects"
        WHERE
            "projects"."namespace_id" IN (
                SELECT
                    namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
                FROM
                    "namespaces"
                WHERE
                    "namespaces"."type" = 'Group'
                    AND (traversal_ids @> ('{6543}'))))
    LIMIT 50;

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/21245/commands/69324

Related to #419533 (closed), and #419424 (closed).

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Mehmet Emin INAC

Merge request reports