Use the denormalized attributes to query SBOM occurrences (!128898) · Merge requests · GitLab.org / GitLab

Mehmet Emin INAC requested to merge minac_419533_419424_change_queries_to_utilize_denormalized_attributes into master Aug 09, 2023

What does this MR do and why?

With this change, we will be using the denormalized attributes to sort the SBOM occurrences by component name and search the SBOM occurrences by input_file_path instead of joining other tables.

Database review

Migration outputs

This MR introduces a post-deployment migration to create an index, which will be utilized in one of the queries.

rake db:migrate:up

main: == [advisory_lock_connection] object_id: 220740, pg_backend_pid: 17284
main: == 20230809133249 IndexSbomOccurrencesOnProjectIdComponentIdAndInputFilePath: migrating
main: -- transaction_open?()
main:    -> 0.0000s
main: -- view_exists?(:postgres_partitions)
main:    -> 0.0938s
main: -- index_exists?(:sbom_occurrences, [:project_id, :component_id, :input_file_path], {:name=>"index_sbom_occurrences_for_input_file_path_search", :algorithm=>:concurrently})
main:    -> 0.0037s
main: -- execute("SET statement_timeout TO 0")
main:    -> 0.0003s
main: -- add_index(:sbom_occurrences, [:project_id, :component_id, :input_file_path], {:name=>"index_sbom_occurrences_for_input_file_path_search", :algorithm=>:concurrently})
main:    -> 0.0015s
main: -- execute("RESET statement_timeout")
main:    -> 0.0003s
main: == 20230809133249 IndexSbomOccurrencesOnProjectIdComponentIdAndInputFilePath: migrated (0.1118s)

rake db:migrate:down

main: == [advisory_lock_connection] object_id: 220760, pg_backend_pid: 18006
main: == 20230809133249 IndexSbomOccurrencesOnProjectIdComponentIdAndInputFilePath: reverting
main: -- transaction_open?()
main:    -> 0.0000s
main: -- view_exists?(:postgres_partitions)
main:    -> 0.0259s
main: -- indexes(:sbom_occurrences)
main:    -> 0.0078s
main: -- execute("SET statement_timeout TO 0")
main:    -> 0.0009s
main: -- remove_index(:sbom_occurrences, {:algorithm=>:concurrently, :name=>"index_sbom_occurrences_for_input_file_path_search"})
main:    -> 0.0022s
main: -- execute("RESET statement_timeout")
main:    -> 0.0006s
main: == 20230809133249 IndexSbomOccurrencesOnProjectIdComponentIdAndInputFilePath: reverted (0.0554s)

main: == [advisory_lock_connection] object_id: 220760, pg_backend_pid: 18006

Queries

There are two queries changed in the scope of this MR.

1. Sorting the `sbom_occurrences` by component name

I couldn't find any index to improve this query but eliminating the need of joining the sbom_sources table already removes the I/O and improves the performance of the query.

Old query

SELECT
    sbom_occurrences.*,
    agg_occurrences.occurrence_count,
    agg_occurrences.project_count
FROM
    sbom_occurrences
    INNER JOIN (
        SELECT
            component_id,
            COUNT(DISTINCT id) AS occurrence_count,
            COUNT(DISTINCT project_id) AS project_count
        FROM
            sbom_occurrences
        WHERE
            project_id IN (
                SELECT
                    "projects"."id"
                FROM
                    "projects"
                WHERE
                    "projects"."namespace_id" IN (
                        SELECT
                            namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
                        FROM
                            "namespaces"
                        WHERE
                            "namespaces"."type" = 'Group'
                            AND (traversal_ids @> ('{68921140}'))))
                GROUP BY
                    component_id) agg_occurrences ON sbom_occurrences.component_id = agg_occurrences.component_id
    INNER JOIN "sbom_components" ON "sbom_components"."id" = "sbom_occurrences"."component_id"
WHERE
    "sbom_occurrences"."project_id" IN (
        SELECT
            "projects"."id"
        FROM
            "projects"
        WHERE
            "projects"."namespace_id" IN (
                SELECT
                    namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
                FROM
                    "namespaces"
                WHERE
                    "namespaces"."type" = 'Group'
                    AND (traversal_ids @> ('{68921140}'))))
    ORDER BY
        sbom_components.name ASC
    LIMIT 20 OFFSET 0;

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/21245/commands/69297

New query

SELECT
    sbom_occurrences.*,
    agg_occurrences.occurrence_count,
    agg_occurrences.project_count
FROM
    sbom_occurrences
    INNER JOIN (
        SELECT
            component_id,
            COUNT(DISTINCT id) AS occurrence_count,
            COUNT(DISTINCT project_id) AS project_count
        FROM
            sbom_occurrences
        WHERE
            project_id IN (
                SELECT
                    "projects"."id"
                FROM
                    "projects"
                WHERE
                    "projects"."namespace_id" IN (
                        SELECT
                            namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
                        FROM
                            "namespaces"
                        WHERE
                            "namespaces"."type" = 'Group'
                            AND (traversal_ids @> ('{68921140}'))))
                GROUP BY
                    component_id) agg_occurrences ON sbom_occurrences.component_id = agg_occurrences.component_id
WHERE
    "sbom_occurrences"."project_id" IN (
        SELECT
            "projects"."id"
        FROM
            "projects"
        WHERE
            "projects"."namespace_id" IN (
                SELECT
                    namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
                FROM
                    "namespaces"
                WHERE
                    "namespaces"."type" = 'Group'
                    AND (traversal_ids @> ('{68921140}'))))
    ORDER BY
        sbom_occurrences.component_name ASC
    LIMIT 20 OFFSET 0;

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/21245/commands/69298

2. Searching the `sbom_occurrences` by `input_file_path`

Both the old and new queries are performing similarly but the old query is joining to the sbom_sources table and the new query is not. Also, we've changed the feature to search sbom_occurrences by exact input_file_path instead of using the ILIKE predicate. This makes it possible to utilize a BTREE index.

Old query

SELECT
    "sbom_occurrences".*
FROM
    "sbom_occurrences"
WHERE
    "sbom_occurrences"."source_id" IN (
        SELECT
            "sbom_sources"."id"
        FROM
            "sbom_sources"
        WHERE (source -> 'input_file' ->> 'path' ILIKE '%yarn%'))
    AND "sbom_occurrences"."component_id" = 216
    AND "sbom_occurrences"."project_id" IN (
        SELECT
            "projects"."id"
        FROM
            "projects"
        WHERE
            "projects"."namespace_id" IN (
                SELECT
                    namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
                FROM
                    "namespaces"
                WHERE
                    "namespaces"."type" = 'Group'
                    AND (traversal_ids @> ('{6543}'))))
    LIMIT 50;

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/21245/commands/69315

New query

SELECT
    "sbom_occurrences".*
FROM
    "sbom_occurrences"
WHERE
    "sbom_occurrences"."input_file_path" = 'yarn.lock'
    AND "sbom_occurrences"."component_id" = 216
    AND "sbom_occurrences"."project_id" IN (
        SELECT
            "projects"."id"
        FROM
            "projects"
        WHERE
            "projects"."namespace_id" IN (
                SELECT
                    namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
                FROM
                    "namespaces"
                WHERE
                    "namespaces"."type" = 'Group'
                    AND (traversal_ids @> ('{6543}'))))
    LIMIT 50;

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/21245/commands/69300

Important question for the database maintainer

When I use the ILIKE predicate instead of using = for the second query, I see similar results(both versions execute in ~80ms). I think that the index created in this MR can still be utilized for the ILIKE predicate as probably the database does not need to read the actual rows from the disk to filter them as the column already exists in the index.

Question: Can we use the ILIKE predicate here?

An example query

SELECT
    "sbom_occurrences".*
FROM
    "sbom_occurrences"
WHERE
    "sbom_occurrences"."input_file_path" ILIKE '%yarn%'
    AND "sbom_occurrences"."component_id" = 216
    AND "sbom_occurrences"."project_id" IN (
        SELECT
            "projects"."id"
        FROM
            "projects"
        WHERE
            "projects"."namespace_id" IN (
                SELECT
                    namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
                FROM
                    "namespaces"
                WHERE
                    "namespaces"."type" = 'Group'
                    AND (traversal_ids @> ('{6543}'))))
    LIMIT 50;

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/21245/commands/69324

Related to #419533 (closed), and #419424 (closed).

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Edited Aug 14, 2023 by Mehmet Emin INAC

Use the denormalized attributes to query SBOM occurrences