Populate denormalized columns for the sbom_occurrences table (!129486) · Merge requests · GitLab.org / GitLab

Mehmet Emin INAC requested to merge minac_populate_denormalized_columns_for_sbom_occurrences into master Aug 16, 2023

What does this MR do and why?

This change introduces a background migration to populate denormalized columns for the sbom_occurrences table.

Database review

There are ~14M records in the sbom_occurrences table so with the batch size of 5K there will be 2_800 batched migration job created. With the job interval of 2 minutes, the migration will take (2800 * 2 / 60 / 24) ~= 4 days to complete.

rake command outputs

rake db:migrate:up

main: == [advisory_lock_connection] object_id: 220440, pg_backend_pid: 26951
main: == 20230815140656 QueuePopulateDenormalizedColumnsForSbomOccurrences: migrating
main: == 20230815140656 QueuePopulateDenormalizedColumnsForSbomOccurrences: migrated (0.0661s)

main: == [advisory_lock_connection] object_id: 220440, pg_backend_pid: 26951

rake db:migrate:down

main: == [advisory_lock_connection] object_id: 220940, pg_backend_pid: 26480
main: == 20230815140656 QueuePopulateDenormalizedColumnsForSbomOccurrences: reverting
main: == 20230815140656 QueuePopulateDenormalizedColumnsForSbomOccurrences: reverted (0.0242s)

main: == [advisory_lock_connection] object_id: 220940, pg_backend_pid: 26480

Queries

The batching will be done for the sbom_occurrences table by using the primary key.

Updating the `sbom_occurrences` records

WITH update_data AS (
    SELECT
        sbom_occurrences.id AS occurrence_id,
        sbom_components.name AS component_name,
        sbom_sources.source -> 'package_manager' ->> 'name' AS package_manager,
        sbom_sources.source -> 'input_file' ->> 'path' AS input_file_path
    FROM
        sbom_occurrences
        INNER JOIN sbom_components ON sbom_components.id = sbom_occurrences.component_id
        LEFT OUTER JOIN sbom_sources ON sbom_sources.id = sbom_occurrences.source_id
    WHERE
        sbom_occurrences.id IN (2989727, 2989728, 2989729, 2989730, 2989731, 2989732, 2989733, 2989734, 2989735, 2989736, 2989737, 2989738, 2989739, 2989740, 2989741, 2989742, 2989743, 2989744, 2989745, 2989746, 2989747, 2989748, 2989749, 2989750, 2989751, 2989752, 2989753, 2989754, 2989755, 2989756, 2989757, 2989758, 2989759, 2989760, 2989761, 2989762, 2989763, 2989764, 2989765, 2989766, 2989767, 2989768, 2989769, 2989770, 2989771, 2989772, 2989773, 2989774, 2989775, 2989776, 2989777, 2989778, 2989779, 2989780, 2989781, 2989782, 2989783, 2989784, 2989785, 2989786, 2989787, 2989788, 2989789, 2989790, 2989791, 2989792, 2989793, 2989794, 2989795, 2989796, 2989797, 2989798, 2989799, 2989800, 2989801, 2989802, 2989803, 2989804, 2989805, 2989806, 2989807, 2989808, 2989809, 2989810, 2989811, 2989812, 2989813, 2989814, 2989815, 2989816, 2989817, 2989818, 2989819, 2989820, 2989821, 2989822, 2989823, 2989824, 2989825, 2989826))
UPDATE
    sbom_occurrences
SET
    component_name = update_data.component_name,
    package_manager = update_data.package_manager,
    input_file_path = update_data.input_file_path
FROM
    update_data
WHERE
    sbom_occurrences.id = update_data.occurrence_id

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/21437/commands/69917

Updating 100 records in a single SQL query takes around 300ms and uses around 45MiB of shared buffers. If I reduce the number of records updated per batch to 50, the execution takes 160ms and the shared buffer usage drops to 23MiB. So it's up to the maintainer to decide on the batch size.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Edited Aug 18, 2023 by Mehmet Emin INAC

Populate denormalized columns for the sbom_occurrences table

What does this MR do and why?

Database review

rake command outputs

Queries

MR acceptance checklist

Merge request reports