Skip to content

Populate denormalized columns for the sbom_occurrences table

What does this MR do and why?

This change introduces a background migration to populate denormalized columns for the sbom_occurrences table.

Related to #419533 (closed), #419424 (closed), and #415665 (closed).

Database review

There are ~14M records in the sbom_occurrences table so with the batch size of 5K there will be 2_800 batched migration job created. With the job interval of 2 minutes, the migration will take (2800 * 2 / 60 / 24) ~= 4 days to complete.

rake command outputs

rake db:migrate:up
main: == [advisory_lock_connection] object_id: 220440, pg_backend_pid: 26951
main: == 20230815140656 QueuePopulateDenormalizedColumnsForSbomOccurrences: migrating
main: == 20230815140656 QueuePopulateDenormalizedColumnsForSbomOccurrences: migrated (0.0661s)

main: == [advisory_lock_connection] object_id: 220440, pg_backend_pid: 26951
rake db:migrate:down
main: == [advisory_lock_connection] object_id: 220940, pg_backend_pid: 26480
main: == 20230815140656 QueuePopulateDenormalizedColumnsForSbomOccurrences: reverting
main: == 20230815140656 QueuePopulateDenormalizedColumnsForSbomOccurrences: reverted (0.0242s)

main: == [advisory_lock_connection] object_id: 220940, pg_backend_pid: 26480

Queries

The batching will be done for the sbom_occurrences table by using the primary key.

Updating the `sbom_occurrences` records
WITH update_data AS (
    SELECT
        sbom_occurrences.id AS occurrence_id,
        sbom_components.name AS component_name,
        sbom_sources.source -> 'package_manager' ->> 'name' AS package_manager,
        sbom_sources.source -> 'input_file' ->> 'path' AS input_file_path
    FROM
        sbom_occurrences
        INNER JOIN sbom_components ON sbom_components.id = sbom_occurrences.component_id
        LEFT OUTER JOIN sbom_sources ON sbom_sources.id = sbom_occurrences.source_id
    WHERE
        sbom_occurrences.id IN (2989727, 2989728, 2989729, 2989730, 2989731, 2989732, 2989733, 2989734, 2989735, 2989736, 2989737, 2989738, 2989739, 2989740, 2989741, 2989742, 2989743, 2989744, 2989745, 2989746, 2989747, 2989748, 2989749, 2989750, 2989751, 2989752, 2989753, 2989754, 2989755, 2989756, 2989757, 2989758, 2989759, 2989760, 2989761, 2989762, 2989763, 2989764, 2989765, 2989766, 2989767, 2989768, 2989769, 2989770, 2989771, 2989772, 2989773, 2989774, 2989775, 2989776, 2989777, 2989778, 2989779, 2989780, 2989781, 2989782, 2989783, 2989784, 2989785, 2989786, 2989787, 2989788, 2989789, 2989790, 2989791, 2989792, 2989793, 2989794, 2989795, 2989796, 2989797, 2989798, 2989799, 2989800, 2989801, 2989802, 2989803, 2989804, 2989805, 2989806, 2989807, 2989808, 2989809, 2989810, 2989811, 2989812, 2989813, 2989814, 2989815, 2989816, 2989817, 2989818, 2989819, 2989820, 2989821, 2989822, 2989823, 2989824, 2989825, 2989826))
UPDATE
    sbom_occurrences
SET
    component_name = update_data.component_name,
    package_manager = update_data.package_manager,
    input_file_path = update_data.input_file_path
FROM
    update_data
WHERE
    sbom_occurrences.id = update_data.occurrence_id

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/21437/commands/69917

Updating 100 records in a single SQL query takes around 300ms and uses around 45MiB of shared buffers. If I reduce the number of records updated per batch to 50, the execution takes 160ms and the shared buffer usage drops to 23MiB. So it's up to the maintainer to decide on the batch size.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Mehmet Emin INAC

Merge request reports