Populate denormalized columns for the sbom_occurrences table
What does this MR do and why?
This change introduces a background migration to populate denormalized columns for the sbom_occurrences
table.
Related to #419533 (closed), #419424 (closed), and #415665 (closed).
Database review
There are ~14M records in the sbom_occurrences
table so with the batch size of 5K there will be 2_800 batched migration job created. With the job interval of 2 minutes, the migration will take (2800 * 2 / 60 / 24) ~= 4 days to complete.
rake command outputs
rake db:migrate:up
main: == [advisory_lock_connection] object_id: 220440, pg_backend_pid: 26951
main: == 20230815140656 QueuePopulateDenormalizedColumnsForSbomOccurrences: migrating
main: == 20230815140656 QueuePopulateDenormalizedColumnsForSbomOccurrences: migrated (0.0661s)
main: == [advisory_lock_connection] object_id: 220440, pg_backend_pid: 26951
rake db:migrate:down
main: == [advisory_lock_connection] object_id: 220940, pg_backend_pid: 26480
main: == 20230815140656 QueuePopulateDenormalizedColumnsForSbomOccurrences: reverting
main: == 20230815140656 QueuePopulateDenormalizedColumnsForSbomOccurrences: reverted (0.0242s)
main: == [advisory_lock_connection] object_id: 220940, pg_backend_pid: 26480
Queries
The batching will be done for the sbom_occurrences
table by using the primary key.
Updating the `sbom_occurrences` records
WITH update_data AS (
SELECT
sbom_occurrences.id AS occurrence_id,
sbom_components.name AS component_name,
sbom_sources.source -> 'package_manager' ->> 'name' AS package_manager,
sbom_sources.source -> 'input_file' ->> 'path' AS input_file_path
FROM
sbom_occurrences
INNER JOIN sbom_components ON sbom_components.id = sbom_occurrences.component_id
LEFT OUTER JOIN sbom_sources ON sbom_sources.id = sbom_occurrences.source_id
WHERE
sbom_occurrences.id IN (2989727, 2989728, 2989729, 2989730, 2989731, 2989732, 2989733, 2989734, 2989735, 2989736, 2989737, 2989738, 2989739, 2989740, 2989741, 2989742, 2989743, 2989744, 2989745, 2989746, 2989747, 2989748, 2989749, 2989750, 2989751, 2989752, 2989753, 2989754, 2989755, 2989756, 2989757, 2989758, 2989759, 2989760, 2989761, 2989762, 2989763, 2989764, 2989765, 2989766, 2989767, 2989768, 2989769, 2989770, 2989771, 2989772, 2989773, 2989774, 2989775, 2989776, 2989777, 2989778, 2989779, 2989780, 2989781, 2989782, 2989783, 2989784, 2989785, 2989786, 2989787, 2989788, 2989789, 2989790, 2989791, 2989792, 2989793, 2989794, 2989795, 2989796, 2989797, 2989798, 2989799, 2989800, 2989801, 2989802, 2989803, 2989804, 2989805, 2989806, 2989807, 2989808, 2989809, 2989810, 2989811, 2989812, 2989813, 2989814, 2989815, 2989816, 2989817, 2989818, 2989819, 2989820, 2989821, 2989822, 2989823, 2989824, 2989825, 2989826))
UPDATE
sbom_occurrences
SET
component_name = update_data.component_name,
package_manager = update_data.package_manager,
input_file_path = update_data.input_file_path
FROM
update_data
WHERE
sbom_occurrences.id = update_data.occurrence_id
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/21437/commands/69917
Updating 100 records in a single SQL query takes around 300ms and uses around 45MiB of shared buffers. If I reduce the number of records updated per batch to 50, the execution takes 160ms and the shared buffer usage drops to 23MiB. So it's up to the maintainer to decide on the batch size.
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.