Backfill vulnerability_count background migration
What does this MR do and why?
This MR adds a batched background migration to backfill the vulnerability_count
column inside the project_statistics
table.
We are running this migration on the project_settings
table, to enable better filtering using the has_vulnerabilities
column.
This is the third checkbox of this issue.
Must be merged only after this MR is merged.
The process of the migration:
We scope only projects that has vulnerabilities.
SELECT project_id
FROM project_settings
WHERE has_vulnerabilities = true
Then, we count the number of vulnerabilities those projects have:
SELECT count(*)
FROM vulnerabilities
WHERE project_id = <project_id>
And finally, we set this count to the project_statistics.vulnerability_count column
UPDATE project_statistics
SET vulnerability_count = <vulnerability_count>
WHERE project_id = project_id
Database review
Here are the queries used by the batched background migration;
Finding the lower bound of the sub-batch
SELECT
"project_settings"."project_id"
FROM
"project_settings"
WHERE
"project_settings"."project_id" BETWEEN 1 AND 213128123
AND (has_vulnerabilities IS TRUE)
ORDER BY
"project_settings"."project_id" ASC
LIMIT 1
https://console.postgres.ai/gitlab/gitlab-production-main/sessions/31211/commands/96949
Finding the upper bound of the sub-batch
SELECT
"project_settings"."project_id"
FROM
"project_settings"
WHERE
"project_settings"."project_id" BETWEEN 1 AND 213128123
AND (has_vulnerabilities IS TRUE)
AND "project_settings"."project_id" >= 21
ORDER BY
"project_settings"."project_id" ASC
LIMIT 1 OFFSET 100
https://console.postgres.ai/gitlab/gitlab-production-main/sessions/31211/commands/96950
Loading the project statistics record
SELECT
"project_statistics".*
FROM
"project_statistics"
WHERE
"project_statistics"."project_id" = 278964
LIMIT 1
https://console.postgres.ai/gitlab/gitlab-production-main/sessions/31211/commands/96951
Finding the latest vulnerability record for the project
SELECT
vulnerability_reads.*
FROM
vulnerability_reads
WHERE
vulnerability_reads.project_id = 278964
ORDER BY
vulnerability_reads.vulnerability_id DESC
LIMIT 1
https://console.postgres.ai/gitlab/gitlab-production-main/sessions/31211/commands/96952
Finding the lower bound for the vulnerability batching
SELECT
"vulnerability_reads"."vulnerability_id"
FROM
"vulnerability_reads"
WHERE (project_id = 278964
AND vulnerability_id <= 9231721312)
ORDER BY
"vulnerability_reads"."vulnerability_id" ASC
LIMIT 1
https://console.postgres.ai/gitlab/gitlab-production-main/sessions/31211/commands/96954
Finding the upper bound of a batch iteration for the vulnerability batching
SELECT
"vulnerability_reads"."vulnerability_id"
FROM
"vulnerability_reads"
WHERE (project_id = 278964
AND vulnerability_id <= 9231721312)
AND "vulnerability_reads"."vulnerability_id" >= 11449
ORDER BY
"vulnerability_reads"."vulnerability_id" ASC
LIMIT 1 OFFSET 1000
https://console.postgres.ai/gitlab/gitlab-production-main/sessions/31211/commands/96955
Counting the number of vulnerabilities in batch
SELECT
COUNT(*)
FROM
"vulnerability_reads"
WHERE (project_id = 278964
AND vulnerability_id <= 9231721312)
AND "vulnerability_reads"."vulnerability_id" >= 11449
AND "vulnerability_reads"."vulnerability_id" < 12449
https://console.postgres.ai/gitlab/gitlab-production-main/sessions/31211/commands/96956
Updating the project statistics
UPDATE
project_statistics
SET
vulnerability_count = vulnerability_count + 1000
WHERE
id = 278964
https://console.postgres.ai/gitlab/gitlab-production-main/sessions/31211/commands/96957
There are currently 275_850 project_settings
records in the database with has_vulnerabilities IS TRUE
. With the batch size of 1_000 records, we will run 276 jobs. Each job takes 2 minutes to complete so the total runtime will be around 9 hours.