Skip to content

Drop Vulnerabilites that would be invalid as well

What does this MR do and why?

Modify Gitlab::BackgroundMigration::RemoveDuplicateVulnerabilitiesFindings to drop Vulnerability objects that would be invalid since the corresponding Vulnerability::Finding object was dropped.

How to set up and validate locally

bundle exec spring rspec spec/lib/gitlab/background_migration/remove_duplicate_vulnerabilities_findings_spec.rb

Database review

This MR is NOT going to schedule this background migration, the scheduling is going to happen in a different MR.

I also updated the specs to reflect actual values from the database so that we're 100% sure this is going to have the intended effect.

How many records are affected?

We have ~26463 pairs of duplicates so ~26463 records will be dropped. Any of those Vulnerability::Finding objects can be assigned to a Vulnerability so in the worst case scenario we will be:

  • dropping 26463 Vulnerability::Finding records
  • dropping 26463 Vulnerability records

Based on:

SELECT DISTINCT report_type, location_fingerprint, primary_identifier_id, project_id, array_agg(id) as ids, array_agg(uuid) as uuids
FROM vulnerability_occurrences
GROUP BY report_type, location_fingerprint, primary_identifier_id, project_id
HAVING (COUNT(*) > 1) AND (array_length(array_agg(vulnerability_id) FILTER (WHERE vulnerability_id IS NOT NULL), 1) = 1);

there are 199 duplicate pairs where only one Vulnerabilities::Finding has a Vulnerability associated so the count of records dropped would be 26463 and 26264

Selecting a batch of 1000 Vulnerability::Findings

EXPLAIN ANALYZE WITH "batch" AS MATERIALIZED (
    SELECT "vulnerability_occurrences"."report_type", "vulnerability_occurrences"."location_fingerprint", "vulnerability_occurrences"."primary_identifier_id", "vulnerability_occurrences"."project_id"
    FROM "vulnerability_occurrences"
    WHERE "vulnerability_occurrences"."id" BETWEEN 1 AND 1260
)
SELECT DISTINCT "batch"."report_type", "batch"."location_fingerprint", "batch"."primary_identifier_id", "batch"."project_id", array_agg(id) as ids
FROM "batch"
AS batch
INNER JOIN vulnerability_occurrences ON vulnerability_occurrences.report_type = batch.report_type AND vulnerability_occurrences.location_fingerprint = batch.location_fingerprint AND vulnerability_occurrences.primary_identifier_id = batch.primary_identifier_id AND vulnerability_occurrences.project_id = batch.project_id
GROUP BY "batch"."report_type", "batch"."location_fingerprint", "batch"."primary_identifier_id", "batch"."project_id"
HAVING (COUNT(*) > 1);

cold cache: https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6785/commands/23979

warm cache: https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6785/commands/23971

Selecting associated Vulnerabilities

My understanding is that we would usually have two duplicate Vulnerabilities::Finding and both of those Vulnerabilities::Finding can have an associated Vulnerability:

SELECT "vulnerability_occurrences"."vulnerability_id" FROM "vulnerability_occurrences" WHERE "vulnerability_occurrences"."id" IN (5606961, 8765432);

This query is relatively simple so I think we can assume it's not going to timeout.

Dropping a batch of Vulnerability::Findings objects

DELETE FROM "vulnerability_occurrences" WHERE "vulnerability_occurrences"."id" IN (5606961, 8765432);

cold cache: https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6785/commands/23975

warm cache: https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6785/commands/23976

Dropping a batch of associated Vulnerability objects

DELETE FROM "vulnerabilities" WHERE "vulnerabilities"."id" IN (51, 52);

cold cache: https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6785/commands/23977

warm cache: https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6785/commands/23978

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #341917

Edited by Michał Zając

Merge request reports