Remove duplicates from vulnerability_occurrences
What does this MR do?
This MR removes rows from vulnerability_occurrences for which we would get a duplicate UUIDv5. See #292239 (closed) for more information about why this is a problem.
Related to #212322 (closed)
Queries
Total time expected is (6039407/10000) * 0.061ms ~= 36840.3827ms ~= 37s. According to https://docs.gitlab.com/ee/development/database_review.html#timing-guidelines-for-migrations a post-migration has recommended execution time of 10 minutes which should be plenty.
Adding index
CREATE INDEX tmp_idx_deduplicate_vulnerability_occurrences ON vulnerability_occurrences(project_id,report_type,location_fingerprint,primary_identifier_id,id);Timings
The query has been executed. Duration: 3.453 minSelecting one batch of duplicates
WITH batch AS (
  SELECT vulnerability_occurrences.report_type, vulnerability_occurrences.location_fingerprint, vulnerability_occurrences.primary_identifier_id, vulnerability_occurrences.project_id 
  FROM vulnerability_occurrences 
  WHERE vulnerability_occurrences.id >= 1
  AND vulnerability_occurrences.id < 10000
) 
SELECT DISTINCT batch.report_type, batch.location_fingerprint, batch.primary_identifier_id, batch.project_id, array_agg(id) as ids 
FROM batch 
AS batch 
INNER JOIN vulnerability_occurrences 
ON vulnerability_occurrences.report_type = batch.report_type
AND vulnerability_occurrences.location_fingerprint = batch.location_fingerprint
AND vulnerability_occurrences.primary_identifier_id = batch.primary_identifier_id
AND vulnerability_occurrences.project_id = batch.project_id
GROUP BY batch.report_type, batch.location_fingerprint, batch.primary_identifier_id, batch.project_id
HAVING (COUNT(*) > 1);https://explain.depesz.com/s/V4li
Query plan
 Unique  (cost=40808.54..40808.56 rows=1 width=78) (actual time=60.442..60.442 rows=0 loops=1)
   CTE batch
     ->  Index Scan using vulnerability_occurrences_pkey on vulnerability_occurrences vulnerability_occurrences_1  (cost=0.43..9079.63 rows=8972 width=35) (actual time=0.018..12.312 rows=8985 loops=1)
           Index Cond: ((id >= 1) AND (id < 10000))
   ->  Sort  (cost=31728.90..31728.91 rows=1 width=78) (actual time=60.441..60.441 rows=0 loops=1)
         Sort Key: batch.report_type, batch.location_fingerprint, batch.primary_identifier_id, batch.project_id, (array_agg(vulnerability_occurrences.id))
         Sort Method: quicksort  Memory: 25kB
         ->  GroupAggregate  (cost=31728.86..31728.89 rows=1 width=78) (actual time=60.435..60.435 rows=0 loops=1)
               Group Key: batch.report_type, batch.location_fingerprint, batch.primary_identifier_id, batch.project_id
               Filter: (count(*) > 1)
               Rows Removed by Filter: 8985
               ->  Sort  (cost=31728.86..31728.87 rows=1 width=54) (actual time=55.904..56.443 rows=8985 loops=1)
                     Sort Key: batch.report_type, batch.location_fingerprint, batch.primary_identifier_id, batch.project_id
                     Sort Method: quicksort  Memory: 1086kB
                     ->  Nested Loop  (cost=0.56..31728.85 rows=1 width=54) (actual time=0.039..45.779 rows=8985 loops=1)
                           ->  CTE Scan on batch  (cost=0.00..179.44 rows=8972 width=46) (actual time=0.019..15.632 rows=8985 loops=1)
                           ->  Index Scan using index_vulnerability_occurrences_on_unique_keys on vulnerability_occurrences  (cost=0.56..3.51 rows=1 width=43) (actual time=0.003..0.003 rows=1 loops=8985)
                                 Index Cond: ((project_id = batch.project_id) AND (primary_identifier_id = batch.primary_identifier_id) AND (location_fingerprint = batch.location_fingerprint))
                                 Filter: (batch.report_type = report_type)
 Planning Time: 0.447 ms
 Execution Time: 60.784 msDoes this MR meet the acceptance criteria?
Conformity
- 
Changelog entry 
- 
Documentation (if required) 
- 
Code review guidelines 
- 
Merge request performance guidelines 
- 
Style guides 
- 
Database guides 
- 
Separation of EE specific content 
Availability and Testing
- 
Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process. 
- [-] Tested in all supported browsers
- [-] Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
- [-] Label as security and @ mention @gitlab-com/gl-security/appsec
- [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- [-] Security reports checked/validated by a reviewer from the AppSec team
Edited  by Michał Zając