Skip to content

Populate the `uuid` attributes of `security_findings` records

What does this MR do?

This MR introduces a new background migration to populate missing uuid values for the security_findings records.

If it can populate the uuid value for the record, it will also try to set the finding_uuid attribute of the related vulnerability_feedback records and if not, it will remove the record from the database.

Database review

This MR introduces two rake migrations;

  1. To introduce a temporary index to speed up the next one.
  2. To schedule the background jobs and the background job class to run actual data migration.

Rake migration

The rake migration will schedule jobs for 104_502 records with the batch of 1K which means 105 jobs will be scheduled.

The query which iterates through the records in batches is;
SELECT DISTINCT
    "security_findings"."scan_id"
FROM
    "security_findings"
WHERE
    "security_findings"."uuid" IS NULL
ORDER BY
    "security_findings"."scan_id" ASC
LIMIT $1

The query plan;

Limit  (cost=7726.55..7729.05 rows=1000 width=8) (actual time=36.900..37.020 rows=604 loops=1)
   Buffers: shared hit=723
   ->  Sort  (cost=7726.55..7841.38 rows=45933 width=8) (actual time=36.898..36.947 rows=604 loops=1)
         Sort Key: security_findings.scan_id
         Sort Method: quicksort  Memory: 53kB
         Buffers: shared hit=723
         ->  HashAggregate  (cost=4748.76..5208.09 rows=45933 width=8) (actual time=36.394..36.761 rows=604 loops=1)
               Group Key: security_findings.scan_id
               Buffers: shared hit=723
               ->  Index Only Scan using index_security_findings_on_uuid_and_scan_id on public.security_findings  (cost=0.56..4496.86 rows=100760 width=8) (actual time=0.025..15.955 rows=104547 loops=1)
                     Index Cond: (security_findings.uuid IS NULL)
                     Heap Fetches: 309
                     Buffers: shared hit=723

As seen in the query plan, the planner is choosing an already existing compound index on uuid and scan_id columns.

https://explain.depesz.com/s/asYE

https://gitlab.slack.com/archives/CLJMDRD8C/p1611260871175600

rake db:migrate:up VERSION=20210111075104
== 20210111075104 AddTemporaryIndexOnSecurityFindingsScanId: migrating ========
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:security_findings, :scan_id, {:where=>"uuid is null", :name=>"tmp_index_on_security_findings_scan_id", :algorithm=>:concurrently})
   -> 0.0052s
-- add_index(:security_findings, :scan_id, {:where=>"uuid is null", :name=>"tmp_index_on_security_findings_scan_id", :algorithm=>:concurrently})
   -> 0.0086s
== 20210111075104 AddTemporaryIndexOnSecurityFindingsScanId: migrated (0.0145s)

== 20210111075105 ScheduleUuidPopulationForSecurityFindings: migrating ========
== 20210111075105 ScheduleUuidPopulationForSecurityFindings: migrated (0.0210s)
rake db:migrate:down VERSION=20210111075104
== 20210111075104 AddTemporaryIndexOnSecurityFindingsScanId: reverting ========
-- transaction_open?()
   -> 0.0000s
-- indexes(:security_findings)
   -> 0.0049s
-- remove_index(:security_findings, {:algorithm=>:concurrently, :name=>"tmp_index_on_security_findings_scan_id"})
   -> 0.0037s
== 20210111075104 AddTemporaryIndexOnSecurityFindingsScanId: reverted (0.0094s)
rake db:migrate:up VERSION=20210111075105
== 20210111075105 ScheduleUuidPopulationForSecurityFindings: migrating ========
== 20210111075105 ScheduleUuidPopulationForSecurityFindings: migrated (0.0286s)
rake db:migrate:down VERSION=20210111075105
== 20210111075105 ScheduleUuidPopulationForSecurityFindings: reverting ========
== 20210111075105 ScheduleUuidPopulationForSecurityFindings: reverted (0.0000s)

Background migration

The background job will receive a maximum of 1K IDs of security_scans records and will load the records into memory(1). After loading the security_scan records, the background job will try to download the job artifact. If the job artifact is still available, then the job will parse the artifact and generate the uuid values for all the findings and update each security_findings records(2). Then it will try to find the related vulnerability_feedback records for the findings(3) and will set the finding_uuid attribute of those records(4). And as the very last step, it will remove the security_findings records from the database with missing uuid values(5).

1.1) Reading the `security_scans` records into memory;
SELECT
    "security_scans".*
FROM
    "security_scans"
WHERE
    "security_scans"."id" IN ($1, $2, ...)

The query plan;

Index Scan using security_scans_pkey on public.security_scans  (cost=0.43..7.40 rows=3 width=34) (actual time=5.512..5.524 rows=3 loops=1)
   Index Cond: (security_scans.id = ANY ('{1,2,3}'::bigint[]))
   Buffers: shared hit=9 read=4
   I/O Timings: read=5.413

https://explain.depesz.com/s/fe0J

https://gitlab.slack.com/archives/CLJMDRD8C/p1611261975178200

1.2) Reading the associated `ci_builds`;
SELECT
    "ci_builds".*
FROM
    "ci_builds"
WHERE
    "ci_builds"."id" IN ($1, $2, ...)

The query plan;

Index Scan using ci_builds_pkey on public.ci_builds  (cost=0.57..7.94 rows=3 width=1569) (actual time=7.886..7.903 rows=3 loops=1)
   Index Cond: (ci_builds.id = ANY ('{1,2,3}'::integer[]))
   Buffers: shared hit=11 read=5
   I/O Timings: read=7.790

https://explain.depesz.com/s/vB0o

https://gitlab.slack.com/archives/CLJMDRD8C/p1611262159180700

1.3) Reading the associated `ci_pipelines`;

SELECT
    "ci_pipelines".*
FROM
    "ci_pipelines"
WHERE
    "ci_pipelines"."id" IN ($1, $2, ...)

The query plan;

Index Scan using ci_pipelines_pkey on public.ci_pipelines  (cost=0.57..7.96 rows=3 width=316) (actual time=7.776..7.790 rows=3 loops=1)
   Index Cond: (ci_pipelines.id = ANY ('{1,2,3}'::integer[]))
   Buffers: shared hit=8 read=5
   I/O Timings: read=7.695

https://explain.depesz.com/s/owVO

https://gitlab.slack.com/archives/CLJMDRD8C/p1611262306183200

1.4) Reading the associated `ci_job_artifacts`;

SELECT
    "ci_job_artifacts".*
FROM
    "ci_job_artifacts"
WHERE
    "ci_job_artifacts"."job_id" IN ($1, $2, ...)

The query plan;

Index Scan using index_ci_job_artifacts_on_job_id_and_file_type on public.ci_job_artifacts  (cost=0.57..39.63 rows=36 width=130) (actual time=7.390..9.183 rows=3 loops=1)
   Index Cond: (ci_job_artifacts.job_id = ANY ('{1,2,3}'::integer[]))
   Buffers: shared hit=8 read=7
   I/O Timings: read=9.060

https://explain.depesz.com/s/Ls2s

https://gitlab.slack.com/archives/CLJMDRD8C/p1611262410185700

2) Updating the `security_findings` record;
UPDATE
    "security_findings"
SET
    "uuid" = $1
WHERE
    "security_findings"."scan_id" = $2
    AND "security_findings"."position" = $3

The query plan;

ModifyTable on public.security_findings  (cost=0.56..5.37 rows=3 width=96) (actual time=7.292..7.293 rows=0 loops=1)
   Buffers: shared read=4
   I/O Timings: read=7.235
   ->  Index Scan using index_security_findings_on_scan_id_and_position on public.security_findings  (cost=0.56..5.37 rows=3 width=96) (actual time=7.290..7.290 rows=0 loops=1)
         Index Cond: ((security_findings.scan_id = 0) AND (security_findings."position" = 0))
         Buffers: shared read=4
         I/O Timings: read=7.235

https://explain.depesz.com/s/WLiS1

https://gitlab.slack.com/archives/CLJMDRD8C/p1611262765192800

3) Reading the related `vulnerability_feedback` record(s);

This is done by batchloader for all findings of a scan.

SELECT
    "vulnerability_feedback".*
FROM
    "vulnerability_feedback"
WHERE
    "vulnerability_feedback"."project_id" = $1
    AND "vulnerability_feedback"."category" IN ($2, $3)
    AND "vulnerability_feedback"."project_fingerprint" IN ($4, $5, $6, $7)

The query plan;

 Index Scan using vulnerability_feedback_unique_idx on public.vulnerability_feedback  (cost=0.42..4.96 rows=1 width=135) (actual time=2.148..2.149 rows=0 loops=1)
   Index Cond: (vulnerability_feedback.project_id = 1)
   Filter: ((vulnerability_feedback.category = ANY ('{1,2}'::integer[])) AND ((vulnerability_feedback.project_fingerprint)::text = ANY ('{foo,bar}'::text[])))
   Rows Removed by Filter: 0
   Buffers: shared read=3
   I/O Timings: read=2.097

https://explain.depesz.com/s/4y7S

https://gitlab.slack.com/archives/CLJMDRD8C/p1611262944195300

4) Updating the `vulnerability_feedback` record(s);
UPDATE
    "vulnerability_feedback"
SET
    "finding_uuid" = $1
WHERE
    "vulnerability_feedback"."id" = $2

The query plan;

 ModifyTable on public.vulnerability_feedback  (cost=0.42..3.44 rows=1 width=141) (actual time=28.809..28.810 rows=0 loops=1)
   Buffers: shared hit=21 read=32 dirtied=11
   I/O Timings: read=28.161
   ->  Index Scan using vulnerability_feedback_pkey on public.vulnerability_feedback  (cost=0.42..3.44 rows=1 width=141) (actual time=5.068..5.071 rows=1 loops=1)
         Index Cond: (vulnerability_feedback.id = 1)
         Buffers: shared read=4
         I/O Timings: read=4.999

https://explain.depesz.com/s/c8pT

https://gitlab.slack.com/archives/CLJMDRD8C/p1611263035197800

5) Deleting the `security_findings` record(s);

This is done in batch.

DELETE FROM "security_findings"
WHERE "security_findings"."scan_id" = $1
    AND "security_findings"."uuid" IS NULL
    AND "security_findings"."id" >= $2

The query plan;

ModifyTable on public.security_findings  (cost=0.56..3.58 rows=1 width=6) (actual time=0.036..0.037 rows=0 loops=1)
   Buffers: shared hit=7
   ->  Index Scan using index_security_findings_on_uuid_and_scan_id on public.security_findings  (cost=0.56..3.58 rows=1 width=6) (actual time=0.034..0.034 rows=0 loops=1)
         Index Cond: ((security_findings.uuid IS NULL) AND (security_findings.scan_id = 1))
         Filter: (security_findings.id >= 680)
         Rows Removed by Filter: 0
         Buffers: shared hit=7

https://explain.depesz.com/s/6kAd

https://gitlab.slack.com/archives/CLJMDRD8C/p1611262598188900

Timing forecast

The scheduler will schedule 105 background jobs each of which will run with 2 minutes of gap consecutively which in total makes 105 * 2 = 210 minutes.

Related to #277327 (closed)

Screenshots (strongly suggested)

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • [-] Label as security and @ mention @gitlab-com/gl-security/appsec
  • [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • [-] Security reports checked/validated by a reviewer from the AppSec team
Edited by Mehmet Emin INAC

Merge request reports