Skip to content

Populate `resolved_on_default_branch` column for existing vulnerabilities

What does this MR do?

This MR introduces a post-migration task to schedule a data migration job which will populate the resolved_on_default_branch column of existing vulnerabilities.

Migraration up

== 20200806100713 SchedulePopulateResolvedOnDefaultBranchColumn: migrating ====
-- table_exists?(:projects)
   -> 0.0007s
-- table_exists?(:vulnerabilities)
   -> 0.0005s
== 20200806100713 SchedulePopulateResolvedOnDefaultBranchColumn: migrated (0.1239s)

Migraration down

== 20200806100713 SchedulePopulateResolvedOnDefaultBranchColumn: reverting ====
== 20200806100713 SchedulePopulateResolvedOnDefaultBranchColumn: reverted (0.0000s)

Database Queries

Scheduler post-migration task

The scheduler post-migration task schedules a data migration task for every 100 projects by using the following query multiple times;

SELECT DISTINCT
    "vulnerabilities"."project_id"
FROM
    "vulnerabilities"
WHERE
    "vulnerabilities"."project_id" >= 1
ORDER BY
    "vulnerabilities"."project_id" ASC
LIMIT 1 OFFSET 100
 Limit  (cost=1480.47..1495.27 rows=1 width=8) (actual time=957.839..957.839 rows=1 loops=1)
   Buffers: shared hit=637 read=971
   I/O Timings: read=917.190
   ->  Unique  (cost=0.43..79123.27 rows=5346 width=8) (actual time=0.056..957.820 rows=101 loops=1)
         Buffers: shared hit=637 read=971
         I/O Timings: read=917.190
         ->  Index Only Scan using index_vulnerabilities_on_project_id on public.vulnerabilities  (cost=0.43..72160.16 rows=2785242 width=8) (actual time=0.054..942.893 rows=134338 loops=1)
               Index Cond: (vulnerabilities.project_id >= 1)
               Heap Fetches: 0
               Buffers: shared hit=637 read=971
               I/O Timings: read=917.190

https://gitlab.slack.com/archives/CLJMDRD8C/p1597843572499800

Background task

Background task takes project_ids argument as an array of integers and runs the migration for each of them with the following queries;

Load project
SELECT
    "projects".*
FROM
    "projects"
WHERE
    "projects"."id" = $project_id
LIMIT 1
 Limit  (cost=0.43..3.45 rows=1 width=723) (actual time=11.330..11.330 rows=1 loops=1)
   Buffers: shared read=4
   I/O Timings: read=11.218
   ->  Index Scan using projects_pkey on public.projects  (cost=0.43..3.45 rows=1 width=723) (actual time=11.327..11.327 rows=1 loops=1)
         Index Cond: (projects.id = 278964)
         Buffers: shared read=4
         I/O Timings: read=11.218

https://gitlab.slack.com/archives/CLJMDRD8C/p1597677861201100

Load Route

This task also loads the route entry for each project as it has some crucial information for project entity like so;

SELECT
    "routes".*
FROM
    "routes"
WHERE
    "routes"."source_id" = $project_id
    AND "routes"."source_type" = 'Project'
LIMIT 1
 Limit  (cost=0.56..3.58 rows=1 width=80) (actual time=13.741..13.742 rows=1 loops=1)
   Buffers: shared read=5
   I/O Timings: read=13.681
   ->  Index Scan using index_routes_on_source_type_and_source_id on public.routes  (cost=0.56..3.58 rows=1 width=80) (actual time=13.740..13.740 rows=1 loops=1)
         Index Cond: (((routes.source_type)::text = 'Project'::text) AND (routes.source_id = 278964))
         Buffers: shared read=5
         I/O Timings: read=13.681

https://gitlab.slack.com/archives/CLJMDRD8C/p1597677948203500

Load latest successful pipeline with security reports

After loading the project & route entries, the task tries to load the latest successful pipeline ID with security reports to load the artifact entry later based on.

SELECT
    "ci_pipelines"."id"
FROM
    "ci_pipelines"
WHERE ci_pipelines.project_id = 278964
AND ci_pipelines.ref = 'master'
AND ci_pipelines.status IN ('success')
AND (EXISTS (
        SELECT
            1
        FROM
            "ci_builds"
        WHERE
            "ci_builds"."type" = 'Ci::Build'
            AND ("ci_builds"."retried" IS FALSE
                OR "ci_builds"."retried" IS NULL)
            AND (EXISTS (
                    SELECT
                        1
                    FROM
                        "ci_job_artifacts"
                    WHERE (ci_builds.id = ci_job_artifacts.job_id)
                    AND "ci_job_artifacts"."file_type" IN (5, 6, 7, 8, 21, 23)))
            AND (ci_pipelines.id = ci_builds.commit_id)))
ORDER BY
    "ci_pipelines"."id" DESC
LIMIT 1
Limit  (cost=1.84..23444.68 rows=1 width=4) (actual time=0.927..0.927 rows=1 loops=1)
   Buffers: shared hit=959
   ->  Nested Loop Semi Join  (cost=1.84..25599582.66 rows=1092 width=4) (actual time=0.926..0.926 rows=1 loops=1)
         Buffers: shared hit=959
         ->  Index Only Scan using index_ci_pipelines_on_project_id_and_ref_and_status_and_id on public.ci_pipelines  (cost=0.57..2948.65 rows=79343 width=4) (actual time=0.040..0.040 rows=1 loops=1)
               Index Cond: ((ci_pipelines.project_id = 278964) AND (ci_pipelines.ref = 'master'::text) AND (ci_pipelines.status = 'success'::text))
               Heap Fetches: 1
               Buffers: shared hit=6
         ->  Nested Loop Semi Join  (cost=1.27..322.60 rows=1 width=4) (actual time=0.884..0.884 rows=1 loops=1)
               Buffers: shared hit=953
               ->  Index Scan using index_ci_builds_on_commit_id_and_status_and_type on public.ci_builds  (cost=0.70..102.17 rows=113 width=8) (actual time=0.016..0.215 rows=121 loops=1)
                     Index Cond: ((ci_builds.commit_id = ci_pipelines.id) AND ((ci_builds.type)::text = 'Ci::Build'::text))
                     Filter: ((ci_builds.retried IS FALSE) OR (ci_builds.retried IS NULL))
                     Rows Removed by Filter: 0
                     Buffers: shared hit=129
               ->  Index Only Scan using index_ci_job_artifacts_on_job_id_and_file_type on public.ci_job_artifacts  (cost=0.57..1.94 rows=1 width=4) (actual time=0.005..0.005 rows=0 loops=121)
                     Index Cond: (ci_job_artifacts.job_id = ci_builds.id)
                     Heap Fetches: 427
                     Filter: (ci_job_artifacts.file_type = ANY ('{5,6,7,8,21,23}'::integer[]))
                     Rows Removed by Filter: 4
                     Buffers: shared hit=824

https://gitlab.slack.com/archives/CLJMDRD8C/p1597933858070400

Load JobArtifact based on latest successful pipeline ID

After loading the latest successful pipeline ID from the database, the task tries to load the job artifacts based on the pipeline ID for security report files(5, 6, 7, 8, 21, 23).

SELECT
    "ci_job_artifacts".*
FROM
    "ci_job_artifacts"
    INNER JOIN "ci_builds" ON "ci_job_artifacts"."job_id" = "ci_builds"."id"
        AND "ci_builds"."commit_id" = $commit_id
        AND "ci_builds"."type" = 'Ci::Build'
        AND ("ci_builds"."retried" IS FALSE
            OR "ci_builds"."retried" IS NULL)
WHERE
    "ci_job_artifacts"."file_type" IN (5, 6, 7, 8, 21, 23)
 Nested Loop  (cost=1.27..1217.33 rows=1 width=131) (actual time=3.510..3.510 rows=0 loops=1)
   Buffers: shared hit=3 read=2
   I/O Timings: read=3.480
   ->  Index Scan using index_ci_builds_on_commit_id_and_status_and_type on public.ci_builds  (cost=0.70..129.25 rows=115 width=4) (actual time=3.509..3.509 rows=0 loops=1)
         Index Cond: ((ci_builds.commit_id = 177652417) AND ((ci_builds.type)::text = 'Ci::Build'::text))
         Filter: ((ci_builds.retried IS FALSE) OR (ci_builds.retried IS NULL))
         Rows Removed by Filter: 0
         Buffers: shared hit=3 read=2
         I/O Timings: read=3.480
   ->  Index Scan using index_ci_job_artifacts_on_job_id_and_file_type on public.ci_job_artifacts  (cost=0.57..9.45 rows=1 width=131) (actual time=0.000..0.000 rows=0 loops=0)
         Index Cond: (ci_job_artifacts.job_id = ci_builds.id)
         Filter: (ci_job_artifacts.file_type = ANY ('{5,6,7,8,21,23}'::integer[]))
         Rows Removed by Filter: 0

https://gitlab.slack.com/archives/CLJMDRD8C/p1597360148010400

Load vulnerability scanners for project
SELECT
    "vulnerability_scanners".*
FROM
    "vulnerability_scanners"
WHERE
    "vulnerability_scanners"."project_id" = $project_id
    AND "vulnerability_scanners"."external_id" IN ($external_id_1, $external_id_2)
Index Scan using index_vulnerability_scanners_on_project_id_and_external_id on public.vulnerability_scanners  (cost=0.29..5.10 rows=1 width=54) (actual time=2.118..2.118 rows=0 loops=1)
   Index Cond: ((vulnerability_scanners.project_id = 278964) AND ((vulnerability_scanners.external_id)::text = ANY ('{external_id_1,external_id_2}'::text[])))
   Buffers: shared hit=5 read=2
   I/O Timings: read=1.969

https://gitlab.slack.com/archives/CLJMDRD8C/p1597678042205900

Load vulnerability identifiers for project
SELECT
    "vulnerability_identifiers".*
FROM
    "vulnerability_identifiers"
WHERE
    "vulnerability_identifiers"."project_id" = $project_id
    AND "vulnerability_identifiers"."fingerprint" IN ($fingerprint_1, $fingerprint_2)
Index Scan using index_vulnerability_identifiers_on_project_id_and_fingerprint on public.vulnerability_identifiers  (cost=0.42..5.38 rows=1 width=149) (actual time=2.195..2.195 rows=0 loops=1)
   Index Cond: ((vulnerability_identifiers.project_id = 278964) AND (vulnerability_identifiers.fingerprint = ANY ('{"\\x66696e6765727072696e745f31","\\x66696e6765727072696e745f32"}'::bytea[])))
   Buffers: shared hit=6 read=3
   I/O Timings: read=2.112

https://gitlab.slack.com/archives/CLJMDRD8C/p1597678127208300

Load findings from database
SELECT
    "vulnerability_occurrences".*
FROM
    "vulnerability_occurrences"
WHERE
    "vulnerability_occurrences"."project_id" = $project_id
    AND "vulnerability_occurrences"."scanner_id" = $scanner_id
    AND "vulnerability_occurrences"."primary_identifier_id" = $identifier_id
    AND "vulnerability_occurrences"."location_fingerprint" = $location_fingerprint
LIMIT 1
 Limit  (cost=0.56..3.58 rows=1 width=1113) (actual time=8.433..8.434 rows=0 loops=1)
   Buffers: shared hit=3 read=4
   I/O Timings: read=8.373
   ->  Index Scan using index_vulnerability_occurrences_on_unique_keys on public.vulnerability_occurrences  (cost=0.56..3.58 rows=1 width=1113) (actual time=8.432..8.432 rows=0 loops=1)
         Index Cond: ((vulnerability_occurrences.project_id = 278964) AND (vulnerability_occurrences.primary_identifier_id = 1) AND (vulnerability_occurrences.location_fingerprint = '\x6c6f636174696f6e5f66696e6765727072696e74'::bytea) AND (vulnerability_occurrences.scanner_id = 1))
         Buffers: shared hit=3 read=4
         I/O Timings: read=8.373

https://gitlab.slack.com/archives/CLJMDRD8C/p1597678191210700

Update disappeared vulnerabilities
UPDATE
    "vulnerabilities"
SET
    "resolved_on_default_branch" = true
WHERE
    "vulnerabilities"."project_id" = $project_id
    AND "vulnerabilities"."id" NOT IN ($vulnerability_id_1, $vulnerability_id_2)
 ModifyTable on public.vulnerabilities  (cost=0.43..140.02 rows=93 width=295) (actual time=0.084..0.084 rows=0 loops=1)
   Buffers: shared hit=3
   ->  Index Scan using index_vulnerabilities_on_project_id on public.vulnerabilities  (cost=0.43..140.02 rows=93 width=295) (actual time=0.063..0.063 rows=0 loops=1)
         Index Cond: (vulnerabilities.project_id = 1)
         Filter: (vulnerabilities.id <> 1)
         Rows Removed by Filter: 0
         Buffers: shared hit=3

https://gitlab.slack.com/archives/CLJMDRD8C/p1597844859016800

Expected timing

explain select distinct project_id from vulnerabilities

6590 project ids.

(6590 / 100) * 5 minutes ~= 330 minutes to finish the migration for all projects.

Related to #227114 (closed)

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • [-] Label as security and @ mention @gitlab-com/gl-security/appsec
  • [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • [-] Security reports checked/validated by a reviewer from the AppSec team
Edited by Mehmet Emin INAC

Merge request reports