
Implement new ingestion flow for the security reports

What does this MR do?

This MR introduces a new ingestion flow for the security reports. The previous service classes had grown hard to maintain and people were hesitant to make changes in them, so we implemented this new ingestion flow instead.

Important Note: These changes won't replace or cover all the features provided by the StoreReportService. There are still some missing features and functionalities, but since this MR is already quite big, we want to merge it and address the missing parts in follow-up MRs. The missing parts are:

  1. Ingestion of the vulnerability flags
  2. Application-layer validations (we have to disable some validations and use bulk_insert_safe).
  3. The new ingestion flow depends on the UUID values of findings being correct, which may not yet be the case for all projects, but we already have a running migration that fixes the UUID values (/cc @subashis).

Database Review

This new ingestion service heavily depends on PostgreSQL's UPSERT functionality (INSERT ... ON CONFLICT) and also uses bulk insert and bulk update methods. Here is the list of SQL queries generated by the ingestion services:

Before ingestion starts
Loading the security scans

database review: looks good - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22791

/*application:web,line:/ee/app/services/security/ingestion/ingest_reports_service.rb:29:in `flat_map'*/
SELECT
    "security_scans".*
FROM
    "security_scans"
WHERE
    "security_scans"."pipeline_id" = 704
    AND (jsonb_array_length(COALESCE(info -> 'errors', '[]'::jsonb)) = 0)
    AND "security_scans"."latest" = TRUE
Loading the security findings

database review: looking good - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22792

/*application:web,line:/ee/app/services/security/ingestion/finding_map_collection.rb:17:in `each'*/
SELECT
    "security_findings".*
FROM
    "security_findings"
WHERE
    "security_findings"."scan_id" = 893
    AND "security_findings"."deduplicated" = TRUE
Loading the job artifacts

database review: seems okay - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22793 and https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22794

/*application:web,line:/ee/app/models/security/scan.rb:79:in `job_artifact'*/
SELECT
    "ci_builds".*
FROM
    "ci_builds"
WHERE
    "ci_builds"."type" = 'Ci::Build'
    AND "ci_builds"."id" = 3133
LIMIT 1
/*application:web,line:/ee/app/models/security/scan.rb:79:in `job_artifact'*/
SELECT
    "ci_job_artifacts".*
FROM
    "ci_job_artifacts"
WHERE
    "ci_job_artifacts"."job_id" = 3133
    AND "ci_job_artifacts"."file_type" = 6
LIMIT 1
Actual ingestion process
Ingesting the vulnerability identifiers

This query creates records in the vulnerability_identifiers table. At some point within the ingestion flow, we associate these records with findings by creating records in a join table (see "Associating the findings with identifiers").

database review: looking good - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22795

/*application:web,line:/ee/app/services/security/ingestion/bulk_insertable_task.rb:53:in `block in result_set'*/
INSERT INTO "vulnerability_identifiers" ("external_id", "external_type", "fingerprint", "name", "url", "project_id", "created_at", "updated_at")
    VALUES ('37283ed4-0380-40d7-ada7-2d994afcc62a', 'gemnasium', '\x59f480b0e4ac2114b22ef406635d1b7143fa61f2', 'Gemnasium-37283ed4-0380-40d7-ada7-2d994afcc62a', 'https://deps.sec.gitlab.com/packages/npm/debug/versions/1.0.5/advisories', 51, '2021-09-10 17:12:48.053703', '2021-09-10 17:12:48.053721'), ('9952e574-7b5b-46fa-a270-aeb694198a98', 'gemnasium', '\xb4ecee7f3b6e9dc8f92d902c99d39bd449a9e0bc', 'Gemnasium-9952e574-7b5b-46fa-a270-aeb694198a98', 'https://deps.sec.gitlab.com/packages/npm/saml2-js/versions/1.5.0/advisories', 51, '2021-09-10 17:12:48.053703', '2021-09-10 17:12:48.053721'), ('CVE-2017-11429', 'cve', '\x38a94219aa20b920a6379e3e6ed466d694077ef5', 'CVE-2017-11429', 'https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-11429', 51, '2021-09-10 17:12:48.053703', '2021-09-10 17:12:48.053721')
ON CONFLICT ("project_id", "fingerprint")
    DO UPDATE SET
        "external_id" = excluded."external_id", "external_type" = excluded."external_type", "name" = excluded."name", "url" = excluded."url", "created_at" = excluded."created_at", "updated_at" = excluded."updated_at"
    RETURNING
        "fingerprint", "id"
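The RETURNING ("fingerprint", "id") rows can then be folded into a lookup so each finding can later be linked to its identifier records. A minimal sketch of that idea (variable names and values here are illustrative, not the actual service code):

```ruby
# Hypothetical (fingerprint, id) pairs as returned by the UPSERT above.
returned_rows = [
  ["fp-debug", 5558], # fingerprint => vulnerability_identifiers.id
  ["fp-saml2", 5559]
]

# An array of two-element arrays converts directly into a lookup hash.
identifier_ids = returned_rows.to_h

identifier_ids["fp-debug"] # => 5558
```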
Ingesting the findings (vulnerability_occurrences)

This query creates records in the vulnerability_occurrences table (a.k.a. findings). It uses the unique uuid column to perform the UPSERT.

The query returns the id and vulnerability_id of the rows. We use the vulnerability_id to determine whether the finding is a new record. Based on this value, we either create or update records in the next step.

database review: OK - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22797

/*application:web,line:/ee/app/services/security/ingestion/bulk_insertable_task.rb:53:in `block in result_set'*/
INSERT INTO "vulnerability_occurrences" ("confidence", "metadata_version", "name", "raw_metadata", "report_type", "severity", "details", "description", "message", "solution", "cve", "location", "primary_identifier_id", "location_fingerprint", "project_fingerprint", "uuid", "scanner_id", "project_id", "created_at", "updated_at")
    VALUES (0, 'manual', 'elo', '{"user": {"name": "sally"}}', 99, 1, '{"user": {"name": "sally"}}', 'asd', 'asd', 'asd', 'asdas', '{"user": {"name": "sally"}}', 1892049, 'asdasdasdaddsadasdsa', 'asdasdasdsdasd', '569760c5-1373-5be3-b066-04ccb6c0a0aa', 37054, 27633223, '2021-09-10 17:12:48.053703', '2021-09-10 17:12:48.053721')
ON CONFLICT ("uuid")
    DO UPDATE SET
        "confidence" = excluded."confidence", "metadata_version" = excluded."metadata_version", "name" = excluded."name", "raw_metadata" = excluded."raw_metadata", "report_type" = excluded."report_type", "severity" = excluded."severity", "details" = excluded."details", "description" = excluded."description", "message" = excluded."message", "solution" = excluded."solution", "cve" = excluded."cve", "location" = excluded."location", "primary_identifier_id" = excluded."primary_identifier_id", "location_fingerprint" = excluded."location_fingerprint", "project_fingerprint" = excluded."project_fingerprint", "scanner_id" = excluded."scanner_id", "project_id" = excluded."project_id", "updated_at" = excluded."updated_at"
    RETURNING
        "id", "vulnerability_id"
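A sketch of how the RETURNING ("id", "vulnerability_id") values drive the next steps: rows with a NULL vulnerability_id are brand-new findings that need a vulnerability record created, while the rest update their existing vulnerability. The struct and values below are illustrative, not the actual service code:

```ruby
# Stand-in for a row returned by the findings UPSERT.
ReturnedFinding = Struct.new(:id, :vulnerability_id)

rows = [
  ReturnedFinding.new(25033, nil),   # new finding, no vulnerability yet
  ReturnedFinding.new(25034, 20789)  # already linked to vulnerability 20789
]

# Split on the NULL vulnerability_id to decide create vs. update.
new_findings, existing_findings = rows.partition { |row| row.vulnerability_id.nil? }

new_findings.map(&:id)      # => [25033] ("Creating the new vulnerabilities")
existing_findings.map(&:id) # => [25034] ("Updating the existing vulnerabilities")
```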
Creating the new vulnerabilities

If the previous query returns NULL vulnerability_id values, it means we've created new findings that need new vulnerability records in the database, and this query creates those records. We skip this query if there are no new records.

database review: looks okay - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22798

/*application:web,line:/ee/app/services/security/ingestion/bulk_insertable_task.rb:53:in `block in result_set'*/
INSERT INTO "vulnerabilities" ("author_id", "project_id", "title", "state", "severity", "confidence", "report_type", "dismissed_at", "dismissed_by_id", "updated_at", "created_at")
    VALUES (329579, 27633223, 'Regular Expression Denial of Service in debug', 1, 2, 2, 1, NULL, NULL, '2021-09-13 13:45:07.502056', '2021-09-13 13:45:07.502072'), (329579, 27633223, 'Authentication bypass via incorrect DOM traversal and canonicalization in saml2-js', 1, 2, 2, 1, NULL, NULL, '2021-09-13 13:45:07.502085', '2021-09-13 13:45:07.502089')
ON CONFLICT
    DO NOTHING
RETURNING
    id
Updating the existing vulnerabilities

database review: looking good - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22799

/*application:web,line:/ee/app/services/security/ingestion/bulk_updatable_task.rb:41:in `execute'*/
UPDATE
    vulnerabilities
SET
    id = map.id::bigint,
    title = map.title::character varying(255),
    severity = map.severity::smallint,
    confidence = map.confidence::smallint,
    updated_at = map.updated_at::timestamp WITH time zone
FROM (
    VALUES (20788, 'Regular Expression Denial of Service in debug', 2, 2, '2021-09-10 17:12:48.118919'),
        (20789, 'Authentication bypass via incorrect DOM traversal and canonicalization in saml2-js', 2, 2, '2021-09-10 17:12:48.118935')) AS map (id, title, severity, confidence, updated_at)
WHERE
    vulnerabilities.id = map.id
Inserting the finding_pipeline records

Associates the findings with pipelines by creating records in a join table called vulnerability_occurrence_pipelines.

database review: looking good - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22800

/*application:web,line:/ee/app/services/security/ingestion/bulk_insertable_task.rb:53:in `block in result_set'*/
INSERT INTO "vulnerability_occurrence_pipelines" ("pipeline_id", "occurrence_id", "updated_at", "created_at")
    VALUES (704, 25033, '2021-09-10 17:12:48.141864', '2021-09-10 17:12:48.141878'), (704, 25034, '2021-09-10 17:12:48.141884', '2021-09-10 17:12:48.141887')
ON CONFLICT
    DO NOTHING
RETURNING
    "id"
Associating the findings with identifiers

Associates the findings with the previously upserted identifiers by creating records in the vulnerability_occurrence_identifiers join table.

database review: looking good - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22801

/*application:web,line:/ee/app/services/security/ingestion/bulk_insertable_task.rb:53:in `block in result_set'*/
INSERT INTO "vulnerability_occurrence_identifiers" ("occurrence_id", "identifier_id", "updated_at", "created_at")
    VALUES (25033, 5558, '2021-09-10 17:12:48.156374', '2021-09-10 17:12:48.156389'), (25034, 5559, '2021-09-10 17:12:48.156395', '2021-09-10 17:12:48.156398'), (25034, 5560, '2021-09-10 17:12:48.156401', '2021-09-10 17:12:48.156403')
ON CONFLICT ("occurrence_id", "identifier_id")
    DO UPDATE SET
        "updated_at" = excluded."updated_at"
    RETURNING
        "id"
Loading the existing finding links

database review: OK - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22802

/*application:web,line:/ee/app/services/security/ingestion/tasks/ingest_finding_links.rb:37:in `group_by'*/
SELECT
    "vulnerability_finding_links".*
FROM
    "vulnerability_finding_links"
WHERE
    "vulnerability_finding_links"."vulnerability_occurrence_id" IN (25033, 25034)
Creating the new finding links

database review: looks good - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22803

/*application:web,line:/ee/app/services/security/ingestion/bulk_insertable_task.rb:53:in `block in result_set'*/
INSERT INTO "vulnerability_finding_links" ("vulnerability_occurrence_id", "name", "url", "updated_at", "created_at")
    VALUES (25033, NULL, 'https://nodesecurity.io/advisories/534', '2021-09-10 17:12:48.179365', '2021-09-10 17:12:48.179380'), (25033, NULL, 'https://github.com/visionmedia/debug/issues/501', '2021-09-10 17:12:48.179385', '2021-09-10 17:12:48.179388'), (25033, NULL, 'https://github.com/visionmedia/debug/pull/504', '2021-09-10 17:12:48.179391', '2021-09-10 17:12:48.179393'), (25034, NULL, 'https://github.com/Clever/saml2/commit/3546cb61fd541f219abda364c5b919633609ef3d#diff-af730f9f738de1c9ad87596df3f6de84R279', '2021-09-10 17:12:48.179401', '2021-09-10 17:12:48.179403'), (25034, NULL, 'https://github.com/Clever/saml2/issues/127', '2021-09-10 17:12:48.179406', '2021-09-10 17:12:48.179409'), (25034, NULL, 'https://www.kb.cert.org/vuls/id/475445', '2021-09-10 17:12:48.179412', '2021-09-10 17:12:48.179414')
ON CONFLICT
    DO NOTHING
RETURNING
    "id"
Loading the existing remediations for the project with particular checksums

The checksum column holds the SHA1 hash of the remediation file content. We load the existing records into memory to filter out the remediations that already exist, and create records only for the new ones.

database review: looks okay - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22804

/*application:web,line:/ee/app/services/security/ingestion/tasks/ingest_remediations.rb:63:in `existing_remediations'*/
SELECT
    "vulnerability_remediations".*
FROM
    "vulnerability_remediations"
WHERE
    "vulnerability_remediations"."project_id" = 51
    AND "vulnerability_remediations"."checksum" = '\xf00bc6261fa512f0960b7fc3bfcce7fb31997cf32b96fa647bed5668b2c77fee'
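The filtering step can be sketched as follows, assuming the checksum is the SHA1 of the diff content as described above (all variable names and diff contents here are illustrative):

```ruby
require 'digest'

# Incoming remediation diffs from the report.
incoming_diffs = [
  "--- a/package.json\n+++ b/package.json\n",
  "--- a/yarn.lock\n+++ b/yarn.lock\n"
]

# Checksums already present in vulnerability_remediations for this project;
# here we pretend the first diff was ingested in an earlier pipeline.
existing_checksums = [Digest::SHA1.hexdigest(incoming_diffs.first)]

# Keep only the diffs whose checksum is not in the database yet.
new_diffs = incoming_diffs.reject do |diff|
  existing_checksums.include?(Digest::SHA1.hexdigest(diff))
end

new_diffs.size # => 1, only the yarn.lock remediation is new
```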
Creating the remediations

Vulnerabilities::Remediation is an ActiveRecord model with a file attachment. The actual remediation content (a git diff) is stored in a file attached to the record, as the content can be huge. For this reason, remediations are ingested one by one so that the ActiveRecord callbacks run, uploading the file and so on.

database review: looks okay

/*application:web,line:/lib/gitlab/database.rb:244:in `block in transaction'*/
INSERT INTO "vulnerability_remediations" ("created_at", "updated_at", "summary", "file", "checksum", "project_id")
    VALUES ('2021-09-13 14:25:48.300322', '2021-09-13 14:25:48.300322', 'Upgrade saml2-js', 'f00bc6261fa512f0960b7fc3bfcce7fb31997cf32b96fa647bed5668b2c77fee.diff', '\xf00bc6261fa512f0960b7fc3bfcce7fb31997cf32b96fa647bed5668b2c77fee', 27633223)
RETURNING
    "id"
/*application:web,line:/app/uploaders/records_uploads.rb:30:in `readd_upload'*/
DELETE FROM "uploads"
WHERE "uploads"."id" IN (
        SELECT
            "uploads"."id"
        FROM
            "uploads"
        WHERE
            "uploads"."uploader" = 'AttachmentUploader'
            AND "uploads"."model_type" = 'Vulnerabilities::Remediation'
            AND "uploads"."model_id" = 137
            AND "uploads"."path" = 'uploads/-/system/vulnerabilities/remediation/file/137/f00bc6261fa512f0960b7fc3bfcce7fb31997cf32b96fa647bed5668b2c77fee.diff'
        ORDER BY
            "uploads"."id" DESC)
/*application:web,line:/app/uploaders/records_uploads.rb:33:in `tap'*/
INSERT INTO "uploads" ("size", "path", "checksum", "model_id", "model_type", "uploader", "created_at", "mount_point")
    VALUES (9748, 'uploads/-/system/vulnerabilities/remediation/file/137/f00bc6261fa512f0960b7fc3bfcce7fb31997cf32b96fa647bed5668b2c77fee.diff', 'f00bc6261fa512f0960b7fc3bfcce7fb31997cf32b96fa647bed5668b2c77fee', 137, 'Vulnerabilities::Remediation', 'AttachmentUploader', '2021-09-13 14:25:48.307406', 'file')
RETURNING
    "id"
/*application:web,line:/app/models/concerns/file_store_mounter.rb:19:in `update_file_store'*/
UPDATE
    "vulnerability_remediations"
SET
    "file_store" = 1
WHERE
    "vulnerability_remediations"."id" = 137
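An illustrative stand-in (not the real model) showing why remediations are ingested one by one: each save needs its callbacks to fire in order, mirroring the INSERT, uploads bookkeeping, and file_store UPDATE sequence in the queries above. Bulk inserts would skip these callbacks entirely.

```ruby
require 'digest'

# Fake model that simulates the per-record callback chain.
class FakeRemediation
  attr_reader :checksum, :events

  def initialize(diff)
    @events = []
    @checksum = Digest::SHA1.hexdigest(diff)
    insert_record     # INSERT INTO vulnerability_remediations
    record_upload     # callback: DELETE + INSERT on uploads
    update_file_store # callback: UPDATE ... SET file_store = 1
  end

  def insert_record
    @events << :insert_remediation
  end

  def record_upload
    @events << :record_upload
  end

  def update_file_store
    @events << :update_file_store
  end
end

remediation = FakeRemediation.new("--- a/package.json\n")
remediation.events # => [:insert_remediation, :record_upload, :update_file_store]
```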
Associating the findings with remediations

With this query, we associate the findings with the previously created remediations.

database review: looks okay - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22809

/*application:web,line:/ee/app/services/security/ingestion/bulk_insertable_task.rb:53:in `block in result_set'*/
INSERT INTO "vulnerability_findings_remediations" ("vulnerability_occurrence_id", "vulnerability_remediation_id", "updated_at", "created_at")
    VALUES (6814415, 682, '2021-09-10 17:12:48.225164', '2021-09-10 17:12:48.225181'), (6897629, 818, '2021-09-10 17:12:48.225194', '2021-09-10 17:12:48.225198')
ON CONFLICT ("vulnerability_occurrence_id", "vulnerability_remediation_id")
    DO UPDATE SET
        "updated_at" = excluded."updated_at"
    RETURNING
        id
Removing the orphan finding remediation associations

database review: looks okay - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22810

/*application:web,line:/ee/app/services/security/ingestion/tasks/ingest_remediations.rb:22:in `after_ingest'*/
DELETE FROM "vulnerability_findings_remediations"
WHERE "vulnerability_findings_remediations"."vulnerability_occurrence_id" IN (25033, 25034)
    AND "vulnerability_findings_remediations"."id" NOT IN (51984, 51985)

Backend Review

Design explanation

In this MR, I am introducing 3 new sets of classes:

  1. Utility classes to use during the ingestion:

    • FindingMap: Encapsulates the data being used during the ingestion and supports some operations to manipulate its state. Even though it looks like a magic container (an anti-pattern), it is an argument object class as it provides methods to interact with its state.
    • FindingMapCollection: This has two responsibilities:
      • Acts as a factory class for the FindingMap objects
      • Acts as a collection of FindingMap objects by implementing the Enumerable interface
  2. The service classes, in dependency order:

    • IngestReportsService: Initiates the ingestion process and updates some metadata like has_vulnerabilities, latest_pipeline_id, etc.
    • IngestReportService: Responsible for slicing the findings to ingest for the given security_scan.
    • IngestReportSliceService: Runs the series of tasks for the given collection of finding_map objects. This is introduced to run the ingestion in transactional batches to prevent leaving the database in an inconsistent state. I didn't want to run everything in one transaction, as long-running transactions can put a heavy burden on the database.
  3. The task classes and the IngestReportSliceService class are implemented to achieve an easy-to-extend design (pipeline-filter pattern). With this design, when we want to implement a new feature, we can just introduce a new task class and encapsulate the logic in it without touching any other place (FindingMap may need to change, though). Here is the list of task classes implemented in this MR:

    • IngestIdentifiers
    • IngestFindings
    • IngestVulnerabilities
    • AttachFindingsToVulnerabilities
    • IngestFindingPipelines
    • IngestFindingIdentifiers
    • IngestFindingLinks
    • IngestFindingSignatures
    • IngestRemediations
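The pipeline-filter idea above can be sketched as follows: every task exposes the same `execute` interface over the shared finding maps, so adding a feature means appending a new task class to the list. The class and method bodies here are illustrative placeholders, not the real implementations:

```ruby
# Shared base class giving every task the same entry point.
class AbstractTask
  def self.execute(pipeline, finding_maps)
    new(pipeline, finding_maps).execute
  end

  def initialize(pipeline, finding_maps)
    @pipeline = pipeline
    @finding_maps = finding_maps
  end
end

class IngestIdentifiers < AbstractTask
  def execute
    # Placeholder: the real task upserts identifiers and records their IDs.
    @finding_maps.each { |map| map[:identifier_ids] = [5558] }
  end
end

class IngestFindings < AbstractTask
  def execute
    # Placeholder: the real task upserts findings and records their IDs.
    @finding_maps.each_with_index { |map, index| map[:finding_id] = 25_033 + index }
  end
end

# Adding a feature = appending another task class here.
TASKS = [IngestIdentifiers, IngestFindings].freeze

finding_maps = [{}, {}]
TASKS.each { |task| task.execute(:pipeline, finding_maps) }

finding_maps.first[:finding_id] # => 25033
```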
Benchmark between the new flow and the old one

The benchmark script can be found here: https://gitlab.com/-/snippets/2165653.

The following results are collected by running the benchmark for a pipeline of 2 security scans with 1002 findings.

#########################
Results for "ingestion"
NUMBER OF QUERIES: 282
DURATION: 438.042000
#########################
#########################
Results for "store"
NUMBER OF QUERIES: 10101
DURATION: 3476.353012
#########################
Calculating -------------------------------------
           ingestion     1.000  i/100ms
               store     1.000  i/100ms
-------------------------------------------------
           ingestion      0.526  (± 0.0%) i/s -      3.000  in   5.732515s
               store      0.034  (± 0.0%) i/s -      1.000  in  29.321573s

Comparison:
           ingestion:        0.5 i/s
               store:        0.0 i/s - 15.41x slower
Comparison between the new flow and the old one

From a high level of abstraction, there is no difference between the new ingestion flow and the old one:

[Diagram: general_view_pipeline]

The biggest differences are in how we organize the responsibilities, transactional ingestion, and error handling:

[Diagram: pipeline_details__1_]

Related to #276498 (closed)

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

Does this MR contain changes to processing or storing of credentials or tokens, authorization and authentication methods or other items described in the security review guidelines? If not, then delete this Security section.

  • [-] Label as security and @ mention @gitlab-com/gl-security/appsec
  • [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • [-] Security reports checked/validated by a reviewer from the AppSec team
Edited by Michał Zając
