Implement new ingestion flow for the security reports
What does this MR do?
This MR introduces a new ingestion flow for the security reports. The previous service classes were out of control and people were hesitant to make changes in those so we wanted to implement this new ingestion flow.
Important Note: These changes won't replace or cover all the features provided by the StoreReportService
. There are still some missing features and functionalities but since this MR is already quite huge, we want to merge this and address the missing parts with follow-up MRs. The missing parts are;
- Ingestion of the vulnerability flags
- Application layer validations(We have to disable some validations and use the
bulk_insert_safe
). - This new ingestion flow depends on the UUID values of findings to be correct and this may not be the case for all the projects but we already have a running migration to fix the UUID values(/cc @subashis).
Database Review
This new ingestion service heavily depends on PostgreSQLs upsert functionality and also uses bulk insert and bulk update methods. Here is the list of the SQL queries generated by the ingestion services;
Before ingestion starts
Loading the security scans
database review: looks good - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22791
/*application:web,line:/ee/app/services/security/ingestion/ingest_reports_service.rb:29:in `flat_map'*/
SELECT
"security_scans".*
FROM
"security_scans"
WHERE
"security_scans"."pipeline_id" = 704
AND (jsonb_array_length(COALESCE(info -> 'errors', '[]'::jsonb)) = 0)
AND "security_scans"."latest" = TRUE
Loading the security findings
database review: looking good - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22792
/*application:web,line:/ee/app/services/security/ingestion/finding_map_collection.rb:17:in `each'*/
SELECT
"security_findings".*
FROM
"security_findings"
WHERE
"security_findings"."scan_id" = 893
AND "security_findings"."deduplicated" = TRUE
Loading the job artifacts
database review: seems okay - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22793 and https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22794
/*application:web,line:/ee/app/models/security/scan.rb:79:in `job_artifact'*/
SELECT
"ci_builds".*
FROM
"ci_builds"
WHERE
"ci_builds"."type" = 'Ci::Build'
AND "ci_builds"."id" = 3133
LIMIT 1
/*application:web,line:/ee/app/models/security/scan.rb:79:in `job_artifact'*/
SELECT
"ci_job_artifacts".*
FROM
"ci_job_artifacts"
WHERE
"ci_job_artifacts"."job_id" = 3133
AND "ci_job_artifacts"."file_type" = 6
LIMIT 1
Actual ingestion process
Ingesting the vulnerability identifiers
This query creates records in the vulnerability_identifiers
table. At some point within the ingestion flow, we associate these records with findings by creating records in a join table(See "Associating the findings with identifiers").
database review: looking good - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22795
/*application:web,line:/ee/app/services/security/ingestion/bulk_insertable_task.rb:53:in `block in result_set'*/
INSERT INTO "vulnerability_identifiers" ("external_id", "external_type", "fingerprint", "name", "url", "project_id", "created_at", "updated_at")
VALUES ('37283ed4-0380-40d7-ada7-2d994afcc62a', 'gemnasium', '\x59f480b0e4ac2114b22ef406635d1b7143fa61f2', 'Gemnasium-37283ed4-0380-40d7-ada7-2d994afcc62a', 'https://deps.sec.gitlab.com/packages/npm/debug/versions/1.0.5/advisories', 51, '2021-09-10 17:12:48.053703', '2021-09-10 17:12:48.053721'), ('9952e574-7b5b-46fa-a270-aeb694198a98', 'gemnasium', '\xb4ecee7f3b6e9dc8f92d902c99d39bd449a9e0bc', 'Gemnasium-9952e574-7b5b-46fa-a270-aeb694198a98', 'https://deps.sec.gitlab.com/packages/npm/saml2-js/versions/1.5.0/advisories', 51, '2021-09-10 17:12:48.053703', '2021-09-10 17:12:48.053721'), ('CVE-2017-11429', 'cve', '\x38a94219aa20b920a6379e3e6ed466d694077ef5', 'CVE-2017-11429', 'https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-11429', 51, '2021-09-10 17:12:48.053703', '2021-09-10 17:12:48.053721')
ON CONFLICT ("project_id", "fingerprint")
DO UPDATE SET
"external_id" = excluded."external_id", "external_type" = excluded."external_type", "name" = excluded."name", "url" = excluded."url", "created_at" = excluded."created_at", "updated_at" = excluded."updated_at"
RETURNING
"fingerprint", "id"
Ingesting the findings(vulnerability_occurrences)
This query creates records in the vulnerability_occurrences
table(a.k.a. findings). It utilizes the unique column called uuid
to do UPSERT.
The return values of the query are the id
and vulnerability_id
of the rows. We use the vulnerability_id
to determine if the finding is a new record or not. Based on this value, we either create or update records in the next step.
database review: OK - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22797
/*application:web,line:/ee/app/services/security/ingestion/bulk_insertable_task.rb:53:in `block in result_set'*/
INSERT INTO "vulnerability_occurrences" ("confidence", "metadata_version", "name", "raw_metadata", "report_type", "severity", "details", "description", "message", "solution", "cve", "location", "primary_identifier_id", "location_fingerprint", "project_fingerprint", "uuid", "scanner_id", "project_id", "created_at", "updated_at")
VALUES (0, 'manual', 'elo', '{"user": {"name": "sally"}}', 99, 1, '{"user": {"name": "sally"}}', 'asd', 'asd', 'asd', 'asdas', '{"user": {"name": "sally"}}', 1892049, 'asdasdasdaddsadasdsa', 'asdasdasdsdasd', '569760c5-1373-5be3-b066-04ccb6c0a0aa', 37054, 27633223, '2021-09-10 17:12:48.053703', '2021-09-10 17:12:48.053721')
ON CONFLICT ("uuid")
DO UPDATE SET
"confidence" = excluded."confidence", "metadata_version" = excluded."metadata_version", "name" = excluded."name", "raw_metadata" = excluded."raw_metadata", "report_type" = excluded."report_type", "severity" = excluded."severity", "details" = excluded."details", "description" = excluded."description", "message" = excluded."message", "solution" = excluded."solution", "cve" = excluded."cve", "location" = excluded."location", "primary_identifier_id" = excluded."primary_identifier_id", "location_fingerprint" = excluded."location_fingerprint", "project_fingerprint" = excluded."project_fingerprint", "scanner_id" = excluded."scanner_id", "project_id" = excluded."project_id", "updated_at" = excluded."updated_at"
RETURNING
"id", "vulnerability_id"
Creating the new vulnerabilities
If we have NULL
vulnerability_id
values from the previous query, it means that we've created new findings which need to have new vulnerability
records to be created in the database and this query creates those records. We don't run this query if there are no new records.
database review: looks okay - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22798
/*application:web,line:/ee/app/services/security/ingestion/bulk_insertable_task.rb:53:in `block in result_set'*/
INSERT INTO "vulnerabilities" ("author_id", "project_id", "title", "state", "severity", "confidence", "report_type", "dismissed_at", "dismissed_by_id", "updated_at", "created_at")
VALUES (329579, 27633223, 'Regular Expression Denial of Service in debug', 1, 2, 2, 1, NULL, NULL, '2021-09-13 13:45:07.502056', '2021-09-13 13:45:07.502072'), (329579, 27633223, 'Authentication bypass via incorrect DOM traversal and canonicalization in saml2-js', 1, 2, 2, 1, NULL, NULL, '2021-09-13 13:45:07.502085', '2021-09-13 13:45:07.502089')
ON CONFLICT
DO NOTHING
RETURNING
id
Updating the existing vulnerabilities
database review: looking good - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22799
/*application:web,line:/ee/app/services/security/ingestion/bulk_updatable_task.rb:41:in `execute'*/
UPDATE
vulnerabilities
SET
id = map.id::bigint,
title = map.title::character varying(255),
severity = map.severity::smallint,
confidence = map.confidence::smallint,
updated_at = map.updated_at::timestamp WITH time zone
FROM (
VALUES (20788, 'Regular Expression Denial of Service in debug', 2, 2, '2021-09-10 17:12:48.118919'),
(20789, 'Authentication bypass via incorrect DOM traversal and canonicalization in saml2-js', 2, 2, '2021-09-10 17:12:48.118935')) AS map (id, title, severity, confidence, updated_at)
WHERE
vulnerabilities.id = map.id
Inserting the finding_pipeline records
Associates the findings with pipelines by creating records in a join table called vulnerability_occurrence_pipelines
.
database review: looking good - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22800
/*application:web,line:/ee/app/services/security/ingestion/bulk_insertable_task.rb:53:in `block in result_set'*/
INSERT INTO "vulnerability_occurrence_pipelines" ("pipeline_id", "occurrence_id", "updated_at", "created_at")
VALUES (704, 25033, '2021-09-10 17:12:48.141864', '2021-09-10 17:12:48.141878'), (704, 25034, '2021-09-10 17:12:48.141884', '2021-09-10 17:12:48.141887')
ON CONFLICT
DO NOTHING
RETURNING
"id"
Associating the findings with identifiers
Associates the findings with the previously upserted identifiers by creating the in the vulnerability_occurrence_identifiers
join table.
database review: looking good - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22801
/*application:web,line:/ee/app/services/security/ingestion/bulk_insertable_task.rb:53:in `block in result_set'*/
INSERT INTO "vulnerability_occurrence_identifiers" ("occurrence_id", "identifier_id", "updated_at", "created_at")
VALUES (25033, 5558, '2021-09-10 17:12:48.156374', '2021-09-10 17:12:48.156389'), (25034, 5559, '2021-09-10 17:12:48.156395', '2021-09-10 17:12:48.156398'), (25034, 5560, '2021-09-10 17:12:48.156401', '2021-09-10 17:12:48.156403')
ON CONFLICT ("occurrence_id", "identifier_id")
DO UPDATE SET
"updated_at" = excluded."updated_at"
RETURNING
"id"
Loading the existing finding links
database review: OK - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22802
/*application:web,line:/ee/app/services/security/ingestion/tasks/ingest_finding_links.rb:37:in `group_by'*/
SELECT
"vulnerability_finding_links".*
FROM
"vulnerability_finding_links"
WHERE
"vulnerability_finding_links"."vulnerability_occurrence_id" IN (25033, 25034)
Creating the new finding links
database review: looks good - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22803
/*application:web,line:/ee/app/services/security/ingestion/bulk_insertable_task.rb:53:in `block in result_set'*/
INSERT INTO "vulnerability_finding_links" ("vulnerability_occurrence_id", "name", "url", "updated_at", "created_at")
VALUES (25033, NULL, 'https://nodesecurity.io/advisories/534', '2021-09-10 17:12:48.179365', '2021-09-10 17:12:48.179380'), (25033, NULL, 'https://github.com/visionmedia/debug/issues/501', '2021-09-10 17:12:48.179385', '2021-09-10 17:12:48.179388'), (25033, NULL, 'https://github.com/visionmedia/debug/pull/504', '2021-09-10 17:12:48.179391', '2021-09-10 17:12:48.179393'), (25034, NULL, 'https://github.com/Clever/saml2/commit/3546cb61fd541f219abda364c5b919633609ef3d#diff-af730f9f738de1c9ad87596df3f6de84R279', '2021-09-10 17:12:48.179401', '2021-09-10 17:12:48.179403'), (25034, NULL, 'https://github.com/Clever/saml2/issues/127', '2021-09-10 17:12:48.179406', '2021-09-10 17:12:48.179409'), (25034, NULL, 'https://www.kb.cert.org/vuls/id/475445', '2021-09-10 17:12:48.179412', '2021-09-10 17:12:48.179414')
ON CONFLICT
DO NOTHING
RETURNING
"id"
Loading the existing remediations for the project with particular checksums
The checksum
column holds the SHA1 hash value of the remediation file content. We load the existing records into memory to filter the new records and create the records for them only.
database review: looks okay - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22804
/*application:web,line:/ee/app/services/security/ingestion/tasks/ingest_remediations.rb:63:in `existing_remediations'*/
SELECT
"vulnerability_remediations".*
FROM
"vulnerability_remediations"
WHERE
"vulnerability_remediations"."project_id" = 51
AND "vulnerability_remediations"."checksum" = '\xf00bc6261fa512f0960b7fc3bfcce7fb31997cf32b96fa647bed5668b2c77fee'
Creating the remediations
Vulnerabilities::Remediation
is an ActiveRecord model with file attachment. The actual remediation content(git diff
) is stored in a file attached to the record as the remediation content can be huge. For this reason, the ingestion of remediations happens one by one to run the ActiveRecord callbacks which upload the file, etc.
database review: looks okay
- https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22805
- https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22806
- https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22807
- https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22808
/*application:web,line:/lib/gitlab/database.rb:244:in `block in transaction'*/
INSERT INTO "vulnerability_remediations" ("created_at", "updated_at", "summary", "file", "checksum", "project_id")
VALUES ('2021-09-13 14:25:48.300322', '2021-09-13 14:25:48.300322', 'Upgrade saml2-js', 'f00bc6261fa512f0960b7fc3bfcce7fb31997cf32b96fa647bed5668b2c77fee.diff', '\xf00bc6261fa512f0960b7fc3bfcce7fb31997cf32b96fa647bed5668b2c77fee', 27633223)
RETURNING
"id"
/*application:web,line:/app/uploaders/records_uploads.rb:30:in `readd_upload'*/
DELETE FROM "uploads"
WHERE "uploads"."id" IN (
SELECT
"uploads"."id"
FROM
"uploads"
WHERE
"uploads"."uploader" = 'AttachmentUploader'
AND "uploads"."model_type" = 'Vulnerabilities::Remediation'
AND "uploads"."model_id" = 137
AND "uploads"."path" = 'uploads/-/system/vulnerabilities/remediation/file/137/f00bc6261fa512f0960b7fc3bfcce7fb31997cf32b96fa647bed5668b2c77fee.diff'
ORDER BY
"uploads"."id" DESC)
/*application:web,line:/app/uploaders/records_uploads.rb:33:in `tap'*/
INSERT INTO "uploads" ("size", "path", "checksum", "model_id", "model_type", "uploader", "created_at", "mount_point")
VALUES (9748, 'uploads/-/system/vulnerabilities/remediation/file/137/f00bc6261fa512f0960b7fc3bfcce7fb31997cf32b96fa647bed5668b2c77fee.diff', 'f00bc6261fa512f0960b7fc3bfcce7fb31997cf32b96fa647bed5668b2c77fee', 137, 'Vulnerabilities::Remediation', 'AttachmentUploader', '2021-09-13 14:25:48.307406', 'file')
RETURNING
"id"
/*application:web,line:/app/models/concerns/file_store_mounter.rb:19:in `update_file_store'*/
UPDATE
"vulnerability_remediations"
SET
"file_store" = 1
WHERE
"vulnerability_remediations"."id" = 137
Associating the findings with remediations
With this query, we associate the findings with the previously created remediations.
database review: looks okay - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22809
/*application:web,line:/ee/app/services/security/ingestion/bulk_insertable_task.rb:53:in `block in result_set'*/
INSERT INTO "vulnerability_findings_remediations" ("vulnerability_occurrence_id", "vulnerability_remediation_id", "updated_at", "created_at")
VALUES (6814415, 682, '2021-09-10 17:12:48.225164', '2021-09-10 17:12:48.225181'), (6897629, 818, '2021-09-10 17:12:48.225194', '2021-09-10 17:12:48.225198')
ON CONFLICT ("vulnerability_occurrence_id", "vulnerability_remediation_id")
DO UPDATE SET
"updated_at" = excluded."updated_at"
RETURNING
id
Removing the orphan finding remediation associations
database review: looks okay - https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/6558/commands/22810
/*application:web,line:/ee/app/services/security/ingestion/tasks/ingest_remediations.rb:22:in `after_ingest'*/
DELETE FROM "vulnerability_findings_remediations"
WHERE "vulnerability_findings_remediations"."vulnerability_occurrence_id" IN (25033, 25034)
AND "vulnerability_findings_remediations"."id" NOT IN (51984, 51985)
Backend Review
Design explanation
In this MR, I am introducing 3 new sets of classes;
-
Utility classes to use during the ingestion;
- FindingMap: Encapsulates the data being used during the ingestion and supports some operations to manipulate its state. Even though it looks like a magic container(an anti-pattern), it is an argument object class as it provides methods to interact with its state.
-
FindingMapCollection: This kinda has two responsibilities;
- Acts as a factory class for the
FindingMap
objects - Acts as a collection of
FindingMap
objects by implementing the Enumerable interface
- Acts as a factory class for the
-
The service classes by the dependency order;
-
IngestReportsService: Initiates the ingestion process and updates some meta-data like;
has_vulnerabilities
,latest_pipeline_id
, etc. -
IngestReportService: Responsible for slicing down the findings to ingest for the given
security_scan
. -
IngestReportSliceService Runs the series of tasks for the given collection of
finding_map
objects. This is introduced to run ingestion in transactional batches to prevent leaving the database in an inconsistent state. I didn't want to run everything in one transaction as long-running transactions can put the burden on the database's shoulders.
-
IngestReportsService: Initiates the ingestion process and updates some meta-data like;
-
The task classes and the
IngestReportSliceService
class are implemented to achieve an easy-to-extend design(pipeline-filter pattern). With this design, when we want to implement a new feature, we can just introduce a newTaskClass
and encapsulate the logic in it without touching any other place(FindingMap
can be changed though). Here is the list of Task classes implemented in this MR;- IngestIdentifiers
- IngestFindings
- IngestVulnerabilities
- AttachFindingsToVulnerabilities
- IngestFindingPipelines
- IngestFindingIdentifiers
- IngestFindingLinks
- IngestFindingSignatures
- IngestRemediations
Benchmark between the new flow and the old one
The benchmark script can be found here: https://gitlab.com/-/snippets/2165653.
The following results are collected by running the benchmark for a pipeline of 2 security scans with 1002 findings.
#########################
Results for "ingestion"
NUMBER OF QUERIES: 282
DURATION: 438.042000
#########################
#########################
Results for "store"
NUMBER OF QUERIES: 10101
DURATION: 3476.353012
#########################
Calculating -------------------------------------
ingestion 1.000 i/100ms
store 1.000 i/100ms
-------------------------------------------------
ingestion 0.526 (± 0.0%) i/s - 3.000 in 5.732515s
store 0.034 (± 0.0%) i/s - 1.000 in 29.321573s
Comparison:
ingestion: 0.5 i/s
store: 0.0 i/s - 15.41x slower
Comparison between the new flow and the old one
From a high-level of abstraction point of view, there is no difference between the new ingestion flow and the old one;
The biggest difference is, how we organize the responsibilities, transactional ingestion & error handling;
Related to #276498 (closed)
Does this MR meet the acceptance criteria?
Conformity
-
I have included changelog trailers, or none are needed. (Does this MR need a changelog?) -
I have added/updated documentation, or it's not needed. (Is documentation required?) -
I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?) -
I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?) -
I have self-reviewed this MR per code review guidelines. -
This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines) -
I have followed the style guides. -
This change is backwards compatible across updates, or this does not apply.
Availability and Testing
- [-] I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.)
- [-] I have tested this MR in all supported browsers, or it's not needed.
- [-] I have informed the Infrastructure department of a default or new setting change per definition of done, or it's not needed.
Security
Does this MR contain changes to processing or storing of credentials or tokens, authorization and authentication methods or other items described in the security review guidelines? If not, then delete this Security section.
- [-] Label as security and @ mention
@gitlab-com/gl-security/appsec
- [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- [-] Security reports checked/validated by a reviewer from the AppSec team