POC: Restructure Security Reports

Note

The discussions and work around this proposal continue to evolve. Here are the up-to-date links:

BrownBag Session: Restructure Security Report Schemas: Proof of Concept
- Video, Slides, Agenda/Questions Google Doc
gitlab-org/gitlab - Draft: Adds generic security report type (created for the BrownBag). Split into:
- Draft: Adds security report details field rendering
security-report-schemas - Draft: Adds generic security report schema (created for the BrownBag)

Summary

This issue is the result of some thoughts that have occurred to me while working with the security-report-schemas project to add the fuzz-coverage schema (and the related draft refactor MR).

I realize this is a very high-impact area of code over which many discussions have been had.

However, I feel we are making the code related to handling analyzers and security reports more difficult to work with, less flexible, and harder to change, rather than moving in the opposite direction where we can move faster and easier. This issue is a proposal for how we can iteratively simplify and refactor the handling of security reports.

Improvements

Top-level points of change in this proposal:

The individual security reports should
- be reduced down to one schema, a "security" report schema
- have a free-form, informational details field that is composed of typed objects and is associated with each vulnerability occurrence.
- have a free-form, informational metadata field that is composed of typed objects and is associated with the project itself (a top-level field in the security report). This would let us implement analyzers whose sole purpose is to enhance the effectiveness of other analyzers by providing additional context or information about the project. This brainstorming issue Make the output of one analyzer influence the output of another analyzer explores how this might work.
- not be restricted to known analyzer types (in other words, sast, dast, etc.)
- declare their intention in identifying vulnerability uniqueness
The vulnerability details frontend should
- know how to render the typed, free-form details field. proof-of-concept MR
- test against the free-form, typed object schemas
The rails backend should
- Not require a specific type of security report - only that it conform to the "security report" schema
- should test against the root "security report" schema
The DB
- vulnerability_occurrences and related database tables should only contain the minimal fields that all security reports must provide. These should not be analyzer-specific and should only be fields that we need to directly index and search on. This is mostly the case already, minus the location field.

Benefits

If the above points are implemented:

Integrating a new analyzer would require zero database, rails, frontend, or schema changes
We are not restricting ourselves to known analyzer types
Changes to the database would be rare
Changes to the "security report" schema would be rare (or at least not analyzer-specific)
The schema would be the final say on the security report format
Analyzers are free to include any information in the details field to provide additional context or insight to the user.
All analyzers benefit as the rendering of the free-form core data types improves
Adding advanced, data-specific core data types will benefit all analyzers, not only specific types. For example, an analyzer that renders information about an HTTP endpoint could be used by SAST, DAST, or any analyzer that includes information about HTTP endpoints in their report.
Unique vulnerabilities will be more easily tracked as the code changes

Reasoning

1. The Analyzer Report Schemas are too Specific

The database schema for vulnerabilities/vulnerability occurrences should be kept to a bare minimum (it mostly is already). The bar for adding core fields that are required and strongly-typed should be very high and rare to pass. To me this means they are added only if we need to index and search directly on the field in an analyzer-agnostic way.

For the security report schemas, having too-specific schemas for each supported analyzer type restricts and impedes us more than it helps us. Most fields in the schema are purely informational. It also leads to the ruby code needing to implement analyzer-specific processing in order to interpret the meaning of the analyzer's report. Using the the location field to determine uniqueness is an example of this.

2. Schemas Should Drive the Structure

Building off of the first point, the security report schemas should be the ultimate reference, not the code. It is something we can point a third-party to and is the contract between the analyzers and GitLab.

The rails and UI code should be tested against a versioned schema, and the analyzer's output should also be tested against the schema. Semantic versioning of the schemas should ensure backwards compatibility between minor versions, with compatibility-breaking changes requiring major version increments. The rails/frontend should only use analyzer versions that produce security reports with supported version numbers.

In other words, analyzers may be developed and released that produce security reports with versions beyond what is supported by GitLab. However, only analyzers with supported security report versions should be used in GitLab. (Side note - GitLab could provide its supported schema range to the analyzer, and the analyzer would then produce a report with its highest supported version number that falls within GitLab's expected range. The job would fail if no overlap in supported version ranges exists.

Reiterating the first point, the schemas (or schema, see the next point) should be minimal. The only required fields should probably be the fields required in the database tables.

3. Analyzer-Specific Data Should be Free Form, Only One Schema Should Exist

If we have a minimal, core set of required fields in the schema (that are probably mapped directly to the DB table), all other fields are informational to the user.

We can define core data types that the free-form data must be composed of. Analyzers should be free to construct their specific data how they wish. This allows analyzers to re-use core data-types in ways we might not have anticipated.

Taking this thought further, the concept of SAST, DAST, secret detection, and other analyzer-specific reports could go away. Only a "security" report would remain. We would keep the requirements for metadata about the scanner (name, type, vendor, version) for informational purposes, but do not need to enforce that the scanner type be from a hard-coded list. I think having this strict requirement is limiting us to what we currently are aware of, instead of making future unknown options more accessible.

If all analyzer-specific data was stored using the core data-types in a free-form field, we could reduce our security report schemas down to being a single schema. The single schema would define the required fields' definitions and core data types' definitions that the free-form field would be composed of.

4. How Vulnerability Uniqueness is Derived Should Not Be Assumed

This is tied to the rails backend, the database schema and the location field in the security-report-format JSON.

Currently vulnerabilty uniqueness is calculated based on a hash of a value derived from the contents of the location JSON object in the security report. Processing occurs in the rails code to calculate the hash and compare it to other vulnerabilities.

This should change to provide analyzers a way to declare their intention in calculating unique vulnerabilities that doesn't require the analyzer to have the full context of the project.

This issue goes into more detail.

5. Default Case: New Analyzers Require No Rails, Schema, Frontend, or DB changes

If we have core, analyzer-agnostic fields defined in the security schema, and a way to display those data types to the user, adding a new analyzer should require no rails, db, schema, or frontend changes. All analyzer-specific data will be composed of the core data types, which the UI will already know how to render.

Risks

This will involve code changes in many locations: DB, frontend, rails, security-report-schemas, and each analyzer. This is itself a risk and should be performed incrementally.

Backwards-compatibility with analyzer-specific report types will need to be maintained, at least for the current major version of GitLab (13.X).

Involved Components

GitLab
- db/structure.sql (maybe? TBD)
- db/migrate/new_migration
- ee/app/assets/javascripts/vue_shared/security_reports/
- ee/app/helpers/vulnerabilities_helper.rb
- ee/app/models/vulnerabilities/occurrence.rb
- ee/app/serializers/vulnerabilities/finding_entity.rb
- ee/lib/gitlab/ci/parsers/security/
- ee/lib/gitlab/ci/reports/security/
- ee/app/models/ee/ci/job_artifact.rb
- doc
Security Report Schemas
analyzers

Intended Side Effects

Vulnerabilities detail UI will be dynamically created
Easier maintenance
Easier to integrate new types of analyzers
Easier to conform to the security report schema
Ability to have non-vulnerability-producing analyzers - analyzers may exist only to provide additional metadata that enhances the effectiveness of other analyzers.
Vulnerability uniqueness will work better among changing code

Proposed Iterations

1.A Add the `details` Field to Base Security Report

This proof-of-concept MR (previous one) demonstrates how this might work on the frontend/db/rails side of things.

Schema changes would need to be made to the security_report_format.json root schema to define the core object types and the optional details field.

The frontend and rails code could start testing against the core data types used by the details field.

1.B Add the `metadata` Top-Level Field in the Security Report

(copied from the improvements section above) A free-form, informational metadata field could be added to the security report that is composed of typed objects and is associated with the project itself (a top-level field in the security report). This would let us implement analyzers whose sole purpose is to enhance the effectiveness of other analyzers by providing additional context or information about the project. This brainstorming issue Make the output of one analyzer influence the output of another analyzer explores how this might work.

The MVC version of this field could simply be a free-form, typed, optional field that exists in the security report schema. It would only be used during a pipeline and would not need to be persisted in the database.

Persisting this data in the database and displaying it to the user would be very interesting and could enable iterative code scanning that reused metadata from previous analysis pipelines. This should be done in a future iteration.

1.C Add a New `uniqueness` Required Field for Each Vulnerability

This would require schema changes to the base security-report-schema, as well as rails changes See this issue for a more in-depth discussion.

2. Add a New Generic "security" Report Type

This report type would be what the analyzers will eventually migrate to. No analyzer-specific fields would be required in the schema. This would require changes in:

security-report-schemas
GitLab

In the UI, the scanner name/type, vulnerability-information, etc should all show up the same as before. The user doesn't know or care which report type generated the vulnerabilities - only the Scanner Type and Severity are shown as filter and sort options.

This may involve disassociating the scanner type from the report type in the rails code.

3. Minimize the Database Schema for Vulnerability Occurrences

The only field that is currently report-specific is the location_fingerprint field. I believe we have opportunity here to improve how we track vulnerabilities in evolving codebases.

4. Migrate Analyzers to Use the New "security" Report

Analyzers should migrate their generated reports to the "security" report format, with all informational fields appearing in the details free-form field.

This would need to be done for each analyzer. At this stage we would also have an opportunity to add additional information from the analyzer into the details field that previously didn't fit the scanner-specific report type. For example, suppose the nodejs SAST analyzer has additional context that could be useful to the user that didn't previously fit into the sast report schema.

Edited Oct 28, 2020 by Matt Wilson