POC: Restructure Security Reports
Note
The discussions and work around this proposal continue to evolve. Here are the up-to-date links:
- BrownBag Session: Restructure Security Report Schemas: Proof of Concept
- gitlab-org/gitlab - Draft: Adds generic security report type (created for the BrownBag). Split into:
- security-report-schemas - Draft: Adds generic security report schema (created for the BrownBag)
Summary
This issue is the result of some thoughts that have occurred to me while working with the security-report-schemas project to add the fuzz-coverage schema (and the related draft refactor MR).
I realize this is a very high-impact area of code over which many discussions have been had.
However, I feel we are making the code related to handling analyzers and security reports more difficult to work with, less flexible, and harder to change, rather than moving in the opposite direction where we can move faster and easier. This issue is a proposal for how we can iteratively simplify and refactor the handling of security reports.
Improvements
Top-level points of change in this proposal:
- The individual security reports should
- be reduced down to one schema, a "security" report schema
- have a free-form, informational
details
field that is composed of typed objects and is associated with each vulnerability occurrence. - have a free-form, informational
metadata
field that is composed of typed objects and is associated with the project itself (a top-level field in the security report). This would let us implement analyzers whose sole purpose is to enhance the effectiveness of other analyzers by providing additional context or information about the project. This brainstorming issue Make the output of one analyzer influence the output of another analyzer explores how this might work. - not be restricted to known analyzer types (in other words,
sast
,dast
, etc.) - declare their intention in identifying vulnerability uniqueness
- The vulnerability details frontend should
- know how to render the typed, free-form
details
field. proof-of-concept MR - test against the free-form, typed object schemas
- know how to render the typed, free-form
- The rails backend should
- Not require a specific type of security report - only that it conform to the "security report" schema
- should test against the root "security report" schema
- The DB
-
vulnerability_occurrences
and related database tables should only contain the minimal fields that all security reports must provide. These should not be analyzer-specific and should only be fields that we need to directly index and search on. This is mostly the case already, minus the location field.
-
Benefits
If the above points are implemented:
- Integrating a new analyzer would require zero database, rails, frontend, or schema changes
- We are not restricting ourselves to known analyzer types
- Changes to the database would be rare
- Changes to the "security report" schema would be rare (or at least not analyzer-specific)
- The schema would be the final say on the security report format
- Analyzers are free to include any information in the
details
field to provide additional context or insight to the user. - All analyzers benefit as the rendering of the free-form core data types improves
- Adding advanced, data-specific core data types will benefit all analyzers, not only specific types. For example, an analyzer that renders information about an HTTP endpoint could be used by SAST, DAST, or any analyzer that includes information about HTTP endpoints in their report.
- Unique vulnerabilities will be more easily tracked as the code changes
Reasoning
1. The Analyzer Report Schemas are too Specific
The database schema for vulnerabilities/vulnerability occurrences should be kept to a bare minimum (it mostly is already). The bar for adding core fields that are required and strongly-typed should be very high and rare to pass. To me this means they are added only if we need to index and search directly on the field in an analyzer-agnostic way.
For the security report schemas, having too-specific schemas for each supported analyzer type restricts and impedes us more than it helps us. Most fields in the schema are purely informational. It also leads to the ruby code needing to implement analyzer-specific processing in order to interpret the meaning of the analyzer's report. Using the the location field to determine uniqueness is an example of this.
2. Schemas Should Drive the Structure
Building off of the first point, the security report schemas should be the ultimate reference, not the code. It is something we can point a third-party to and is the contract between the analyzers and GitLab.
The rails and UI code should be tested against a versioned schema, and the analyzer's output should also be tested against the schema. Semantic versioning of the schemas should ensure backwards compatibility between minor versions, with compatibility-breaking changes requiring major version increments. The rails/frontend should only use analyzer versions that produce security reports with supported version numbers.
In other words, analyzers may be developed and released that produce security reports with versions beyond what is supported by GitLab. However, only analyzers with supported security report versions should be used in GitLab. (Side note - GitLab could provide its supported schema range to the analyzer, and the analyzer would then produce a report with its highest supported version number that falls within GitLab's expected range. The job would fail if no overlap in supported version ranges exists.
Reiterating the first point, the schemas (or schema, see the next point) should be minimal. The only required fields should probably be the fields required in the database tables.
3. Analyzer-Specific Data Should be Free Form, Only One Schema Should Exist
If we have a minimal, core set of required fields in the schema (that are probably mapped directly to the DB table), all other fields are informational to the user.
We can define core data types that the free-form data must be composed of. Analyzers should be free to construct their specific data how they wish. This allows analyzers to re-use core data-types in ways we might not have anticipated.
Taking this thought further, the concept of SAST, DAST, secret detection, and other analyzer-specific reports could go away. Only a "security" report would remain. We would keep the requirements for metadata about the scanner (name, type, vendor, version) for informational purposes, but do not need to enforce that the scanner type be from a hard-coded list. I think having this strict requirement is limiting us to what we currently are aware of, instead of making future unknown options more accessible.
If all analyzer-specific data was stored using the core data-types in a free-form field, we could reduce our security report schemas down to being a single schema. The single schema would define the required fields' definitions and core data types' definitions that the free-form field would be composed of.
4. How Vulnerability Uniqueness is Derived Should Not Be Assumed
This is tied to the rails backend, the database schema and the location
field
in the security-report-format JSON.
Currently vulnerabilty uniqueness is calculated based on a hash of a value
derived from the
contents of the location
JSON object in the security report. Processing
occurs in the rails code to calculate the hash and compare it to other
vulnerabilities.
This should change to provide analyzers a way to declare their intention in calculating unique vulnerabilities that doesn't require the analyzer to have the full context of the project.
This issue goes into more detail.
5. Default Case: New Analyzers Require No Rails, Schema, Frontend, or DB changes
If we have core, analyzer-agnostic fields defined in the security schema, and a way to display those data types to the user, adding a new analyzer should require no rails, db, schema, or frontend changes. All analyzer-specific data will be composed of the core data types, which the UI will already know how to render.
Risks
This will involve code changes in many locations: DB, frontend, rails, security-report-schemas, and each analyzer. This is itself a risk and should be performed incrementally.
Backwards-compatibility with analyzer-specific report types will need to be maintained, at least for the current major version of GitLab (13.X).
Involved Components
-
GitLab
-
db/structure.sql
(maybe? TBD) db/migrate/new_migration
ee/app/assets/javascripts/vue_shared/security_reports/
ee/app/helpers/vulnerabilities_helper.rb
ee/app/models/vulnerabilities/occurrence.rb
ee/app/serializers/vulnerabilities/finding_entity.rb
ee/lib/gitlab/ci/parsers/security/
ee/lib/gitlab/ci/reports/security/
ee/app/models/ee/ci/job_artifact.rb
doc
-
- Security Report Schemas
- analyzers
Intended Side Effects
- Vulnerabilities detail UI will be dynamically created
- Easier maintenance
- Easier to integrate new types of analyzers
- Easier to conform to the security report schema
- Ability to have non-vulnerability-producing analyzers - analyzers may exist only to provide additional metadata that enhances the effectiveness of other analyzers.
- Vulnerability uniqueness will work better among changing code
Proposed Iterations
details
Field to Base Security Report
1.A Add the This proof-of-concept MR (previous one) demonstrates how this might work on the frontend/db/rails side of things.
Schema changes would need to be made to
the security_report_format.json
root schema to define the core object
types and the optional details
field.
The frontend and rails code could start testing against the core data
types used by the details
field.
metadata
Top-Level Field in the Security Report
1.B Add the (copied from the improvements section above) A free-form, informational metadata
field could be added to the security report that is composed of typed objects and is associated with the project itself (a top-level field in the security report). This would let us implement analyzers whose sole purpose is to enhance the effectiveness of other analyzers by providing additional context or information about the project. This brainstorming issue Make the output of one analyzer influence the output of another analyzer explores how this might work.
The MVC version of this field could simply be a free-form, typed, optional field that exists in the security report schema. It would only be used during a pipeline and would not need to be persisted in the database.
Persisting this data in the database and displaying it to the user would be very interesting and could enable iterative code scanning that reused metadata from previous analysis pipelines. This should be done in a future iteration.
uniqueness
Required Field for Each Vulnerability
1.C Add a New This would require schema changes to the base security-report-schema, as well as rails changes See this issue for a more in-depth discussion.
2. Add a New Generic "security" Report Type
This report type would be what the analyzers will eventually migrate to. No analyzer-specific fields would be required in the schema. This would require changes in:
- security-report-schemas
- GitLab
In the UI, the scanner name/type, vulnerability-information, etc should all show up the same as before. The user doesn't know or care which report type generated the vulnerabilities - only the Scanner Type and Severity are shown as filter and sort options.
This may involve disassociating the scanner type from the report type in the rails code.
3. Minimize the Database Schema for Vulnerability Occurrences
The only field that is currently report-specific is the location_fingerprint
field. I believe we have opportunity here to improve how we track vulnerabilities
in evolving codebases.
4. Migrate Analyzers to Use the New "security" Report
Analyzers should migrate their generated reports to the "security" report
format, with all informational fields appearing in the details
free-form field.
This would need to be done for each analyzer. At this stage we would also have
an opportunity to add additional information from the analyzer into
the details
field that previously didn't fit the scanner-specific report type.
For example, suppose the nodejs SAST analyzer has additional context that
could be useful to the user that didn't previously fit into the sast
report
schema.