Enhancements to security report validation (#8900)


## Purpose

With the introduction of schema validation and enforcement on security report artifacts, we can now better prevent malformed finding records or otherwise invalid data from being ingested and stored. One side effect of this enforcement is that a single invalid finding in a large report causes the entire report to be rejected or truncated. We want to strike a better balance between outright rejection and continuing to allow invalid data to reach the database.

The general idea is to create "soft" enforcements of certain schema constraints. Unfortunately, the current [JSON schema validation](https://json-schema.org/draft/2020-12/json-schema-validation.html) spec doesn't support partial validation. This means we can't exclude only the invalid finding records while allowing the rest to pass. To get around this limitation, we are proposing our own Ruby gem that handles the validation and allows for such soft enforcement.

With this, as long as the main structure of a security report is valid, we can exclude invalid results and inform the end user of which records were skipped. For example, a finding with a field that exceeds the max length specified in the schema will be skipped, but it won't prevent the rest of the valid findings in the report from being ingested.

## Details

1. We should look at this holistically, not just for vulnerability links (i.e. `name` and `url`).
1. We want to document and improve visibility on how validation exceptions are handled.
   1. For example, the documentation explains how findings are deduplicated, but we still get many inquiries about it and have to point people to the docs. This information could be visible in the pipeline security report (e.g. SAST: 100 findings, 15 duplicates, etc.).
   1. We should explain whether an entry was ignored, truncated, or modified by the ingestion process.
1. The new [rubygem for validation](https://gitlab.com/gitlab-org/security-products/security-report-schemas-ruby) may help us by allowing us to define custom schema rules. This way we can have levels of enforcement:
   1. JSON schema validation on the whole report. For data that absolutely needs to be present and in the right format. The whole report is rejected if this fails.
   1. Custom schema validation of certain fields. For fields that have a "soft" `maxLength`, meaning the validation would produce a warning, but the value would be truncated and ingested.
   1. Semantic validation of certain fields. For fields that require logical processing. As above, they would produce a warning but may or may not be ingested. For example, `url` values can't be truncated as this would probably break the URI, but we could ignore these invalid entries and still ingest the valid entries.
1. On the point of conditionally ingesting or modifying reports: we need to be meticulous and deliberate about what we choose to alter. We may cause greater surprise to users if their vulnerability management data is unexpectedly different from the findings in the raw reports.
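
The three levels of enforcement described above could be sketched roughly as follows. This is only an illustration using the Ruby standard library, not the API of the proposed gem. The method names (`validate_structure!`, `truncate_fields`, `valid_links?`, `ingest`), the `SOFT_MAX_LENGTH` table, the limit of 20 characters, and the simplified report shape (`vulnerabilities`, `name`, `links`, `url`) are all assumptions made for the example.

```ruby
require "json"
require "uri"

# Hypothetical soft limits; the real schema defines its own constraints.
SOFT_MAX_LENGTH = { "name" => 20 }.freeze

# Level 1: hard structural check. The whole report is rejected on failure.
def validate_structure!(report)
  return if report.is_a?(Hash) && report["vulnerabilities"].is_a?(Array)

  raise ArgumentError, "report rejected: missing 'vulnerabilities' array"
end

# Level 2: soft maxLength. Returns a copy with over-long values truncated
# and records one warning per truncated field.
def truncate_fields(finding, index, warnings)
  SOFT_MAX_LENGTH.each_with_object(finding.dup) do |(field, max), copy|
    value = copy[field]
    next unless value.is_a?(String) && value.length > max

    copy[field] = value[0, max]
    warnings << "finding #{index}: '#{field}' truncated to #{max} characters"
  end
end

# Level 3: semantic check. A URL cannot be truncated safely, so it is
# only classified as valid (absolute http/https with a host) or not.
def valid_url?(value)
  uri = URI.parse(value.to_s)
  uri.is_a?(URI::HTTP) && !uri.host.nil?
rescue URI::InvalidURIError
  false
end

def valid_links?(finding)
  Array(finding["links"]).all? do |link|
    link.is_a?(Hash) && valid_url?(link["url"])
  end
end

# Applies the three levels in order and reports what happened to each finding.
def ingest(report)
  validate_structure!(report)
  warnings = []
  ingested = []

  report["vulnerabilities"].each_with_index do |finding, index|
    unless finding.is_a?(Hash) && valid_links?(finding)
      warnings << "finding #{index}: skipped, invalid link url"
      next
    end
    ingested << truncate_fields(finding, index, warnings)
  end

  { ingested: ingested, warnings: warnings }
end

report = JSON.parse(<<~REPORT)
  {"vulnerabilities": [
    {"name": "SQL injection", "links": [{"url": "https://example.com/a"}]},
    {"name": "An extremely long finding name", "links": []},
    {"name": "Bad link", "links": [{"url": "not a url"}]}
  ]}
REPORT

result = ingest(report)
puts "#{result[:ingested].size} ingested, #{result[:warnings].size} warnings"
# prints "2 ingested, 2 warnings"
```

In this sketch the second finding is ingested with its `name` truncated, the third is skipped because its `url` is not a valid URI, and both outcomes end up in `warnings`, which is the kind of information a pipeline security report could surface to the user.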
