Add a Scan DB model to persist status and metadata of security reports

Problem to solve

In several places, we currently cannot distinguish between having no report available and having a report with no vulnerabilities. This is because we store or fetch only a list of vulnerabilities, so an empty list covers both cases.

Some features need to know more about a scan than its list of vulnerabilities, and a Scan DB model could provide that.
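
The ambiguity can be illustrated with a minimal plain-Ruby sketch. The `Scan` struct, the `report_state` helper, and the hash-based vulnerabilities are hypothetical stand-ins for illustration, not GitLab code; the point is only that the same empty vulnerability list can be told apart once a per-scan record exists.

```ruby
# Hypothetical in-memory stand-in for the proposed record (not GitLab code).
Scan = Struct.new(:scan_type, :build_id)

# Returns :no_report, :clean, or :vulnerable for the given scan type.
# Without the scans argument, the first two states are indistinguishable.
def report_state(scans, vulnerabilities, scan_type)
  return :no_report if scans.none? { |scan| scan.scan_type == scan_type }

  vulnerabilities.any? { |v| v[:scan_type] == scan_type } ? :vulnerable : :clean
end

scans = [Scan.new(:sast, 101)] # a SAST scan ran; no DAST scan did
vulnerabilities = []           # the list is empty in both cases

report_state(scans, vulnerabilities, :sast) # => :clean
report_state(scans, vulnerabilities, :dast) # => :no_report
```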

Note: Previously, it was suggested this new model be called Report. This has been updated to Scan.

Further details

Benefit from having a dedicated record for scans:

Proposal

Previous proposal
  • store a new Report record in DB after each pipeline run, for each report type.
  • generated report records will stay forever (until data retention policy is defined)
  • as the raw JSON report artifact may expire, the Report record must store any data we want to keep after the artifact is removed
  • to create a report record when no artifact is uploaded (e.g. job failure), we need to identify which jobs in the pipeline were supposed to create a report artifact. This is being discussed here: https://gitlab.com/gitlab-org/gitlab-ee/issues/13662#note_210498199
  • consider having this record generic enough to support other types of reports (junit, performance, code quality, licenses), not only security ones.
  • consider migrating old data (parse artifacts from last X days and create corresponding report records)

Current proposal
  • Store a new Scan record in DB after each build, one for each Secure report type (SAST, DAST etc) exposed as an artifact during the build.
  • Each Scan record represents a scan that happened in a build. If there are multiple scans of the same type (SAST/DAST etc), then there should be multiple scan records. It is not an aggregated result.
  • When a build fails to produce the desired Secure report, a scan record will not be created in the database. This can be improved in future.
  • Generated Scan records will stay forever (until data retention policy is defined).
  • Scan records will not contain any aggregated information about the result: success, number of vulnerabilities, etc. This can be iterated on and improved later.
  • Scan records will not contain a reference to the Job Artifact, nor will they contain any information about the Job Artifact itself.
  • The scan is Secure specific. This should not affect other report types in GitLab (junit, performance, code quality, licenses).
  • Jobs with Secure report artifacts should be migrated to create Scan records. This is up for discussion if other migration strategies are discovered.
  • Once the Scan is released, a subsequent migration should occur that cleans up and completes the previous migration. The Scan model can be used only after this step.
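
The per-build rules for Scan records can be sketched in plain Ruby as follows. This is an illustration under stated assumptions, not the actual implementation: `Artifact`, `Build`, `scans_for`, and the `SECURE_REPORT_TYPES` list are hypothetical names, and the list only contains the scan types named in this issue.

```ruby
# Hypothetical stand-ins for illustration only (not GitLab code).
SECURE_REPORT_TYPES = %i[sast dast dependency_scanning].freeze

Artifact = Struct.new(:file_type)
Build = Struct.new(:id, :artifacts)
Scan = Struct.new(:scan_type, :build_id)

# One Scan per Secure report artifact exposed by the build.
# Other report types (e.g. junit) are ignored, and a build that
# produced no Secure report yields no Scan at all.
def scans_for(build)
  build.artifacts
       .select { |artifact| SECURE_REPORT_TYPES.include?(artifact.file_type) }
       .map { |artifact| Scan.new(artifact.file_type, build.id) }
end

builds = [
  Build.new(1, [Artifact.new(:sast), Artifact.new(:junit)]),
  Build.new(2, [Artifact.new(:sast)]),
  Build.new(3, []) # e.g. failed before uploading its report
]

# Two SAST builds give two separate records; nothing is aggregated.
scans = builds.flat_map { |build| scans_for(build) }
```

Here `scans` holds one SAST record for build 1 and one for build 2, and nothing for build 3.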

Proposed DB model

Previous proposal: the new Report model should have the following attributes (WIP):
  • report_type (sast, dast, dependency_scanning, etc.)
  • pipeline_id
  • status TBD (e.g. success, failure, missing, etc.)
  • errors
  • vulnerability_counts TBD (total?, per severity?)
  • scan settings TBD (e.g. env variables)

The new Scan model will have the following attributes:

  • scan_type (sast, dast, dependency_scanning, etc.)
  • build_id

Links / references

https://gitlab.slack.com/archives/C8S0HHM44/p1579567350023300

Implementation plan

  • Create a Security::Scan model
  • Add a worker that, on completion of a Job, creates a Security::Scan for each Job Artifact that is a Security Report
  • Add an index that will facilitate efficient migration of all of the previous Job Artifacts
  • Add a migration that will migrate Security Report Job Artifacts to Security::Scan
  • Release, then follow the migration to make sure it has completed successfully
  • Add a migration to steal the remaining migration jobs, i.e., synchronously wait for them to complete
  • Run database-lab queries to see how many records exist, and for which Secure scan types
  • Remove the temporary index created for the first migration
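
The backfill step of the plan could look roughly like the plain-Ruby sketch below. It assumes that a Scan is unique per build and scan type, so that the new worker and the migration can overlap without creating duplicates; that assumption, the `backfill_scans` helper, the structs, and the batch size are all hypothetical. The filter on `file_type` stands in for the query the temporary index is meant to serve.

```ruby
require 'set'

# Hypothetical stand-ins for illustration only (not GitLab code).
SECURE_REPORT_TYPES = %i[sast dast dependency_scanning].freeze

Artifact = Struct.new(:id, :job_id, :file_type)
Scan = Struct.new(:scan_type, :build_id)

# Walks existing artifacts in batches and returns the Scan records to create.
# Pairs of (build, scan type) that already have a Scan are skipped, so
# running the backfill again creates nothing new.
def backfill_scans(artifacts, existing_scans, batch_size: 2)
  seen = existing_scans.map { |scan| [scan.build_id, scan.scan_type] }.to_set
  created = []

  artifacts
    .select { |artifact| SECURE_REPORT_TYPES.include?(artifact.file_type) }
    .each_slice(batch_size) do |batch|
      batch.each do |artifact|
        key = [artifact.job_id, artifact.file_type]
        next unless seen.add?(key) # add? returns nil when the key is already present

        created << Scan.new(artifact.file_type, artifact.job_id)
      end
    end

  created
end

artifacts = [
  Artifact.new(1, 10, :sast),
  Artifact.new(2, 10, :junit),
  Artifact.new(3, 11, :dast),
  Artifact.new(4, 12, :sast)
]
existing = [Scan.new(:dast, 11)] # already created by the worker

created = backfill_scans(artifacts, existing)
```

In this example `created` contains SAST scans for builds 10 and 12 only; a second run with those records included in `existing_scans` returns an empty list.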
Edited by Cameron Swords