Dynamic Dependency Scanning jobs

Problem to solve

Today, Dependency Scanning (including CycloneDX SBOM generation) is implemented as a fixed set of CI jobs that rely on predefined Docker images. This approach has important limitations that block popular feature enhancements.

It is not possible to detect and scan multiple Java or Python projects in a monorepo, because a job can only scan one of these projects. Epic: Allow all Java and Python files to be scanned (&12315 - closed)
Scanning jobs can't dynamically switch to the Docker image that's most compatible with the repository, based on the build dependencies. For instance, it can't switch to python:3.11 (or any image based on it) after detecting that the project relies on Python 3.11. TODO: link to relevant issue(s)
Users can't override execution rules without breaking the default behaviors listed below. Issue: Improve extensibility of SAST, Dependency Scann... (#218444)
1. Job is triggered if and only if compatible files are detected.
2. Job switches to the FIPS image based on predefined CI variables.
In particular, users can't easily change job rules so that scanning jobs are only triggered when dependency files (detected or manually set) change.

(This applies to CycloneDX SBOM generation as well.)

Reminder: By design it's not possible to alter a CI pipeline and add new jobs to it after it's been created.

Challenges

The solution must be compatible with Scan Execution Policies.
It must be backward compatible and offer a reasonable migration path.
It should be consistent across all the product categories of Secure, and possibly beyond.
Users should be able to switch to manual configuration and to force the following:
- files to be scanned
- Docker image used for the scan
- command that runs the scan
- execution rules

Proposals

The proposals fall into the following categories:

Rely on existing tools: dynamic child pipelines.
- Pros: No change to the CI/CD YAML syntax or to the backend.
- Cons: It doesn't seem customizable.
Add keywords to the CI/CD YAML syntax.
- Pros: Customizable, user friendly, possibly backward compatible, ships w/ GitLab itself.
- Cons: Significant backend change.
Run scans outside of a CI pipeline.
- Pros: Very flexible.
- Cons: Large change.

Proposal A

Extend the parallel:matrix keyword of the CI/CD YAML syntax to create a matrix of Dependency Scanning jobs based on what the backend has detected.

Pros

customizable

Cons

It's assumed that the backend can detect everything w/o running a CI job.

Proposal B

Add a new dependency-scanning keyword to the CI/CD YAML syntax. This represents the Dependency Scanning jobs, and is expanded to multiple jobs based on what the backend has detected.

Pros

customizable

Cons

Compared to Proposal A, it's a bigger syntax change.
It's assumed that the backend can detect everything w/o running a CI job.

Proposal C

Replace CI templates with CI config generators. Generators would be included just like templates, but their contents would be generated by the backend.

Pros

Compared to A & B, it doesn't add keywords specific to Dependency Scanning to the YAML syntax.
Overall this is very generic.

Cons

Compared to A & B, it's possibly a much larger change.
Jobs can be customized only as long as users can predict the automatically generated job names. It relies on conventions and is less explicit than A & B.

Proposal D

Extend Scan Execution Policies' processor. SEP would delegate to a CI config generator specific to Dependency Scanning.

The DS CI config generator would be similar to the one proposed in proposal C, but we wouldn't have to extend the CI/CD YAML syntax to allow users to include it; this could be implemented later on.

Pros

We reuse code, and it's a much smaller change than Proposal F (running scans outside of a pipeline).

Cons

Users must enable Scan Execution Policies. Right now this involves creating a project to keep the policies, so it's not a lightweight process.
We radically change the scope of the SecurityOrchestrationPolicies::Processor. It would support features owned by both group::security policies and group::composition analysis. This might have a negative impact on velocity.

Proposal E

Introduce a detection job that generates a CI config, and trigger a dynamic child pipeline.

This depends on #421564.
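
The detection step could be sketched as follows. This is a minimal illustration, not the proposed implementation: the file-name mapping, the `DS_PROJECT_DIR` variable, the job naming convention, and the placeholder `script` are all assumptions made for the example; only the two analyzer job names come from this issue. The config is rendered as JSON on the assumption that the child pipeline accepts it as YAML.

```python
import json
from pathlib import PurePosixPath

# Hypothetical mapping: dependency file name -> analyzer job name prefix.
ANALYZER_BY_FILENAME = {
    "requirements.txt": "gemnasium-python-dependency_scanning",
    "pom.xml": "gemnasium-maven-dependency_scanning",
}


def child_pipeline_config(repo_files):
    """Build a child pipeline config with one scanning job per detected project."""
    jobs = {}
    for path in sorted(repo_files):
        posix_path = PurePosixPath(path)
        analyzer = ANALYZER_BY_FILENAME.get(posix_path.name)
        if analyzer is None:
            continue  # not a dependency file known to this sketch
        project_dir = str(posix_path.parent)
        # Hypothetical naming convention: "<analyzer>:<project directory>".
        jobs[f"{analyzer}:{project_dir}"] = {
            "variables": {"DS_PROJECT_DIR": project_dir},  # hypothetical variable
            "script": [f"echo 'scan {path}'"],  # placeholder for the real scan command
        }
    return jobs


if __name__ == "__main__":
    detected = ["services/api/requirements.txt", "services/web/pom.xml", "README.md"]
    # The detection job would save this output as an artifact for the trigger job.
    print(json.dumps(child_pipeline_config(detected), indent=2))
```

Because the job names are derived from the detected paths, users who want to override a generated job would need to know this convention in advance, which is one way to read the "not customizable" concern below.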

Pros

Fits in the CI.

Cons

not customizable
not backward compatible
not the best visualization

Proposal F

Run Dependency Scanning outside of a pipeline, possibly using the CI infrastructure.

Pros

It removes many technical limitations.

Cons

There's a lot to design and implement. We essentially start from scratch.
UI needs to be defined.
not compatible with Scan Execution Policies
not customizable
not backward compatible

Proposal G

The proposal is twofold:

Introduce new predefined CI/CD variables.
- *_DEPENDENCY_FILES: JAVA_DEPENDENCY_FILES, PYTHON_DEPENDENCY_FILES, etc.
- *_VERSION: JAVA_VERSION, PYTHON_VERSION, etc.
Implement "Support variable expansion in parallel matrix j..." (#381603) or "Parallel CI jobs from file globs" (#356273).

It's then possible to have the following jobs using the parallel:matrix syntax:

a gemnasium-python-dependency_scanning job per item of PYTHON_DEPENDENCY_FILES, and using an image named after PYTHON_VERSION.
a gemnasium-maven-dependency_scanning job per item of JAVA_DEPENDENCY_FILES, and using an image named after JAVA_VERSION.
a gemnasium-dependency_scanning job for all other dependency files; we would use variable expansion like $GO_DEPENDENCY_FILES,$RUBY_DEPENDENCY_FILES,...
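
To illustrate the intended expansion, here is a minimal sketch of how these variables could turn into jobs. It assumes that each `*_DEPENDENCY_FILES` variable holds a comma-separated list of paths and that images are tagged with the language version. Both are assumptions, as are the `python` image repository, the `DEPENDENCY_FILE` and `DEPENDENCY_FILES` job variables, and the helper function names.

```python
def split_files(value):
    """Split a comma-separated variable value, dropping empty entries."""
    return [item.strip() for item in value.split(",") if item.strip()]


def per_file_jobs(job_name, image_repo, version, files_value):
    """One job per dependency file, all using an image named after the version."""
    return [
        {
            "name": f"{job_name}: [{path}]",
            "image": f"{image_repo}:{version}",
            "variables": {"DEPENDENCY_FILE": path},  # hypothetical variable
        }
        for path in split_files(files_value)
    ]


def catch_all_job(job_name, *files_values):
    """A single job covering every remaining dependency file."""
    # Mirrors "$GO_DEPENDENCY_FILES,$RUBY_DEPENDENCY_FILES,...": a variable with
    # no files leaves an empty entry, so empty entries are filtered out.
    files = split_files(",".join(files_values))
    return {"name": job_name, "variables": {"DEPENDENCY_FILES": ",".join(files)}}


if __name__ == "__main__":
    env = {
        "PYTHON_DEPENDENCY_FILES": "api/requirements.txt,worker/requirements.txt",
        "PYTHON_VERSION": "3.11",
        "GO_DEPENDENCY_FILES": "",
        "RUBY_DEPENDENCY_FILES": "Gemfile.lock",
    }
    python_jobs = per_file_jobs(
        "gemnasium-python-dependency_scanning",
        "python",
        env["PYTHON_VERSION"],
        env["PYTHON_DEPENDENCY_FILES"],
    )
    for job in python_jobs:
        print(job["name"], job["image"])
    print(catch_all_job(
        "gemnasium-dependency_scanning",
        env["GO_DEPENDENCY_FILES"],
        env["RUBY_DEPENDENCY_FILES"],
    ))
```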

Pros

no changes to the CI/CD YAML syntax

Cons

It's highly coupled to the three existing scanning jobs.
Images have to be named after JAVA_VERSION and PYTHON_VERSION.
It doesn't scale in complexity. For instance, let's imagine that we want to select the image based on the language version AND the package manager version. Then we would have to maintain a matrix of images, instead of implementing a simple switch in the backend.

Proposal G2

Similar to Proposal G, but introduce predefined CI variables that contain the image name to be used by each analyzer:

GEMNASIUM_IMAGE
GEMNASIUM_MAVEN_IMAGE
GEMNASIUM_PYTHON_IMAGE
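
One way to picture this: the backend resolves the image itself and exposes only the result to the job. The sketch below is an illustration under assumptions; the image names, the lookup table, the fallback image, and the function are hypothetical, and only the `GEMNASIUM_PYTHON_IMAGE` variable name comes from this proposal.

```python
# Hypothetical image table keyed by (language version, package manager).
PYTHON_IMAGES = {
    ("3.11", "pip"): "example/gemnasium-python:3.11-pip",
    ("3.11", "poetry"): "example/gemnasium-python:3.11-poetry",
    ("3.9", "pip"): "example/gemnasium-python:3.9-pip",
}
# Hypothetical fallback used when no specific image matches.
DEFAULT_PYTHON_IMAGE = "example/gemnasium-python:latest"


def predefined_image_variables(python_version, package_manager):
    """Return the predefined variables the backend would expose to the job."""
    image = PYTHON_IMAGES.get((python_version, package_manager), DEFAULT_PYTHON_IMAGE)
    return {"GEMNASIUM_PYTHON_IMAGE": image}


if __name__ == "__main__":
    print(predefined_image_variables("3.11", "poetry"))
    print(predefined_image_variables("3.12", "pip"))  # falls back to the default
```

The scanning job could then reference `$GEMNASIUM_PYTHON_IMAGE` as its image, so selecting on an additional criterion such as the package manager version only changes the backend lookup and not the job definition.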

Pros

Compared to G, this is more flexible.

Cons

Compared to G, the new predefined CI variables are coupled to the analyzers.

Edited Jan 31, 2024 by Fabien Catteau