Commit b3a98e91 authored by Olivier Gonzalez (parent a6fb160c)

Update Dependency Scanning analyzer ADD and add ADRs for vulnerability scanning, dependency resolution, and manifest scanning
---
title: "Dependency Scanning Analyzer"
status: ongoing
creation-date: "2024-08-14"
authors: [ "@hacks4oats", "@gonzoyumo" ]
coaches: [ ]
dris: [ "@johncrowley", "@thiagocsf", "@nilieskou" ]
owning-stage: "~devops::application security testing"
participating-stages: []
# Hides this page in the left sidebar. Recommended so we don't pollute it.
toc_hide: true
---

## Summary

The dependency scanning feature has historically been powered by a set of analyzers - `gemnasium`, `gemnasium-maven`, and `gemnasium-python`. Associated with CI templates, these analyzers are responsible for detecting supported projects, building the dependency graph or list when needed, parsing the detected dependencies, and finally producing a security report with detected vulnerabilities alongside a CycloneDX SBOM that contains the dependencies. This approach has worked well, but over time it's become evident that building a project's dependency graph exports comes with a lot of complexity. This complexity negatively impacts the maintenance and creation of features, and the user experience of setting up and maintaining the dependency scanning analyzer.

To address these challenges, we are redesigning the dependency scanning analyzer to follow a multi-tiered approach that balances accuracy with ease of use. This document outlines the overall vision and architecture of the new analyzer, while specific implementation decisions are documented in the [Architectural Decision Records (ADRs)](#decisions) section.

## Motivation


The high maintenance cost associated with building the dependency graph/list exports has pushed us to rethink how we structure the dependency scanning feature. Instead of building the project dependency graphs or lists on behalf of customers and within the analyzer, we can delegate this responsibility to a job that runs before the analyzer does. A build stage is a very common part of the development cycle, and generating the dependency artifacts during this stage is a lot simpler than mapping existing build system configuration values to the ones used by the gemnasium set of analyzers. We initially considered deferring this entirely to users (see [ADR 001: Graph Export Only](./decisions/001_graph_export_only.md)), but customer feedback and other challenges eventually forced us to revisit this design.

## Goals

- Provide a simplified, maintainable analyzer that reduces the attack surface and maintenance burden
- Support multiple dependency detection strategies to accommodate different project configurations
- Enable out-of-the-box dependency scanning for projects with committed lockfiles or graphfiles
- Support automatic dependency resolution for projects that require build steps
- Provide a fallback mechanism for projects without pre-generated dependency artifacts
- Reduce security maintenance costs by eliminating bundled runtimes and package managers from the analyzer image
- Remove historical limitations such as single-project analysis for Java and Python monorepos

## Non-Goals

- Supporting 3rd party SBOM generators. We can still support this in a future iteration.


## Design and implementation details




### Design Principles

- **Separation of Concerns**: Dependency detection (what components exist) is separated from vulnerability analysis (which components have vulnerabilities)
- **Minimal Image Footprint**: The analyzer image contains only the scanning logic, not build tools or runtimes
- **Flexibility**: Different projects can use different dependency detection strategies based on their needs

### Dependency detection

The new dependency scanning analyzer follows a multi-tiered approach to dependency detection, providing flexibility while maintaining accuracy.

For more details on the dependency detection approach, including the service-based resolution pattern and manifest parsing implementation, see [ADR 003: Dependency Resolution and Manifest Scanning](./decisions/003_dependency_resolution_and_manifest_scanning.md).
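The tier selection described below can be thought of as a fallback chain from highest to lowest accuracy. The following is a minimal sketch of that idea only; the `Strategy` type, the file sets, and the selection function are illustrative assumptions, not the analyzer's actual detection logic.

```python
from enum import Enum
from pathlib import Path

class Strategy(Enum):
    LOCKFILE = 1           # Tier 1: consume committed lockfiles/graphfiles
    RESOLVE = 2            # Tier 2: trigger automatic dependency resolution
    MANIFEST_FALLBACK = 3  # Tier 3: parse manifests directly

# Illustrative file sets; the real analyzer supports many more patterns.
LOCKFILES = {"go.sum", "Gemfile.lock", "package-lock.json", "gradle.lockfile"}
MANIFESTS = {"go.mod", "Gemfile", "package.json", "build.gradle", "requirements.txt"}

def select_strategy(project_dir: str, resolution_enabled: bool = True) -> Strategy:
    """Pick the highest-accuracy tier available for a project directory."""
    names = {p.name for p in Path(project_dir).rglob("*")}
    if names & LOCKFILES:
        return Strategy.LOCKFILE
    if resolution_enabled and names & MANIFESTS:
        return Strategy.RESOLVE
    return Strategy.MANIFEST_FALLBACK
```

The point of the sketch is the ordering: a committed lockfile always wins, resolution is attempted only when it is enabled and a recognizable build file exists, and manifest parsing is the last resort.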

#### Tier 1: Lockfile/Graphfile Present (Highest Accuracy)

When projects have committed or pre-generated lockfiles or graphfiles, the analyzer consumes them directly. This provides the most accurate dependency information with minimal processing overhead.

#### Tier 2: Automatic Dependency Resolution

For projects that require build steps to generate dependency artifacts, the analyzer supports automatic dependency resolution through preceding CI jobs that run in the `.pre` stage. These jobs:

- Use ecosystem-native tools (Maven, Gradle, Python's `uv`) in vanilla public images
- Run the Dependency Scanning analyzer as a CI service to provide the necessary detection logic and generate the instructions for dependency resolution
- Execute these instructions to produce lockfiles or graphfiles and export them as artifacts for the DS analyzer CI job to consume

This approach avoids bundling multiple runtimes and package managers into the analyzer image, reducing maintenance burden and security surface area.
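To illustrate the "instructions" hand-off conceptually, the detection step could map each recognized build file to a native command and the artifact it is expected to produce. The rule format, the helper function, and the `maven.graph` artifact name below are hypothetical; only `go mod graph` and `mvn dependency:tree` are real tool invocations.

```python
from dataclasses import dataclass

@dataclass
class ResolutionInstruction:
    build_file: str  # file whose presence triggers this rule
    command: str     # native tool invocation run by the .pre-stage job
    artifact: str    # expected output consumed by the DS analyzer job

# Illustrative rules only; the actual instruction format is defined by the analyzer.
RULES = [
    ResolutionInstruction("go.mod", "go mod graph > go.graph", "go.graph"),
    ResolutionInstruction("pom.xml", "mvn dependency:tree -DoutputFile=maven.graph", "maven.graph"),
]

def instructions_for(detected_files: set[str]) -> list[ResolutionInstruction]:
    """Return the resolution steps a .pre-stage job should execute."""
    return [r for r in RULES if r.build_file in detected_files]
```

Because the commands run in vanilla ecosystem images, the analyzer image itself never needs the corresponding runtimes installed.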

#### Tier 3: Manifest Parsing Fallback (Lowest Accuracy)

When neither lockfiles nor graphfiles are available, the analyzer can parse dependency manifests directly to extract minimal dependency information. This provides basic coverage for projects without pre-generated artifacts, though with lower accuracy and completeness than lockfiles, since a manifest cannot capture transitive dependencies or the exact versions in use.
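To make the accuracy limitation concrete: a requirements-style manifest lists only direct dependencies, often as version ranges. The following is a minimal sketch assuming a pip `requirements.txt`-like format; it is not the analyzer's parser, and the output shape is illustrative.

```python
import re

# Matches lines like "requests>=2.31" or "flask==3.0.0". Transitive
# dependencies are simply not visible in a manifest, so they cannot appear.
REQ = re.compile(r"^\s*([A-Za-z0-9._-]+)\s*(==|>=|<=|~=|>|<)?\s*([\w.*]+)?")

def parse_requirements(text: str) -> list[dict]:
    components = []
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blanks and comments
        m = REQ.match(line)
        if m:
            name, op, version = m.groups()
            components.append({
                "name": name,
                # Only an exact pin yields a usable version; a range leaves
                # the resolved version unknown (lower accuracy than a lockfile).
                "version": version if op == "==" else None,
            })
    return components
```

A lockfile-based Tier 1 scan would report the exact resolved versions of both direct and transitive dependencies; here, anything that is not pinned degrades to an unknown version.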

### Vulnerability Scanning

The analyzer integrates vulnerability scanning directly into the CI pipeline, providing immediate security feedback to developers. After generating CycloneDX SBOMs from detected dependencies, the analyzer:

1. **Uploads SBOMs to the GitLab SBOM Scan API**: The generated SBOM files are sent to GitLab's backend vulnerability scanning service
2. **Polls for scan results**: The analyzer waits for the backend to complete vulnerability analysis using the unified GitLab SBOM Vulnerability Scanner
3. **Aggregates findings**: Results from multiple SBOMs are combined into a single security report
4. **Generates security report**: A standardized GitLab dependency scanning report is produced with detected vulnerabilities

This approach maintains separation of concerns by delegating the actual vulnerability detection logic to the unified Dependency Scanning engine using the [GitLab SBOM Vulnerability Scanner](../dependency_scanning_engine/decisions/001_gitlab_sbom_vulnerability_scanner.md), while the analyzer handles orchestration and result aggregation.

For more details on the vulnerability scanning implementation, including error handling strategies, retry logic, and the concurrent processing model, see [ADR 002: Vulnerability Scanning using SBOM Scan API](./decisions/002_vulnerability_scanning.md).
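The upload, poll, and aggregate steps above can be sketched as a small control loop. The transport is stubbed out deliberately: the SBOM Scan API surface is defined by the GitLab backend, so the status values, payload shapes, and callables here are assumptions for illustration only.

```python
import time
from typing import Callable

def scan_sboms(sboms: list[dict],
               upload: Callable[[dict], str],
               poll: Callable[[str], dict],
               interval: float = 0.0,
               max_polls: int = 30) -> list[dict]:
    """Upload each SBOM, poll its scan to completion, aggregate findings.

    `upload` returns a scan id; `poll` returns {"status": ..., "findings": [...]}.
    Both are stand-ins for the backend-defined SBOM Scan API calls.
    """
    findings: list[dict] = []
    for sbom in sboms:
        scan_id = upload(sbom)
        for _ in range(max_polls):
            result = poll(scan_id)
            if result["status"] == "completed":
                findings.extend(result["findings"])
                break
            if result["status"] == "failed":
                break  # a real analyzer would surface or retry this error
            time.sleep(interval)
    # Aggregation: one combined report across all SBOMs, deduplicated by id.
    seen, report = set(), []
    for f in findings:
        if f["id"] not in seen:
            seen.add(f["id"])
            report.append(f)
    return report
```

The design choice shown is that the analyzer only orchestrates: all vulnerability matching happens behind the polled API, and the analyzer's job is bounded waiting plus deduplicated aggregation into a single report.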

## Decisions

- [ADR 001: Graph Export Only](./decisions/001_graph_export_only.md) - Documents the initial vision of supporting only lockfiles and graphfiles
- [ADR 002: Vulnerability Scanning using SBOM Scan API](./decisions/002_vulnerability_scanning.md) - Documents the decision to reintroduce vulnerability scanning capabilities within the DS analyzer
- [ADR 003: Dependency Resolution and Manifest Scanning](./decisions/003_dependency_resolution_and_manifest_scanning.md) - Documents the approach with automatic dependency resolution and manifest parsing fallback

## Appendix

- [dependency graph export](https://docs.gitlab.com/ee/user/application_security/terminology/#dependency-graph-export)
- [package manager](https://docs.gitlab.com/ee/user/application_security/terminology/#package-managers)
- [lock file](https://docs.gitlab.com/ee/user/application_security/terminology/#lock-file)

## References

- [Bring security scan results back into the Dependency Scanning CI job Epic](https://gitlab.com/groups/gitlab-org/-/work_items/17150)
- [Dependency Resolution Epic](https://gitlab.com/groups/gitlab-org/-/work_items/20461)
- [Manifest scanning Epic](https://gitlab.com/groups/gitlab-org/-/work_items/20457)
- [Dependency Scanning Engine](../dependency_scanning_engine/_index.md)
- [Dependency Scanning Engine ADR003: SBOM-based CI Pipeline Scanning](../dependency_scanning_engine/decisions/003_sbom_based_scans_for_ci_pipelines.md)
