Spike: High level design candidates

Goal

This spike aims to investigate high-level design approaches for integrating malware package detection into the GitLab platform.

Product requirements are still being finalized. However, we have identified the following key constraints:

  • Malware vulnerability data must be restricted to users with an additional paid subscription (offered as an add-on).
  • Malware advisories should be private and not freely downloadable, unlike our current CVE advisory data.

Abbreviations

GCP - Google Cloud Platform

GLAD - GitLab Advisory Database

PMDB - Package Metadata Database

CVS - Continuous Vulnerability Scanning

DS - Dependency Scanning

SA - Service Account (refers to GCP)

Current situation for advisories

The ~"group::vulnerability research" team maintains the GitLab Advisory Database (GLAD), which stores CVE advisories in a public Git repository. These advisories are also accessible via advisories.gitlab.com. Currently, the same team also ingests malware advisories into GLAD and treats them identically to standard CVE data, so Ultimate customers already receive malware protection.

The Package Metadata Database (PMDB) handles advisory ingestion and distribution on an hourly schedule. The process consists of three stages:

  1. Ingestion: A feeder service retrieves new advisories from GLAD
  2. Processing: A processor service validates and stores advisories in the database
  3. Export: An exporter service publishes new advisories to a public bucket
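
The three stages above can be sketched as a small in-memory pipeline. This is only an illustration under stated assumptions: the `Advisory` fields, the `Pipeline` class, and the dictionary shape of a GLAD entry are hypothetical and do not reflect the real PMDB services or schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Advisory:
    # Hypothetical minimal shape; the real PMDB schema is not described here.
    identifier: str
    package: str
    affected_versions: tuple[str, ...]


@dataclass
class Pipeline:
    database: dict[str, Advisory] = field(default_factory=dict)
    bucket: list[Advisory] = field(default_factory=list)  # stands in for the public bucket

    def feed(self, glad_entries: list[dict]) -> list[dict]:
        """Ingestion: keep only entries that are not stored yet."""
        return [e for e in glad_entries if e.get("identifier") not in self.database]

    def process(self, entries: list[dict]) -> list[Advisory]:
        """Processing: validate each entry and store the valid ones."""
        stored = []
        for entry in entries:
            if not entry.get("identifier") or not entry.get("package"):
                continue  # invalid entry, skip it
            if entry["identifier"] in self.database:
                continue  # duplicate within the same batch
            advisory = Advisory(
                entry["identifier"],
                entry["package"],
                tuple(entry.get("affected_versions", ())),
            )
            self.database[advisory.identifier] = advisory
            stored.append(advisory)
        return stored

    def export(self, new_advisories: list[Advisory]) -> int:
        """Export: publish newly stored advisories to the bucket."""
        self.bucket.extend(new_advisories)
        return len(new_advisories)

    def run_once(self, glad_entries: list[dict]) -> int:
        """One scheduled run: feed, then process, then export."""
        return self.export(self.process(self.feed(glad_entries)))
```

Running `run_once` twice with the same entries exports them only once, which mirrors the "new advisories" wording in each stage.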

Every Ultimate GitLab instance polls the public GCP bucket every 5 minutes for new advisory data. When updates are found, they are stored in the GitLab database. The Vulnerability Engine uses these advisories to match SBOM components against known CVEs. This engine powers both the Dependency Scanning analyzer and Container Scanning.

Dependency scanning analyzer flow:

  1. A Dependency Scanning job is triggered in a pipeline
  2. The analyzer identifies project components and sends them to the GitLab backend via API
  3. The Vulnerability Engine matches components against advisories and returns detected vulnerabilities
  4. The job generates a Vulnerability Report and creates SBOM files
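
Step 3 of this flow, matching components against advisories, could look roughly like the sketch below. All names (`Component`, `Advisory`, `match_components`) are hypothetical, and matching on an exact set of affected versions is a simplification; a real engine would evaluate version ranges.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    purl_type: str  # e.g. "npm"
    name: str
    version: str


@dataclass(frozen=True)
class Advisory:
    identifier: str
    purl_type: str
    name: str
    affected_versions: frozenset[str]  # simplification: exact versions, not ranges


def match_components(
    components: list[Component], advisories: list[Advisory]
) -> list[tuple[Component, str]]:
    """Return (component, advisory identifier) pairs for every affected component."""
    # Index advisories by package so each component needs only one lookup.
    index: dict[tuple[str, str], list[Advisory]] = {}
    for advisory in advisories:
        index.setdefault((advisory.purl_type, advisory.name), []).append(advisory)

    findings = []
    for component in components:
        for advisory in index.get((component.purl_type, component.name), []):
            if component.version in advisory.affected_versions:
                findings.append((component, advisory.identifier))
    return findings
```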

CVS flow:

  1. When a new advisory is added to the database, an event triggers CVS
  2. CVS identifies all affected SBOM components across projects
  3. A vulnerability is created for each affected component
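
Unlike the analyzer flow, CVS starts from the advisory and looks for components. A minimal sketch of that inversion follows, assuming hypothetical `SbomOccurrence` and `Vulnerability` records and, again, exact version matching instead of version ranges.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SbomOccurrence:
    project_id: int
    package: str
    version: str


@dataclass(frozen=True)
class Vulnerability:
    project_id: int
    advisory_id: str
    package: str
    version: str


def on_advisory_added(
    advisory_id: str,
    package: str,
    affected_versions: list[str],
    occurrences: list[SbomOccurrence],
) -> list[Vulnerability]:
    """Handle a hypothetical 'advisory added' event.

    Scans stored SBOM occurrences across all projects and creates one
    vulnerability per affected occurrence.
    """
    affected = set(affected_versions)
    return [
        Vulnerability(o.project_id, advisory_id, o.package, o.version)
        for o in occurrences
        if o.package == package and o.version in affected
    ]
```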

PMDB is owned by the ~"group::composition analysis" team.

Finally, the ~"group::security insights" team reads SBOM occurrence and vulnerability data from the database and presents it in the Dependency List, Vulnerability Report, MR widgets, MR pipelines, etc.

image

Key challenges

We need to address three primary concerns:

  • Ensuring malware advisories remain confidential
  • Restricting malware advisory access to users with an active add-on subscription
  • Handling offline users

Key requirements

  • Both Ultimate and Premium users can buy the add-on.
  • All add-on users should be able to access an API to query malware advisory data. This data can be fed to a third-party system or used by GitLab's Dependency Firewall to block malicious packages.
  • Ultimate add-on users should be able to see malware vulnerabilities in the vulnerability report and malware packages in the dependency list.
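
These requirements amount to a small access matrix over tier and add-on status. The sketch below encodes it; the `Tier` enum, the function name, and the capability keys are hypothetical, and treating the Free tier as unable to hold the add-on is an assumption drawn from the first requirement.

```python
from enum import Enum


class Tier(Enum):
    FREE = "free"
    PREMIUM = "premium"
    ULTIMATE = "ultimate"


def malware_capabilities(tier: Tier, has_addon: bool) -> dict[str, bool]:
    """Map tier and add-on status to the malware features an instance may use."""
    # Assumption: only Premium and Ultimate can hold the add-on.
    addon_active = has_addon and tier in (Tier.PREMIUM, Tier.ULTIMATE)
    ultimate_with_addon = addon_active and tier is Tier.ULTIMATE
    return {
        "advisory_api": addon_active,
        "vulnerability_report": ultimate_with_addon,
        "dependency_list": ultimate_with_addon,
    }
```

For example, a Premium instance with the add-on gets `advisory_api` only, while an Ultimate instance with the add-on gets all three.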

Proposal 1: Private PMDB Bucket - Store malware data in a private bucket

In this approach, the ~"group::vulnerability research" team maintains a private Git repository containing malware advisories exclusively. The Package Metadata Database (PMDB) ingests these advisories and stores them in a separate private bucket. GitLab instances with an active add-on subscription can authenticate and sync data from this private bucket. Once synced, malware advisories are stored in the GitLab database in a dedicated table. The Vulnerability Engine checks for add-on entitlement before including malware advisories in vulnerability detection.

image

GitLab instances require credentials to access the private GCP bucket. These credentials can be either JWT tokens or service accounts with read-only permissions. The primary challenge is securely distributing these credentials to instances with active add-on subscriptions.

Distribution Strategies

Option 1: Per-Instance Service Accounts. Generate a unique service account for each GitLab instance. This provides strong isolation, but it introduces significant operational overhead, requires ongoing maintenance, and still carries the risk of credential exposure (though with a limited blast radius).

Option 2: Shared Service Account. Create a single service account for all add-on instances. This simplifies credential management and key rotation, but it makes secure distribution harder and creates a critical dependency: if the service account is deleted in GCP, recovery becomes difficult.

Option 3: Token Broker Service (Recommended). Implement a separate broker service that GitLab instances contact with their own credentials to request short-lived JWTs for GCP access. Only the broker service holds the GCP service account credentials. This approach is more secure and follows the principle of least privilege, but it increases operational complexity and maintenance burden.
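
To make the broker idea concrete, here is a minimal sketch of issuing and verifying a short-lived token. It is not a real JWT and does not call GCP: an HMAC-signed payload stands in for whatever credential the broker would actually mint, and `BROKER_SECRET`, the scope string, the 5-minute TTL, and the `has_active_addon` flag are all hypothetical.

```python
from __future__ import annotations

import base64
import hashlib
import hmac
import json
import time

BROKER_SECRET = b"demo-only-secret"  # hypothetical; held only by the broker
TOKEN_TTL_SECONDS = 300              # hypothetical short lifetime


def _b64encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(payload: bytes) -> bytes:
    return hmac.new(BROKER_SECRET, payload, hashlib.sha256).digest()


def issue_token(instance_id: str, has_active_addon: bool, now: float | None = None) -> str | None:
    """Issue a short-lived read token, or None if the instance lacks the add-on."""
    if not has_active_addon:
        return None
    issued_at = int(time.time() if now is None else now)
    payload = json.dumps(
        {
            "sub": instance_id,
            "scope": "malware-advisories:read",
            "exp": issued_at + TOKEN_TTL_SECONDS,
        },
        sort_keys=True,
    ).encode()
    return f"{_b64encode(payload)}.{_b64encode(_sign(payload))}"


def verify_token(token: str, now: float | None = None) -> bool:
    """Check the signature first, then the expiry."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = _b64decode(payload_b64)
        signature = _b64decode(signature_b64)
    except ValueError:  # wrong number of parts or invalid base64
        return False
    if not hmac.compare_digest(signature, _sign(payload)):
        return False
    current = time.time() if now is None else now
    return json.loads(payload)["exp"] > current
```

The point of the design is visible in the sketch: instances never see `BROKER_SECRET`, and a leaked token stops working once `exp` has passed.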

Proposal 2: Public PMDB Bucket - Store malware data publicly but restrict processing to add-on users

Malware advisories can be stored in a public bucket and processed only when the add-on is enabled. This can be implemented in two ways:

  • Unified Bucket with Malware Identifiers: Store malware advisories in the same bucket as CVE data but with distinct identifiers. The Vulnerability Engine processes all advisories when the add-on is enabled, and filters out malware advisories when it is disabled.
  • Separate Public Bucket: Maintain malware advisories in a dedicated public bucket. GitLab instances only sync from this bucket if the add-on is active.

  • Very easy to implement.
  • Eliminates the need for authentication.
  • However, it does not meet the requirement of keeping malware advisories confidential.
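
For the unified-bucket variant, the filtering step might look like the sketch below. The `MALWARE-` identifier prefix and the dictionary shape of an advisory are hypothetical; the proposal only says that malware advisories carry distinct identifiers.

```python
MALWARE_PREFIX = "MALWARE-"  # hypothetical marker for malware advisories


def visible_advisories(advisories: list[dict], addon_enabled: bool) -> list[dict]:
    """Return the advisories the Vulnerability Engine should process."""
    if addon_enabled:
        return list(advisories)
    # Add-on disabled: drop malware advisories and keep the CVE data.
    return [a for a in advisories if not a["identifier"].startswith(MALWARE_PREFIX)]
```

The filter runs after the data has already been downloaded, which is why this proposal cannot keep malware advisories confidential.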

Proposal 3: Private Git Repository with Conditional Sync - Maintain malware advisories in a private repository and sync only for subscribed instances

Another approach is to skip PMDB and store the data in a private Git repository. GitLab instances that have the add-on enabled would be able to read the repository.

  • Authentication might be simpler.
  • The private repository must contain data in the PMDB format.
  • GitLab needs a new sync mechanism that largely replicates what PMDB already does.
  • Malware advisories become package metadata that is not handled by PMDB.

image

/cc @rvider @ajbiton @dabeles @mmishaev

Edited by Nick Ilieskou