Reduce sbom_occurrences table growth

Why are we doing this work

Currently, we are essentially using sbom_occurrences as an append-only log and this doesn't scale. When enabling the cyclonedx_sbom_ingestion feature flag, we saw approximately 1 million sbom_occurrences created per hour. This is because we're storing all the occurrences created in each pipeline. We should abandon the append-only log approach.

Instead, we should add columns tracking when the dependency appeared. For example:

column type description
latest_default_branch_pipeline_id bigint Latest pipeline on the default branch where this dependency appeared
present_on_default_branch bool True if this dependency is currently present on the default branch

This allows us to fullfil the current project requirements, which is to provide the same functionality as the existing dependency list. The dependencies which appear in the list will be the ones that have present_on_default_branch = true and the latest_default_branch_pipeline_id can be used to show the last successful scan. In the future, it we would like to show dependencies on other branches, we will need to come up with a different solution for that.

Relevant links

Non-functional requirements

  • Documentation:
  • Feature flag:
  • Performance:
    • Normalising will increase the amount of joins we have to do. Test the potential impact.
  • Testing:

Implementation plan

  1. Write database migrations to:
    1. Delete all existing rows from sbom_occurrences (DB lab says this would take around 7 seconds)
    2. Remove index_sbom_occurrences_on_ingestion_attributes
    3. Create a new unique index on %i[project_id component_id component_version_id source_id]
  2. Update the ingestion tasks:
    1. Update the ingestion pipeline so that Sbom::Ingestion::IngestReportsService receives a list of all the sbom_occurrence ids (ingested_ids) that were present in the ingested reports.
    2. Pass the ingested_ids to a new service Sbom::Ingestion::PostIngestionService
  3. Have the PostIngestionService iterate through all the occurrences in the given project.
  4. Delete the records which aren't present in ingested_ids in batches of 100.
  5. Update the pipeline_id and commit_sha of the already-existing records to that of the current pipeline

Verification steps

Auto-Summary 🤖

Discoto Usage

Points

Discussion points are declared by headings, list items, and single lines that start with the text (case-insensitive) point:. For example, the following are all valid points:

  • #### POINT: This is a point
  • * point: This is a point
  • + Point: This is a point
  • - pOINT: This is a point
  • point: This is a **point**

Note that any markdown used in the point text will also be propagated into the topic summaries.

Topics

Topics can be stand-alone and contained within an issuable (epic, issue, MR), or can be inline.

Inline topics are defined by creating a new thread (discussion) where the first line of the first comment is a heading that starts with (case-insensitive) topic:. For example, the following are all valid topics:

  • # Topic: Inline discussion topic 1
  • ## TOPIC: **{+A Green, bolded topic+}**
  • ### tOpIc: Another topic

Quick Actions

Action Description
/discuss sub-topic TITLE Create an issue for a sub-topic. Does not work in epics
/discuss link ISSUABLE-LINK Link an issuable as a child of this discussion

Last updated by this job

Discoto Settings
---
summary:
  max_items: -1
  sort_by: created
  sort_direction: ascending

See the settings schema for details.

Edited by Brian Williams